Download - degit.sam.sdu.dkdegit.sam.sdu.dk/papers/degit_13/c013_027.pdf · The WTO Trade Eﬁect Pao-Li Chang⁄ School of Economics Singapore Management University Singapore 178903 Myoung-Jae

The WTO Trade Effect

Pao-Li Chang∗

School of Economics

Singapore Management University

Singapore 178903

Myoung-Jae Lee†

Department of Economics

Korea University

Seoul, Korea

August 1, 2008

Abstract

This paper proposes to reexamine the trade effect of GATT/WTO based on non-

parametric econometric techniques. Our estimation framework uses the simplest gravity

model that explains bilateral trade volumes with country sizes and trade resistance, with-

out imposing parametric assumptions typically made in gravity-type econometric models.

Specifically, matching estimators coupled with permutation tests and signed-rank tests

are employed to obtain point and interval estimates. While observable covariates (country

sizes and determinants of trade resistance) are controlled for nonparametrically, unob-

servable selection bias is dealt with by a sensitivity analysis. The results suggest large

GATT/WTO effects on trade in both cross-sectional and time-series dimensions, and the

results hold regardless of the GATT/WTO indicators (defined by formal membership or

de facto participation).

Keywords: GATT/WTO; treatment effect; matching; permutation test; signed-rank test;

sensitivity analysis

JEL classification: F13; F14; C14; C21; C23

∗Tel: +65-6828-0830; fax: +65-6828-0833. E-mail address: [email protected] (P.-L. Chang).†Tel: +82-16-347-7867; E-mail address: [email protected]

1

1. INTRODUCTION

This paper joins the debate on the trade effect of GATT/WTO that was sparked off by

Rose’s (2004). In that study, Rose did an extensive statistical analysis of bilateral trade

1948-99, based on gravity-type econometric methods, to find that GATT/WTO-membership

dummies fail to reveal any statistically significant, and robust influence on the volume of

bilateral trade. Rose’s findings have been partly reversed by more recent studies: see, for

example, Tomz et al. (2007) and Subramanian and Wei (2007), and Rose (2006) for a survey.

Most studies in this literature, however, primarily rely on parametric estimation of gravity-

type equations. There are by now several theoretical papers justifying the gravity model, in

which the volume of trade between two countries is proportional to the product of their

economic size, and the factor of proportionality depends on measures of ‘trade resistance’

between them. See, for example, Anderson (1979), Bergstrand (1985), Deardorff (1998), and

Anderson and van Wincoop (2003). Although we may specify the set of factors that affect

trade resistance, it takes a great leap of faith to believe that these determinants of trade

resistance sum up log-linearly (and the error term associated with trade resistance is nicely

distributed as log-normal) as most empirical specifications of gravity equations assume. Al-

though gravity equations have gained popularity in empirical research due to its good fit

to data, this good prediction power does not necessarily imply correct model specifications.

This is partly evident in many applications of gravity-type models that result in contradic-

tory conclusions on particular determinants of trade resistance, such as the controversy in the

literature over the currency union effect (Persson, 2001; Rose, 2001), the free trade agreement

effect (Frankel, 1997; Baier and Bergstrand, 2007b), the national border effect (McCallum,

1995; Anderson and van Wincoop, 2003), and the GATT/WTO membership effect.

In addition to potential biases arising from arbitrary functional form assumptions of trade

resistance, another potential bias in gravity estimates arises from the fact that policies that

affect trade flows are potentially endogenous or self-selected. Examples include the formation

of currency unions (Persson, 2001) and that of free trade areas (Baier and Bergstrand, 2007b).

Similarly, the status of GATT/WTO membership is likely not exogenous. Whether countries

that trade more or that trade less are more likely to join the GATT/WTO (and hence

the direction of selection bias) is a priori unknown, although a researcher may reasonably

suspect a country’s accession decision to depend on trade resistance among other factors. For

2

example, Baier and Bergstrand (2004) find that the same factors that determine trade flows

tend to determine the formation of free trade areas as well. Although parametric approaches

exist to correct the selection bias, they are subject to misspecification pitfalls.

In this paper, we propose nonparametric techniques that do not assume correct specifi-

cation of ‘trade resistance’ and allow ‘selection on observables.’ Specifically, a pair-matching

estimator is used in the first step to estimate the ‘treatment’ effect of membership in the

GATT/WTO. Among many nonparametric approaches to evaluating the treatment effect of

a policy or program, the pair-matching method is relatively straightforward to implement.

Matching is widely used in the fields of labor and health economics. See, for example, Heck-

man et al. (1997), Imbens (2004), and Lee (2005), and applications in Heckman et al. (1998),

Lechner (2000), Lee et al. (2007), and Lu et al. (2001). Recently, different versions of match-

ing methods have also been applied to the study of the currency union effect (Persson, 2001)

and the free trade agreement effect (Baier and Bergstrand, 2008).

Applying the terminology in the matching literature to our current context, the state of

both countries being GATT/WTO members in a trading relationship is a ‘treatment,’ and

the dyad in the state is a ‘treated’ subject, in contrast with a ‘control’ subject where both

countries are nonmembers. In essence, a pair-matching method compares the ‘responses’

(trade flows) of a treated subject and a control subject who differ in their treatments but

are otherwise similar. The difference in their bilateral trade flows is then attributed to

the treatment effect of GATT/WTO membership. As argued earlier, we do have strong

theoretical foundations in the basic gravity equation that relates trade flows to economic size

and trade resistance, although we do not have a strong conviction in the functional form

of trade resistance. On the other hand, the empirical gravity literature has more or less a

consensus on the list of variables that are likely to have a bearing on trade resistance. This

set of variables serve as the set of covariates that we need in measuring the similarity between

any two subjects and in matching them. In particular, we use the same set of covariates as

in Rose (2004) to allow comparison with other studies in this literature that mostly use the

same set of trade resistance determinants (along with economic sizes and year dummies). By

using the matching method, however, we do not have to assume a particular functional form

for trade resistance; that avoids the bias stemming from parametric misspecifications.

A second advantage of using matching estimators as claimed above is that the match-

ing estimator is suited for non-random selection into treatment (i.e. membership in the

3

GATT/WTO). The absolute probability of selecting into membership is allowed to vary

across trading relationships and be different from 0.5 (i.e. random selection). Matching of

two subjects similar in economic sizes and in determinants of trade resistance removes the

difference in absolute selection probabilities based on observable covariates and minimizes the

bias stemming from selection on observables. By using matching, similarly, we do not have

to assume knowledge of the exact functional form for accession decision or for the probability

of selecting into GATT/WTO membership.

Although matching estimators are popular in practice, their asymptotic properties are not

fully understood (see, however, exceptions such as Abadie and Imbens (2006) for the case of

iid data). In practice, a standard t-statistic or a bootstrap procedure is often used to derive

the p-value or confidence intervals. Standard t-statistic is straightforward but theoretical

justifications in most cases are not available; bootstrap is computationally demanding and

is argued by Abadie and Imbens (2006) to be invalid. In this paper, we propose using

permutation tests and for good reasons. As noted above, asymptotic-based tests for matching

estimators have yet to be well developed. Permutation test, on the other hand, is an exact

inferential procedure conditional on the observed data. Under the null hypothesis of no

treatment effect, two subjects in a matched pair are ‘exchangeable’ in the labeling of their

treatment status (treated or untreated). Obtaining all possible permutations of the treatment

labels in all pairs, the exact p-value of the observed matching estimator can be computed by

placing it in the “empirical” distribution of the permutations. When the number of matched

pairs is large, we show that the exact p-value can be approximated by the standard t-test and

one can skip the computation time required for the permutation. Thus, implementation-wise,

the permutation test is straightforward (this also offers a theoretical justification for the use

of standard t-statistic albeit from an exact-inferential perspective). Furthermore, as will be

argued in the main text, permutation tests can accommodate a wide range of data structure

with heterogeneity and serial correlation, which are likely to be present in bilateral trade

data. To our knowledge, permutation tests have not been used in international economics

applications. See Pesarin (2001) and Good (2004) for an introduction to the concept, and

Ernst (2004) for a survey. Recently, Imbens and Rosenbaum (2005) showed that confidence

intervals constructed by permutation methods are far more reliable than other confidence

intervals in the returns-to-education literature.

In applying the matching estimator and the permutation test, one first uses the observed

4

numerical differences in trade flows across matched pairs to estimate the treatment effect and

then find its statistical significance using the permutation test. Alternatively, one may use

the signed ranks of these differences and derive the corresponding permutation distribution

of the Wilcoxon (1945) signed-rank test statistic. The derived treatment effect estimator has

the extra advantage of being robust to outliers. The signed-rank test statistic is also readily

amenable to a sensitivity analysis procedure we will introduce below.

As noted earlier, self-selection into GATT/WTO membership is not a problem in the

matching framework, if the selection is based on observable variables. However, if the se-

lection is also based on some unobservables that vary systematically across the treated and

control groups and if the unobservables also affect trade flows, then selection bias due to

unobservables arises. In other words, the treatment is not random conditional on the ob-

servables. This problem of ‘selection on unobservables’ applies to both parametric gravity

estimation and matching estimation. We propose using a sensitivity analysis following Rosen-

baum (2002) to evaluate the sensitivity of matching estimates obtained under the assumption

of no ‘selection on unobservables.’ This method examines how severe the unobserved selec-

tion problem must be to eliminate an original finding of positive (or negative) effects or to

overturn an original finding of zero effect. Similar sensitivity analyses have appeared in other

fields of economics; see, for example, Aakvik (2001), Imbens (2003), Hujer et al. (2004),

Altonji et al. (2005), and Lee et al. (2007).

Applying the nonparametric matching methods and permutation-based inference proce-

dures to Rose’s data, this paper reaches a conclusion that is in stark contrast to conclusions

reached by Rose (2004): membership in the GATT/WTO has large and significant trade-

promoting effects. We explore robustness of this result to various possible critiques. First,

the assumption of no ‘selection on unobservables,’ i.e., “random treatment conditional on the

observables” may fail if the set of observable variables that we use following Rose (2004) is

not appropriate. This is partly addressed by the Rosenbaum (2002) sensitivity analysis intro-

duced above. Alternatively, we also conduct restricted matching, where a potential match for

a subject is further limited to observations from the same dyad, the same year/period, or the

same relative development stage. This eliminates potential bias arising from unobservable

heterogeneity across dyads, years, GATT/WTO periods, or countries of different develop-

ment stages. Second, Tomz et al. (2007) emphasize the importance of de facto participation

in the GATT/WTO by colonies, newly independent nations, and provisional members, and

5

find strong GATT/WTO effects on trade when these types of nonmember participation is

taken into account. We conduct the same nonparametric analysis introduced above using the

data set of Tomz et al. (2007) and find even stronger results than those based on Rose’s data

set, validating the arguments of Tomz et al. (2007). Third, relative trade resistance rather

than absolute trade resistance is argued by some gravity theories to be more appropriate in

explaining trade flows, cf. Anderson and van Wincoop (2003), and multilateral resistance

terms resembling country-specific effects should be included in the list of explanatory vari-

ables. In a way, we have accounted for dyad-specific and hence country-specific effects when

we conduct the matching within the same dyad as introduced above. The strong effects of

GATT/WTO membership (or participation) remain.1

Recent studies of Helpman et al. (2007) and Felbermayr and Kohler (2007) have empha-

sized the importance of zero trade flows and potential bias in the estimates of GATT/WTO

membership effect using only observations with positive trade flows in estimating the gravity

equation. In particular, an analysis in Helpman et al. (2007) implies that this will result in

a downward bias in the estimates of treatment effects, as dyads who are not GATT/WTO

members but still observed trading with each other are likely to have lower unobserved trade

resistance. Thus, regression analysis of gravity equations should include an extra term, i.e.

the inverse Mills’ ratio, to account for the effect of sample truncation at zero. By using Rose’s

data set, we also use only observations with positive trade flows. With observations matched

on observable determinants of trade flows, the difference in the conditional mean levels of

trade flows due to these determinants are eliminated, and the remaining difference in the con-

ditional means between the treated and the untreated observations in a matched pair thus

includes the true treatment effect and the effect of unobservable trade resistance. The argu-

ment above suggests that unobservable trade resistance is likely to be systematically lower

for the untreated than the treated and that the GATT/WTO membership effect likely to be

underestimated. Thus, the matching estimates we will present are conservative estimates of

the GATT/WTO membership effect. Our nonparametric procedure does allow the possibil-

ity of departure from the assumption of no ‘selection on unobservables,’ and evaluates such1On the other hand, in the matching framework, we do not have a good way to control for time-varying

country-specific effects as emphasized by some parametric studies of gravity equations, cf. Subramanian andWei (2007). For example, for the observation unit ijt with trade flows between countries i and j at time t,there is no other observation and hence potential match that also contains information on both country-specificeffects (φit, φjt) at time t. See, however, recent studies of Baier and Bergstrand (2007a, 2008) for suggestionsto approximate the multilateral terms.

6

effects using the Rosenbaum (2002) sensitivity analysis. If a positive effect is found under

the assumption of no ‘selection on unobservables,’ the above argument of negative selection

will only strengthen the initial finding of a positive effect.

Helpman et al. (2007) also distinguish the direct partial effect of trade resistance on trade

flows from their indirect effect on trade flows through changes in the number of exporters.

In this paper, we do not attempt to make this distinction. In our view, the larger trade flows

due to an increase in the number of exporters should also be considered as part of the benefit

of a GATT/WTO membership. Thus, the matching estimates we will present refer to the

total effect of GATT/WTO membership, including both the direct and indirect effects.

The rest of this paper is organized as follows. Section 2 introduces the nonparametric

procedures. Section 3 explains the data used. Section 4 summarizes the available theories of

GATT/WTO and caveats of using membership to define involvement in the GATT/WTO.

Our estimation results and findings are reported in Sections 5 and 6. Section 7 concludes.

Technical details are relegated to the appendix in Section 8.

2. METHODOLOGY

Suppose there are N countries in the world. Let yijt denote the real bilateral trade volume

in logarithm between countries i and j in year t. Define xijt and εijt as the observed and

unobserved variables for dyad (i, j) in year t, respectively, that can affect both bilateral

trade volume and probability of selecting into GATT/WTO membership. The vector of

covariates xijt typically includes proxies for economic sizes and observed determinants of

trade resistance, while εijt summarizes unobserved trade resistance. We investigate two types

of GATT/WTO membership treatment and their effects: the effect when both countries are

GATT/WTO members (both-in), relative to when both are nonmembers (none-in), and the

effect when only one country is a GATT/WTO member (one-in), relative to when both

are nonmembers (none-in). Rose (2004) found that the Generalized System of Preferences

(GSP) had a larger effect than GATT/WTO membership. To verify this result, we will also

investigate the effect of GSP treatment: the effect when one country in a bilateral trading

relationship is a GSP beneficiary of the other country or vice versa, relative to when neither

is a GSP beneficiary of the other.

Note that we will use the word ‘dyad’ for an observation unit that comprises two coun-

7

tries between which trade volume is recorded and ‘pair’ for two observation units which are

considered a match by the matching method discussed below.

2.1 Mean Effects and Matching

To facilitate exposition, focus on the effect of both-in treatment. Let the observed treatment

status be dijt for dyad (i, j) in year t, where

dijt = 1 if both in and 0 if none in.

The analysis for the effect of one-in or GSP treatment is analogous. Define two ‘potential’

response variables:

y1ijt : potential treated response for dyad (i, j) in year t,

y0ijt : potential untreated response for dyad (i, j) in year t.

We will denote ‘dyad (i, j) in year t’ simply ’subject ijt’ from now on. It may be a little

awkward, but we need to imagine that subject ijt has (dijt, y1ijt, y

0ijt, xijt, εijt). Yet we observe

only either y1ijt or y0

ijt depending on dijt = 1 or 0. The group of observations with dijt = 1

that reveal only (y1ijt, xijt) is called the treatment group, and the group with dijt = 0 that

reveal only (y0ijt, xijt) is called the control group.

The subject-ijt treatment effect is y1ijt − y0

ijt, which is, however, not identified, as only

one of the two potential responses is observed for a subject. Instead, a popular choice is to

estimate the mean effect E(y1 − y0), which is identified by the group mean difference

E(y|d = 1)− E(y|d = 0) = E(y1|d = 1)−E(y0|d = 0) = E(y1 − y0) if (y0, y1)q d

where ‘q’ stands for statistical independence. In other words, the selection into treatment is

random, as far as the trade volume (y0, y1) is concerned. If the randomization of treatment

applies only to the potential baseline response, i.e. y0 q d, then

E(y|d = 1)− E(y|d = 0) = E(y1|d = 1)− E(y0|d = 0) = E(y1|d = 1)−E(y0|d = 1)

= E(y1 − y0|d = 1), which is the ‘effect on the treated ’.

8

Analogously, if y1 q d, then

E(y|d = 1)− E(y|d = 0) = E(y1 − y0|d = 0), which is the ‘effect on the untreated ’.

Since

E(y1 − y0) = E(y1 − y0|d = 0)P (d = 0) + E(y1 − y0|d = 1)P (d = 1),

if the effect on the treated is the same as the effect on the untreated, then both equal

E(y1−y0). Otherwise, the representative effect E(y1−y0) is a weighted average of the effect

on the treated and the effect on the untreated, with the probability of being treated and

untreated as their respective weight.

In observational data, treatment is self-selected. The factors that determine GATT/WTO

accession decisions may also affect potential trade volumes. Matching on the observed vari-

ables (xijt) helps removing the “overt selection bias” caused by observed difference across the

treatment and control groups. In terms of equations, matching is expressed by including x in

the conditioning set in the preceding equations. The above group mean difference equation

becomes

E(y|d = 1, x)−E(y|d = 0, x) = E(y1|d = 1, x)−E(y0|d = 0, x) = E(y1−y0|x) if (y0, y1)qd|x,

and the E(y1 − y0)-decomposition equation becomes

E(y1 − y0|x) = E(y1 − y0|d = 0, x)P (d = 0|x) + E(y1 − y0|d = 1, x)P (d = 1|x).

Once the x-conditional effect is found, x can be integrated out to yield a marginal effect;

here, the integration can be done with different integrators (i.e., weighting functions). For

the effect on the treated, the distribution of x|d = 1 is typically used to render

E(y1 − y0|d = 1) =∫

E(y1 − y0|d = 1, x)dF (x|d = 1)

where F (·|d = 1) denotes the distribution of x|d = 1.

The condition (y0, y1)q d|x requires that any unobserved variables that may affect mem-

bership decisions as well as trade volumes do not vary systematically across the treatment

and control groups. In other words, matched subjects in a pair are on average compara-

9

ble in their potential trade performance and their probability of selecting into GATT/WTO

membership. Thus, the observed untreated response of the untreated subject can be used

as a proxy for the counterfactual potential untreated response of the treated subject, and

the observed treated response of the treated subject as a proxy for the counterfactual poten-

tial treated response of the untreated subject. The condition (y0, y1) q d|x is effectively an

assumption of no “hidden selection bias” caused by systematic unobservable heterogeneity

across the treatment and control groups. While any effect of observable trade determinants

x on trade volumes and membership decisions is dealt with (and the overt bias removed) by

matching, there is no good way to deal with unobservable trade resistance ε, for it is unob-

served. Nevertheless, a sensitivity analysis following Rosenbaum (2002) can be conducted to

account for the difference in ε across the groups (and potential hidden biases); see Section 2.4.

In general, the effect on the treated is of greater interest or policy relevance than the

overall effect or the effect on the untreated. For example, GSP are preferences extended by

higher-income countries to less-developed countries. It is not relevant to estimate the effect

on the untreated, as GSP does not apply to all kinds of trading relationships (for example,

that between two poor countries). For the GATT/WTO effect, both the effect on the treated

(i.e., those who chose to join GATT/WTO) and the effect on the untreated (i.e., those who

chose not to join) seem of interest. Exactly for the unknown nature of the treatment effect

and its unknown dependence on unobserved variables, in general, it is more plausible for

the assumption y0 q d|x (required to identify the effect on the treated) to hold than y1 q d|x(required to identify the effect on the untreated). For example, while some unobservable trade

resistance may not affect the baseline trade volume, it may affect the membership decision

and the trade volume once the dyad are both in the multilateral institution. Heterogeneity

in this unobservable variable across the treatment and control groups does not invalidate

the assumption y0 q d|x as the variable is irrelevant for the baseline trade volume y0, but it

does invalidate y1 q d|x. Hence, we will place more emphasis on the estimation result of the

membership effect on the treated than the effect on the untreated or the mean effect for all.

So far, we introduced the treatment effect framework and the matching concept. In

practice, matching for the effect on the treated proceeds in the following steps (the effect

on the untreated can be handled analogously). First, a treated subject, say subject ijt, is

selected. Second, control subjects are selected who are the closest to the treated subject ijt in

terms of x. If only one control subject is selected, we get ‘pair-matching’; otherwise, ‘multiple-

10

matching.’ Since our conditioning variables x include continuous variables (cf. Section 3), the

likelihood of multiple-matching is negligible. We restrict our attention to the case of pair-

matching. Third, given pair-matching, suppose there are M pairs (where M is the number

of treated subjects that successfully find a match), and ym1 and ym2 are the trade volumes

of the two subjects in pair m ordered such that ym1 > ym2 without loss of generality. Then,

defining

sm = 1 if the first subject in pair m is treated and − 1 otherwise,

the effect on the treated can be estimated with a pair-matching estimator

D ≡ 1M

M∑

m=1

sm(ym1 − ym2) →p E(y1 − y0|d = 1) under y0 q d|x, (1)

which is simply the average of the difference between the treated trade volume and the

untreated trade volume across the M matched pairs, with the unobserved baseline trade

volume for the treated subject estimated by the observed trade volume of a comparable

control subject.

Some remarks are in order. First, the matching scheme can be reversed to result in

an estimator for the effect on the untreated: a control subject is selected first and then

a matching subject from the treatment group later. Second, a measure of the distance

‖xijt − xi′j′t′‖ between two subjects ijt and i′j′t′ in terms of x must be chosen. We use the

simple scale-normalized distance between xijt and xi′j′t′ . Third, for a treated subject, if there

is no good matching control, then the subject may be passed over; i.e., a ‘caliper’ c may be

set such that a treated subject with

mini′j′t′∈C

‖xijt − xi′j′t′‖ > c, where ‘i′j′t′ ∈ C’ means unit i′j′t′ in the control group C,

is discarded. The number of matched pairs M used in the matching estimator D decreases

accordingly. For more discussions on treatment effect and matching in general, see, for

example, Rosenbaum (2002) and Lee (2005).2

2This paper uses the multivariate distance matching (MDM), while Persson (2001) uses propensity scorematching (PSM) in studying the currency union effect. There is no set rule in the literature on choosingbetween the two. We choose the former for its simplicity. But see Rubin and Thomas (2000) for a recommen-dation of combining the two methods. In our case, the results we obtain based on MDM are very strong, sowe doubt that the findings will change with a different matching scheme.

11

2.2 Permutation Test for Matched Pairs

The test of no treatment effect may take on at least three different levels of interpretations,

from the strongest to the weakest: that there is zero effect for all subjects, y1ijt−y0

ijt = 0, ∀ijt,that there is zero effect on the distribution of trade volumes, F (y0

ijt, y1ijt) = F (y1

ijt, y0ijt), or

that there is zero mean effect, E(y1ijt− y0

ijt) = 0. In this paper, the permutation test we base

our inferences on invokes the second concept of exchangeability. Under the null hypothesis H0

of no effect, the treated and the untreated trade volume are exchangeable without affecting

the joint distribution of the two.

Suppose that matching has been done resulting in M pairs. Under the null hypothesis,

exchangeability implies that, given the data, all 2M possibilities to permute the two responses

in each pair are all equally likely with probability 2−M in the ‘permutation sample space.’

Here, permuting two responses in each pair is equivalent to assigning one response to the

treatment group and the other to the control group. As the concept does not invoke large-

sample theories but is conditional on the observed data, permutation inference is an exact

inferential procedure. Whether H0 is rejected or not depends on how extreme D is in the

permutation distribution—i.e., the p-value of D. In theory, this can be computed exactly as

12M

2M∑

k=1

1[Dk > D] if the H0-rejection region is in the upper tail

where Dk is a “D-like” effect estimator under permutation k.

When M is large (as is the case in our current applications), the number of potential

permutations is huge. In Appendix 8.1, we articulate how to approximate the exact p-value

of D when M is large. This is done by simulating a subset of permutation possibilities from

the complete permutation space. In the appendix, we also show that when M is large, the

exact p-value of D can be approximated by a normally distributed standard t-statistic. For

example, if the H0-rejection region is in the upper tail, it follows that

P (Dk > D) ' P

{N(0, 1) >

D

{∑Mm=1(ym1 − ym2)2/M2}1/2

}, (2)

Thus, permutation test is straightforward to implement, especially when M is large and the

normal approximation is appropriate. Incidentally, the result above also provides a theo-

retical justification for the common practice of using standard t-statistics in evaluating the

12

significance of matching estimators, despite the fact that it is derived from an exact inferential

perspective. We also introduce in the appendix how to obtain the confidence interval (CI) for

the matching estimator D by “inverting” the permutation test procedure (cf. Lehmann and

Romano, 2005). See Martiz (1995), Hollander and Wolfe (1999), and Lehmann and Romano

(2005) among many others, for more on permutation (or randomization) tests in general.

The use of permutation-based tests, in stead of asymptotic tests, is especially convenient

in the current context with a panel of bilateral trade data, which possibly have a complicated

data structure with serial and spatial dependence, rendering the derivation of the asymptotic

distribution of the matching estimator difficult, if not impossible. By relying on exchange-

ability as the null hypothesis of no effect, the permutation test can accommodate potentially

a wide range of data structure. For example, if the joint distribution of the treated and

untreated responses in matched pairs is normal, exchangeability holds if the x-conditional

variances of the two responses are the same. This allows for any form of correlation between

a treated response and an untreated response in any matched pair and across matched pairs,

and any form of heteroskedasticity across matched pairs.

2.3 Signed-Rank Test for Matched Pairs

In this section, we propose a robust version of the matching estimator and the permutation

test introduced above. Instead of relying on actual numerical differences in trade volumes, the

signed-rank test (cf. Wilcoxon, 1945) uses only the signed rank of these observed differences.

This renders the test more robust to outliers in x and y. Incidentally, the signed-rank test is

distribution-free; thus, findings based on the test hold unconditionally as well, which accords

external validity.3 In addition, the signed-rank test is easily amenable to the Rosenbaum

(2002) sensitivity analysis we are proposing in Section 2.4.

Applying the Wilcoxon (1945) signed-rank test to the current matching context, rank

|ym1 − ym2|, m = 1, ..., M , to denote the resulting ranks as r1, ..., rM . For instance, when

M = 3,

|y11 − y12| = 0.3, |y21 − y22| = 0.09, |y31 − y32| = 0.21 =⇒ r1 = 3, r2 = 1, r3 = 2.3The permutation test in Section 2.2 is only ‘conditionally distribution-free’—the permutation distribution

under the null hypothesis is conditional on (x′, y, d), which enters the variance V (D′k). Due to this conditioning,

any finding from the test is applicable only to the sample at hand. That is, the finding has ‘internal validity,’but not ‘external validity.’

13

The signed-rank test statistic is the sum of the ranks for the pairs where the treated subject

has the higher response:

R ≡M∑

m=1

rm1[sm(ym1 − ym2) > 0] =M∑

m=1

rm1[sm = 1].

The inference based on the signed-rank R statistic can similarly follow the permutation

inferential procedure. The permutated version R′ for R is

R′ ≡M∑

m=1

rm1[wmsm(ym1 − ym2) > 0] =M∑

m=1

rm1[wmsm > 0]

=M∑

m=1

rm(1[wm = 1, sm = 1] + 1[wm = −1, sm = −1]),

where wm, m = 1, ..., M , are iid random variables with P (wm = 1) = P (wm = −1) = 0.5.

In other words, if wm = −1, the two responses in pair m are switched (the treated response

assigned to the control group, and the untreated response to the treatment group); otherwise,

no switch. Observe

E(1[wm = 1, sm = 1] + 1[wm = −1, sm = −1]) =1[sm = 1]

2+

1[sm = −1]2

=12.

V (1[wm = 1, sm = 1] + 1[wm = −1, sm = −1]) =14.

Hence, because rm’s are fixed (the only thing “random” is permutation, i.e., artificial treat-

ment assignment wm), under the H0,

E(R′) =M∑

m=1

rm12

=12

M∑

m=1

rm =M(M + 1)

4,

V (R′) =M∑

m=1

r2m

14

=14

M∑

m=1

r2m =

M(M + 1)(2M + 1)24

.

When M is large, the null distribution of {R′ − E(R′)}/SD(R′) can be approximated by

N(0, 1). The normally approximated p-value for R is

P

{N(0, 1) >

R − M(M + 1)/4{M(M + 1)(2M + 1)/24}1/2

}. (3)

The CI’s for the treatment effect can be similarly obtained by inverting the test, following

14

similar arguments as in the permutation test, cf. Appendix 8.1. Conduct level-α tests with

different values of β using

Rβ − M(M + 1)/4{M(M + 1)(2M + 1)/24}1/2

, where Rβ ≡M∑

m=1

rmβ1[sm(ym1 − smβ − ym2) > 0] (4)

and rmβ is the rank of |ym1 − smβ − ym2|, m = 1, ..., M . The collection of β values that are

not rejected is the (1− α)100% confidence interval for β.

In contrast to the point effect estimator D, there is no point effect estimator available in

the signed-rank test. As a point estimator, one may take the Hodges and Lehmann (1963)

estimator, which is obtained by solving for β such that

Rβ =M(M + 1)

4{= E(R′)}. (5)

That is, after transforming the data with the effect estimate β, one obtains a transformed

signed-rank statistic Rβ that coincides with the mean of the statistic under the null.

2.4 Sensitivity Analysis with Signed-Rank Test

As discussed in Section 2.1, the identification of the mean treatment effect relies on the

condition (y0, y1) q d|x (or the weaker version y0 q d|x for the effect on the treated). This

condition implies that there is no systematic unobservable heterogeneity across the treatment

and control groups that affects GATT/WTO membership status as well as trade volumes,

once differences in observable trade determinants x are removed by matching. In practice,

hidden selection bias may result from an inappropriate choice of covariates x that omits

variables that should be included. Alternatively, missing or truncated data may also introduce

another source of bias. For example, not all dyads are observed in all years; this is a missing

variable problem, which is also a selection problem. As discussed in the introduction, the use

of only observations with positive trade flows could lead to a downward bias of the treatment

effect estimate, cf. Helpman et al. (2007). Regardless of the source for hidden bias, in theory,

there is no good way to correct the hidden selection bias caused by unobservables, without

making further assumptions on the nature of the bias. Rather, one may approach this issue

from a different perspective by asking instead: how serious must the hidden selection bias be

(and in a certain direction) to overturn the original conclusion. In this section, we introduce

15

such a sensitivity analysis (cf. Rosenbaum, 2002) that accounts for an unobserved confounder

ε that might affect the treatment status d.4 We introduce the concepts below.

When two subjects in a matched pair differ in ε that influences d, their absolute proba-

bilities of taking the treatment are not the same. In particular, following Rosenbaum (2002),

suppose that the odds ratio of taking the treatment is:

1Γ≤ odds of one subject being treated

odds of the other being treated≤ Γ, for some constant Γ ≥ 1 ∀ pair.

For instance, if the first subject’s probability of taking the treatment is 0.6 (0.6) and the

second subject’s probability is 0.5 (0.4), then the odds ratio is 0.6/0.40.5/0.5 = 1.5 (0.6/0.4

0.4/0.6 = 2.25).

The sensitivity analysis goes as follows. Initially, one proceeds under the assumption of no

unobserved difference (Γ = 1). Then Γ is increased from 1 to see how the initial conclusion is

affected. If it takes a large value of Γ to reverse the initial finding, i.e., if only a strong presence

of ε can overturn the initial conclusion, then the initial conclusion is deemed insensitive to ε.

Otherwise, if it takes only a small value of Γ, then the initial finding is deemed sensitive.

The question is then how “large” is large for Γ. Suppose the initial conclusion is overturned

at the critical value Γ∗ = 2.25. Imagine that ε were observed. By observing ε, one would

be able to tell who is more likely to be treated between the two subjects and ask whether

the probability difference could result in as much as an odds ratio of 2.25. If the answer is

no, Γ∗ = 2.25 is a large value. Thus, in a way, how large is large for Γ depends on what is

included in x. If most relevant variables are in x and if it is hard to think of any important

omitted variables in ε, then even a small value of Γ may be regarded as large. In using similar

sensitivity analysis, Aakvik (2001) seems to regard Γ = 1.5 ∼ 2 as large, while Hujer et al.

(2004) appear to base their discussions on lower numbers of Γ = 1.25 ∼ 1.5.

The above sensitivity analysis can be applied to the signed-rank test in a straightforward

manner. Define

p+ ≡ Γ1 + Γ

≥ 0.5 and p− ≡ 11 + Γ

≤ 0.5.

Further define R+ (R−) as the sum of M -many independent random variables where the mth

variable takes rm with probability p+ (p−) and 0 with probability 1− p+ (1− p−). Writing4For ε to cause a bias to the treatment effect estimator at hand, ε should affect both d and (y0, y1). The

sensitivity analysis of Rosenbaum (2002) looks, however, only at the ε’s influence on d, essentially presumingthat ε affects (y0, y1) as well. Thus, the sensitivity analysis is a conservative one, for the bias may not bepresent if in fact ε does not affect (y0, y1).

16

R+ as∑M

m=1 rmum, where P (um = 1) = p+ and P (um = 0) = 1− p+, we get

E(R+) =M∑

m=1

rmE(um) = p+M∑

m=1

rm =p+M(M + 1)

2

V (R+) =M∑

m=1

r2mV (um) = p+(1− p+)

M∑

m=1

r2m =

p+(1− p+)M(M + 1)(2M + 1)6

.

Doing analogously, we obtain

E(R−) =p−M(M + 1)

2and V (R−) =

p−(1− p−)M(M + 1)(2M + 1)6

.

The means and variances with p+ and p− include E(R′) and V (R′) as a special case when

p+ = p− = 1/2 and two subjects in any matched pair have equal probability of taking the

treatment. Suppose that the H0-rejection interval is in the upper tail. Rosenbaum (2002,

p.111) shows that

P (R+ ≥ a) ≥ P (R′ ≥ a) ≥ P (R− ≥ a).

Using these bounds, the p-value obtained under the assumption of no hidden bias can be

bounded by P (R+ ≥ a) in case of rejection.5

Specifically, suppose that the H0-rejection interval is in the upper tail, and the no-hidden-

bias p-value is

P{N(0, 1) ≥ R− E(R′)SD(R′)

} = 0.001,

leading to the rejection of H0 at level α > 0.001. Rewrite P (R+ ≥ a) ≥ P (R′ ≥ a) as

P{R+ −E(R+)SD(R+)

≥ a− E(R+)SD(R+)

} ≥ P{R′ −E(R′)SD(R′)

≥ a− E(R′)SD(R′)

}

' P{N(0, 1) ≥ a− E(R+)SD(R+)

} ≥ P{N(0, 1) ≥ a− E(R′)SD(R′)

}.

Replacing a in the above equation with the realized R gives the p-value of R on the right-hand

side and its bound on the left-hand side. The left-hand side can be obtained for different

values of Γ. Then find, at which value of Γ, the upper bound crosses the level α. If this5In case of acceptance, P (R− ≥ a) shows the possibility of rejection when ε is taken into account. For a

two-sided test, the p-value gets multiplied by 2. For a lower-tail test, subtract the last display from 1 to get

1− P (R+ ≥ a) ≤ 1− P (R′ ≥ a) ≤ 1− P (R− ≥ a) =⇒ P (R+ < a) ≤ P (R′ < a) ≤ P (R− < a),

which can be used for bounds.

17

happens at, say Γ∗ = 2, then check whether or not Γ = 2 is a large value. If Γ = 2 is deemed

large, then the initial finding of a positive treatment effect is insensitive to the unobserved

difference.

The relevant distribution (R+ or R−) to use for calculating the bound in the sensitivity

analysis indicates the direction of hidden selection bias that would undermine an initial finding

of treatment effect or reverse an initial finding of no effect. For example, if the finding is a

significantly positive effect, we only need to worry about ‘positive’ selection, where a subject

with a higher potential treatment effect is also more likely to be treated; thus, the relevant

distribution is R+ that embodies selection bias in this direction. On the other hand, if the

finding is a zero or significantly negative effect, then ‘negative’ selection, where a subject with

a lower potential treatment effect is also more likely to be treated, can reverse or weaken the

original finding; in this case, the sensitivity analysis in the direction of R− is applicable.

3. DATA

We apply our proposed methodologies to both the data set of Rose (2004) and that of Tomz

et al. (2007). Readers are referred to Rose (2004) for a detailed account of his data set.6

The data set of Tomz et al. (2007) differs from Rose (2004) in its definitions of GATT/WTO

dummies.7 The GATT/WTO dummies are defined in terms of a dyad’s participation status

in the system, instead of its formal membership status.

As discussed in the introduction, there are by now strong theoretical foundations in the

basic gravity model that relates bilateral trade volumes to economic size and trade resistance,

although we do not have strong conviction in the functional form of trade resistance. On the

other hand, the empirical gravity literature has more or less converged to a list of variables

that likely have a bearing on trade resistance. This set of variables serve as the set of

covariates xijt that we need in measuring the similarity between two subjects. In particular,

we use the same set of covariates as in Rose (2004) to allow comparison with other studies

in this literature that mostly use the same set of trade resistance determinants (along with

economic sizes and year dummies).

These conditioning variables or covariates xijt include: (the natural logarithm of) the

6Rose’s data set is available from his Web site (http://faculty.haas.berkeley.edu/arose/GATTdataStata.zip).7There are other minor differences, including some corrections to Rose’s data, as explained in Tomz et al.

(2007, Foonote 32).

18

distance between countries i and j, (the natural logarithm of) the product of real GDP’s of

dyad (i, j) in year t, (the natural logarithm of) the product of real per capita GDP’s of dyad

(i, j) in year t, a binary variable indicating whether dyad (i, j) share a common language,

a binary variable indicating whether dyad (i, j) share a land border, a discrete variable

indicating the number of landlocked countries in dyad (i, j), a discrete variable indicating the

number of island nations in dyad (i, j), (the natural logarithm of) the product of land areas of

dyad (i, j), a binary variable indicating whether dyad (i, j) were ever colonies after 1945 with

the same colonizer, a binary variable indicating whether country i is a colony of country j in

year t or vice versa, a binary variable indicating whether country i ever colonized country j or

vice versa, a binary variable indicating whether dyad (i, j) remained part of the same nation

during the sample, a binary variable indicating whether dyad (i, j) use the same currency

in year t, a binary variable indicating whether dyad (i, j) belong to the same regional trade

agreement, and a list of year dummies for t = 1948, . . . , 1999.

The response variable yijt in our matching framework measures (the natural logarithm of)

the average value of real bilateral trade volumes between countries i and j in year t. Three

treatment effects are investigated: both-in, one-in, and GSP. The treatment dummy variable

dijt corresponds to bothinijt, oneinijt, and GSPijt, respectively. They measure, respectively,

whether both countries i and j are GATT/WTO members in year t, whether only one of

the two countries (i, j) is a GATT/WTO member in year t, and whether country i is a GSP

beneficiary of country j or vice versa in year t. When the GSP effect is being investigated, the

other two dummy variables (bothinijt and oneinijt) become part of the conditioning variables

xijt and are added to the list of covariates. On the other hand, when the both-in or one-in

effect is being investigated, GSPijt is used also as a covariate.

The data cover 178 IMF trading entities and span from 1948 to 1999 (with gaps). This

amounts to a total of 234,597 observations.

4. WHAT TRADE EFFECTS TO EXPECT OF

GATT/WTO MEMBERSHIP

On theoretical grounds, there are several economic models that lead one to expect a positive

both-in effect on trade. Among others, the terms-of-trade argument (Johnson, 1953–1954;

Bagwell and Staiger, 1999, 2001) suggests that multilateral trade agreements help coordinat-

19

ing countries’ trade policies and preventing them from engaging in tariff retaliations driven

by terms-of-trade incentives. With the coordinating mechanism of GATT/WTO, countries

undertake reciprocal trade liberalizations that increase their bilateral trade volumes while

leaving a neutral impact on their bilateral terms of trade. This argument is supported by

a recent empirical study by Broda et al. (2006). They show that non-WTO countries set

import tariffs in a manner that exploits their market power and shifts the terms of trade

to their advantage against their trading partners. Another political-commitment argument

(Staiger and Tabellini, 1987, 1989, 1999) suggests that multilateral trade agreements help

national governments to commit themselves to liberalized trade policies, given the retalia-

tion threat from other countries if the commitment is not carried out. This enhances policy

credibility with respect to domestic private sectors and brings about efficient production and

trade structure.

However, as noted by many in the literature, cf. Rose (2006), using membership to mea-

sure the GATT/WTO trade effect is noisy for several reasons. First, tariff reductions and

policy liberalizations do not necessarily coincide with the date of accession. Some coun-

tries may liberalize beforehand while some may liberalize gradually within a phase-in period.

Second, some GATT/WTO members may extend their most-favored-nation (MFN) treat-

ment to nonmember trading partners as well. Third, some countries (particularly developing

countries) did not liberalize their trade policies in spite of their membership in the GATT

(although this is less the case under the WTO). Fourth, some sectors (e.g. oils and minerals)

face little protectionism with or without the GATT/WTO, while some (e.g. agriculture) are

highly protected with or without the GATT/WTO. The above considerations imply that we

are likely to observe uneven both-in effects across trading relationships that are affected by

these factors to various extents, as emphasized by the work of Subramanian and Wei (2007).

For example, the theoretical both-in effect is likely reduced in practice between a developing

and a developed member where the former does not reciprocate the latter’s lower tariffs.

Their bilateral trade volume may still increase as the developing member gains access to the

developed country’s market, but is less than the theoretical both-in effect if the tariff reduc-

tion is reciprocal. The above considerations also suggest that the beneficial both-in effect

in theory will be blurred by these factors in practice. One may find a zero effect, if these

factors are prevalent. On the other hand, if the theoretical both-in effect is strong enough

that dominates the above factors, we will nonetheless find a positive effect from the data.

20

5. BENCHMARK RESULTS

Table 1 reports the results of the baseline methodology applied to the data set of Rose (2004),

labeled ‘unrestricted matching,’ where the pool of potential matches for an observation are

observations with the opposite treatment status; no further restriction is imposed.

We estimate the treatment effect of both-in, one-in, and GSP, respectively. For the effect

of both-in treatment, the observations with oneinijt=1 are dropped. This leaves a remain-

ing sample with observations where countries in a dyad are either both in the GATT/WTO

(114,750 observations) or both outside the GATT/WTO (21,037 observations). Similarly,

for the treatment effect of one-in, the observations with bothinijt=1 are dropped. The re-

maining sample includes only observations where either only one country in a dyad is in the

GATT/WTO (98,810 observations) or both are outside the GATT/WTO (21,037 observa-

tions). Finally, for the treatment effect of GSP, the sample consists of the whole sample;

a small proportion (54,285 observations) of the sample have GSP arrangements and most

(180,312 observations) do not.

In any given matching exercise, we set the caliper such that only 100%, 80%, 60%, or

40% of matched pairs obtained are qualified for the estimation of the treatment effect. The

number of matched pairs obtained is indicated by M1 for the effect on the treated and M0

for that on the untreated. With the caliper choice of 60%, for example, matched pairs with a

distance (in terms of x) exceeding the upper 60 percentile of all matched pairs obtained are

discarded. The caliper choice of 100% is equivalent to using all matched pairs obtained.

The results are grouped into three general columns: permutation test, signed-rank test,

and sensitivity analysis, following our discussions of these methodologies in Section 2. In

particular, two sets of treatment effect estimates are reported: the first set presenting the

D statistic along with its p-value and CI’s (cf. equations 1, 2, and 6, respectively), and

the second set presenting the Hodges and Lehmann statistic along with its p-value and CI’s

calculated based on the signed-rank R statistic (cf. equations 5, 3, and 4, respectively).8

The third column reports the results on the Rosenbaum (2002) sensitivity analysis for the

signed-rank test based on a significance level of α = 0.05 in a one-sided or two-sided test.8In Section 2.2, we introduced both simulation and normal approximations to calculating the exact p-value

of the D statistic. Although not mentioned explicitly in Section 2.3, in addition to the normal approximation, itis also possible to approximate the exact p-value of the signed-rank R statistic based on simulated permutationsamples. We carried out both approaches in all matching exercises and found almost identical results (whichis expected as the sample size is large). Thus, we report only the normally-approximated p-values andcorresponding CI’s in the tables.

21

Refer to the both-in treatment effect. We note that the both-in effect on the treated

is large and significant. Both sets of point effect estimates report similarly large positive

effects. For example, the point estimates imply that membership in the GATT/WTO by

both countries raises bilateral trade volume by 71% (= e0.535 − 1) to 279% (= e1.332 − 1) for

dyads who both chose to be in the GATT/WTO. The point estimates are all significantly

different from zero, regardless of the caliper choice, as indicated by their corresponding p-

values or confidence intervals.

How robust are these estimates to potential hidden selection bias? Since the finding is a

positive effect, only positive selection is a concern, where the treated subject that is observed

to have higher trade volumes on average than the untreated is also more likely to be both in

the GATT/WTO. Results of the sensitivity analysis indicate that the positive both-in effect

is robust to such positive selection (cf. R+) to the extent that the odds of the treated is not

more than 2.081 times the odds of the untreated to be both in the GATT/WTO (by the 80%

caliper and the two-sided test). The robustness is stronger with a one-sided test (naturally)

and with a larger caliper choice, and ranges from 1.467 to 2.434. These figures are not small

compared with typical critical thresholds (1.25 to 1.5) used in other literatures, cf. Aakvik

(2001) and Hujer et al. (2004). In addition to the Rosenbaum (2002) sensitivity analysis, we

shall discuss further refinements of the matching procedure in Section 6.1 to reduce potential

sources of unobservable heterogeneity across the treatment and control groups.

How about the potential effect on trade for dyads who chose not to join the GATT/WTO?

The estimates of both-in effect on the untreated suggest that bilateral trade volumes would

have increased by 15% (= e0.138 − 1) to 40% (= e0.337 − 1) if both countries in these dyads

were to join the GATT/WTO. The effects are smaller compared to the effect on the treated

but they are still significantly positive. At the same time, these estimates are also less robust

to potential hidden biases. Overall, the mean both-in effect on all trading relationships

is positive and significant, with the estimates ranging from 49% (= e0.399 − 1) to 224%

(= e1.175 − 1). They are reasonably robust to hidden biases.

As argued in Section 2.1, the identification condition (y0qd|x) for applying the matching

estimator to the effect on the treated is more likely to hold than that (y1 q d|x) for the

effect on the untreated. Thus, we will place less emphasis on estimates of the effect on the

untreated (and hence the effect on all, which includes the effect on the untreated) in the

following discussions, although their results will continue to be reported in the tables for

22

readers’ information.

We turn to the one-in effect. Unlike the both-in effect, where one may expect a positive

or a zero effect at worst, a priori, the one-in effect can take either sign. On one hand, import

diversion by the new member from its nonmember trading partner to other member trading

partners may lower the dyad’s bilateral trade volumes. On the other hand, in many cases,

when a country joins the GATT/WTO, its tariff reductions (and other policy liberalizations)

offered to existing members on a MFN basis are also extended to nonmember trading partners.

In this case, imports increase from all sources, including that of existing nonmember trading

partners. Furthermore, when a country gains access to markets of existing GATT/WTO

members with the newly acquired membership, it may increase imports of inputs necessary

for the production of exports to these destinations. Some of these additional imports may

fall on third nonmember countries. For example, with the accession into WTO, China may

increase imports of oil from Iran in its expansion of production and export activities. The

figures in Table 1 suggest that the one-in effect on the treated is overall positive; the estimates

range from 38% (= e0.325−1) to 117% (= e0.773−1). These estimates of the one-in effect are

smaller than the both-in effect but are positive and significant. Thus, our estimation results

suggest that the trade-creating effect of GATT/WTO membership dominates potential trade-

diverting effects when one country in a dyad unilaterally joins the GATT/WTO.

Relative to GATT/WTO membership, a preferential GSP scheme is also found to promote

bilateral trade, by a factor of 79% (= e0.581−1) to 134% (= e0.851−1) for trading relationships

that have applied the scheme in the past. These effect estimates are smaller than the both-

in effect and larger than the one-in effect overall. The finding of this ranking of the three

treatment effects seems reasonable. Since GSP are unilateral trade preferences extended only

from a higher-income country to its poor trading partners, its likely effect on bilateral trade

volumes is a priori smaller than if both the rich and the poor countries in a dyad lower their

import restrictions against each other, which happens presumably if both of them join the

GATT/WTO. On the other hand, any trade-promoting effect of one-in membership is, as

argued in the previous paragraph, indirect and conditional on the spillover of MFN treatment

and on the dyad’s initial trade pattern, while the effect of GSP is directly derived from a

straightforward reduction of dyad-specific trade resistance. As we shall see, this ranking

(both-in effect > GSP effect > one-in effect) holds in general regardless of refinements to the

matching procedures or variations in the data used (although the one-in effect is sometimes

23

larger than the GSP effect).

Readers may notice the relative large range of both-in effect estimates obtained with

different choices of calipers, in contrast with the relative narrow range of GSP effect estimates.

This, in a way, reflects likely uneven both-in effects across subjects as discussed in Section 4.

This is so that the lower bound of the both-in effect estimates can be smaller than that of

the GSP effect estimates. As we shall see, this is not always the case.

It may be also helpful to point out that the positive and stronger trade effect of both-in

is shared by a larger number of bilateral trading relationships (114, 750), than that of GSP

(54, 285). Thus, either on the average or in the aggregate, our estimation results suggest that

the realized trade-creating effect of GATT/WTO membership is larger than GSP.

For the GSP effect, we did not report the effect on the untreated, as GSP does not

apply to all kinds of trading relationships. For example, it is not relevant to propose a GSP

arrangement between two poor countries. On the other hand, the estimates of the GSP

effect on the treated reported above in Table 1 should be taken with a grain of salt, as the

conditioning set of covariates do not control for the development stages of the two countries

in a dyad. The existence of a GSP arrangement between a dyad and their bilateral trade

volumes are both very likely dependent on their relative development stage, which poses

potential selection bias. We shall see a refinement to the matching procedure in Section 6.1

to address this concern.

6. ROBUSTNESS CHECK

6.1 Restricted Matching

In this section, we refine the baseline matching procedure to address some potential sources

of selection bias that may remain conditional on the set of covariates x. Although we did

the Rosenbaum (2002) analysis in the previous section to assess the sensitivity of the bench-

mark results to whatever selection bias may remain, the analysis itself does not remove the

bias. In the literature, four potential sources of bias seem to be of major concern. They

are systematic unobservable heterogeneity across dyads, across years, across GATT/WTO

negotiating rounds, and across developing and developed countries. We refine the baseline

matching procedure and restrict the potential match for an observation to a subset of obser-

vations that have the opposite treatment status (as in the case of unrestricted matching) and

24

further satisfy the criterion that the match must also be from the same dyad, the same year,

the same time period defined according to GATT/WTO negotiating rounds, and the same

combination of relative development stage, respectively. By restricting the potential match

to the observations of the specified criterion (say, the same dyad), we are accounting for the

possibility that systematic unobservable differences exist (say, across dyads) that influence

bilateral trade volumes as well as decisions to join the GATT/WTO or decisions to extend

GSP. Restricted matching, say within the dyad, removes the likely dyad-specific effect on

trade as well as its effect on the selection decision into treatment.

Table 2 reports the results for ‘matching within dyad’ based on the data set of Rose (2004).

In this case, the pool of potential matches for an observation are restricted to observations

with the opposite treatment status and from the same dyad. Some dyads may be both

in the GATT/WTO, be both outside the GATT/WTO, or have only one of them in the

GATT/WTO throughout the sampling years (1948 to 1999). Alternatively, some dyads may

have only one of them in the GATT/WTO for some period of the sampling years and then

be both in the GATT/WTO throughout the rest of the sampling years. In these cases, these

dyads do not have qualified control subjects or treated subjects, and therefore are dropped

from the sample. This explains the much smaller sample size shown in Table 2. While the

current estimates suggest overall smaller treatment effects on the treated, the trade-creating

both-in effect continues to be economically and statistically significant, and larger than either

the one-in effect or the GSP effect. The estimates of the both-in effect on the treated range

from 114% (= e0.760 − 1) to 171% (= e0.996 − 1), the estimates of the one-in effect on the

treated range from 37% (= e0.314 − 1) to 70% (= e0.532 − 1), and that for GSP (in this case,

smaller than the one-in effect) from 30% (= e0.259 − 1) to 64% (= e0.492 − 1).

In Table 3, ‘matching within year,’ the matching is restricted to observations from the

same year and thus controls for possible year-specific effects. In this case, the year dummies

in x are redundant and are dropped from the list of covariates. The results come across as

almost the same as those in Table 1. This indicates that in the case of ‘unrestricted matching,’

the two subjects in a matched pair are often from the same year; thus, the estimates in Table 1

pick up mostly cross-sectional variations. This is not surprising, as the set of covariates x

in ‘unrestricted matching’ include year dummies, which encourages matching observations

of the same year. A further look into the data (not shown in the table) at every five-year

interval (1950, 1955, . . . , 1995) shows that the positive both-in or one-in effect is not lumpy in

25

a few particular years and are felt throughout the years, except 1975 and 1995, when there is

a dip in the membership effects. On the other hand, the GSP effect is low in 1970 (its early

phase) and 1995. The same pattern remains as the data set of Tomz et al. (2007) is used, cf.

Section 6.2.

In contrast with the estimates in Table 3 that measure cross-sectional (or ‘between’)

variations, the estimates in Table 2 with ‘matching within dyad’ measure time-series (or

‘within’) variations. Both ‘between’ and ‘within’ variations indicate that there are significant

gains in trade volumes by joining the GATT/WTO. Drawing a comparison with the finding

in the literature, readers may notice that the estimates presented above based on ‘within’

variations are overall smaller than the estimates based on ‘between’ variations. This seems at

odd with the results in Rose (2004), where he finds stronger effects with fixed-effect estimation

(which reflects ‘within’ variations in a panel framework) than with OLS estimation (which

reflects both ‘within’ and ‘between’ variations). We shall see that this relative ranking is not

universal and changes as the data set of Tomz et al. (2007) is used. It is also helpful to note

that using the same data set of Rose (2004), Subramanian and Wei (2007) also find smaller

‘within’ effects than ‘between’ effects.

The difference in ‘within’ and ‘between’ effects may reflect different theoretical effects

along the time-series and the cross-sectional dimensions. Alternatively, theoretical effects

may be the same along these two dimensions, but empirical factors as discussed in Section 4

obscure the theoretical effect to different extents in these two dimensions. For example, if

the phenomenon of liberalizations prior to or later than the date of accession is prevalent,

‘within’ estimates of the theoretical both-in effect based on the date of accession will be

biased downward. The earlier the advance or the longer the phase-in period, the stronger

the downward bias of the ‘within’ effect estimate. On the other hand, ‘between’ estimates

of the both-in effect based on formal membership are likely affected by the other problems

highlighted in Section 4. For example, comparison of two developing member countries

that do not make significant trade liberalizations with two other comparable developing

nonmember countries implies a zero both-in effect. The stronger the presence of these dyads,

the smaller the ‘between’ estimate of the overall both-in effect. A priori, it is difficult to say

which of the two effect estimates – ‘within’ or ‘between’ – is likely to be larger or smaller.

It is also understandable that the ranking of the two effect estimates can reverse with a

different definition of involvement in the GATT/WTO, if that changes the date of associated

26

treatment and the status of treatment for a significant number of observations.

Table 4 presents the results for ‘matching within period’ based on Rose’s data. The pool

of potential matches for an observation are restricted to observations with the opposite treat-

ment status and from the same period, where the periods are defined according to the various

rounds of trade negotiations sponsored by the GATT/WTO. They are: 1948 (Before Annecy

round), 1949-1951 (Annecy to Torquay round), 1952-1956 (Torquay to Geneva round), 1957-

1961 (Geneva to Dillon round), 1962-1967 (Dillon to Kennedy round), 1968-1979 (Kennedy

to Tokyo round), 1980-1994 (Tokyo to Uruguay round), and 1995-(After Uruguay round).

Given our earlier observations that matched pairs in the case of ‘unrestricted matching’ often

occur among cross-sectional subjects from the same year, it is no surprise to see that the

current estimation results are almost the same as in the case of ‘unrestricted matching’; the

extra criterion of matching within the same period in effect does not impose extra restriction

in most cases.

Table 5 reports the final set of results, ‘matching within relative development stage,’ based

on Rose’s data. In this matching exercise, the pool of potential matches for an observation are

restricted to observations with the opposite treatment status and from the same combination

of relative development stage. The combinations include: low-income/low-income dyads, low-

income/middle-income dyads, low-income/high-income dyads, middle-income/middle-income

dyads, middle-income/high-income dyads, and high-income/high-income dyads. Observa-

tions without a qualified match are discarded. The reason for imposing the extra criterion in

matching is that some may argue that the trade structure between a dyad of developed coun-

tries is likely to be systematically different from that between a dyad of developed/developing

countries (e.g. intra-industry trade versus inter-industry trade) and are not comparable in

terms of their potential trade volumes with or without the membership, even after control-

ling for gravity types of covariates. At the same time, the probabilities of being in the

GATT/WTO may vary systematically across development stages. In this case, matching

within the same relative development stage removes this source of potential bias in mem-

bership effect estimates. As argued earlier, similar critique applies to estimating the GSP

treatment effect and possibly more, given that GSP is directly dependent on a dyad’s relative

development stage. As the set of covariates in this exercise also include year dummies, we

compare the results here with those in Table 1 or 3. With the extra restriction on matching,

the estimates of the effect on the treated are smaller overall, although the relative ranking

27

remains the same. The current estimates suggest that membership raises bilateral trade by

a factor of 44% (= e0.364 − 1) to 208% (= e1.124 − 1) for dyads that both chose to be in the

GATT/WTO, and by a smaller factor 27% (= e0.239 − 1) to 92% (= e0.650 − 1) for dyads

where only one of them chose to be in the GATT/WTO. In comparison, GSP is estimated

to raise bilateral trade by a factor of 46% (= e0.378 − 1) to 108% (= e0.732 − 1).

Are the positive membership effects shared evenly among countries of different develop-

ment stages, or are they concentrated on particular subsets of countries? A further look into

the data (not reported in the table) shows that the positive effects are indeed concentrated on

dyads of middle-income/middle-income, middle-income/high-income, and high-income/high-

income countries. The low-income countries do not benefit much from a membership in

the GATT/WTO. Similar lumpy patterns were found in the work of Subramanian and Wei

(2007), although in the current work, we still find a positive average effect while they found

no positive average effect. This asymmetry may reflect two empirical concerns that major

export sectors (e.g. agriculture) of low-income countries still face steep protectionism from

the rich world with or without the GATT/WTO and that the low-income countries them-

selves do not significantly liberalize their import sectors despite their membership in the

GATT/WTO. Interestingly, the GSP effect also shows some lumpy patterns, where the posi-

tive effect is mostly driven by dyads that involve a high-income country. Similar observations

apply to participation instead of membership when the data set of Tomz et al. (2007) is used,

cf. Section 6.2.

How robust are the above results with restricted matching? When the matching criterion

becomes more stringent such that further potential sources of selection bias are minimized,

one may accept a lower critical threshold for Γ∗ than in the case of unrestricted matching, as

the remaining possibility of selection bias is lower. Both ‘matching within dyad’ and ‘match-

ing within relative development stage’ impose effective constraints relative to the benchmark.

In the latter case, the robustness of the treatment effect estimate is lower in general than the

benchmark. The tolerance threshold (Γ∗) for positive selection now stands at 1.256, 1.197,

and 1.530 for the lower-bound estimates of the both-in effect, the one-in effect, and the GSP

effect, respectively. The GSP estimate seems rather robust in this matching exercise, even

with the extra condition on matching within the same relative development stage, which is

argued above to be a desirable criterion for GSP estimation. Thus, we regard this set of GSP

estimates as our favorite ones based on the data of Rose (2004).

28

On the other hand, with the more stringent criterion of ‘matching within dyad,’ the

robustness of the membership effect estimates strengthens (Γ∗ = 2.503 at the minimum for

the both-in effect, and Γ∗ = 1.508 at the minimum for the one-in effect, cf. Table 2). Thus,

it is relatively comfortable for us to accept the ‘within’ estimates of the both-in and one-in

effects, as the tolerance level for hidden selection bias is higher despite the fact that the

possibility of remaining selection bias is lower with the extra matching criterion.

6.2 Participation instead of Formal Membership

What if the data set of Tomz et al. (2007) is used instead? In their work, Tomz et al. (2007)

stress the importance of de facto participation in the multilateral system by nonmembers

such as colonies, newly independent colonies, and provisional members. They share to a large

extent the same set of rights and obligations under the agreement as formal members. Tomz

et al. (2007) classify these territories as nonmember participants and define participation

to include both formal membership and nonmember participation. Without changing the

estimation framework of Rose (2004), they find significant participation effect on trade.

Tables 6 to 10 report the results based on the data set of Tomz et al. (2007). In this case,

the treatments are participation in the GATT/WTO by both countries in a dyad (both-p)

and by only one country in a dyad (one-p). Because when the GSP effect is estimated,

the participation status (bothpijt and onepijt) of a dyad, instead of their membership status

(bothinijt and oneinijt), is used as part of the conditioning covariates, the results for GSP

are not identical to those based on Rose’s data.9

It is fair to say that participation effects are overall stronger than membership effects on

trade. Both the ‘between’ estimates (cf. Tables 6, 8, 9) and the ‘within’ estimates (cf. Table

7) are larger than corresponding estimates obtained based on Rose’s data. They are also more

robust to hidden selection bias, as indicated by the sensitivity analysis. In particular, the

‘within’ estimates in Table 7 are so much larger that they are now larger than the ‘between’

estimates (cf. Section 6.1). The exception to this pattern of overall stronger participation

effects than membership effects occurs when the matching is restricted within the same

relative development stage. Among other explanations, it is possible that there is a stronger

positive selection into participation than into formal membership across development stages,9Tomz et al. (2007) also corrected some coding errors in the data set of Rose (2004), in particular, the income

status and geography indicator of some territories, which is another factor that may affect the estimation resultsof GSP.

29

such that the estimates of the participation effect drop by a larger extent than the estimates of

the membership effect (compared with their respective benchmark of unrestricted matching)

when this potential bias is removed by matching within the same relative development stage.

On the other hand, the estimates of the GSP effect are overall smaller based on the data

set of Tomz et al. (2007) than on that of Rose (2004), regardless of the matching criteria.

However, the difference is not large. This is possibly due to the fact that the change in

GATT/WTO indicators from membership to participation affects the GSP estimates only

indirectly through conditioning covariates that include many other variables.

In all, the current finding of an overall larger participation effect than membership effect

is consistent with the contrasting results found by Tomz et al. (2007) and by Rose (2004).

7. CONCLUSION

Since the publication of the work by Rose (2004), his finding of no significant GATT/WTO

effect on trade volumes has been partly reversed in a few dimensions: there may be nil effects

on average, but there are significant and positive effects for some subsets of countries, cf.

Subramanian and Wei (2007); there may be nil membership effects, but there are significant

and positive participation effects, cf. Tomz et al. (2007); there may be nil cross-sectional

effects, but there are some positive time-series effects, cf. Rose (2004, 2006); there may be

nil intensive effects, but there are some positive extensive effects, cf. Felbermayr and Kohler

(2007) and Helpman et al. (2007).

In this paper, we found that there are some uneven distributions of membership effects

across countries of different development stages, but on average, there are still significant and

positive membership effects; there are stronger participation effects than membership effects,

but membership effects are positive; there are positive (membership or participation) effects

along both cross-sectional and time-series dimensions. We recognize that our estimates based

on positive trade flows still capture only the intensive margin of GATT/WTO effects, but

the effects will only be strengthened with the extensive-margin argument. Thus, despite all

the caveats in the operation of the GATT/WTO institution and the noisiness of measuring

trade liberalizations by official membership or participation, the data do reveal an overall

positive effect on trade of being part of the multilateral system.

30

8. APPENDIX

8.1 Permutation Test for Matched Pairs

One may simulate a subset of permutation possibilities as follows. Suppose that matching has

been done resulting in M pairs. Generate M -many iid random variables wm, m = 1, ..., M ,

with P (wm = 1) = P (wm = −1) = 0.5. If wm = −1, the two responses in pair m are

switched (the treated response assigned to the control group, and the untreated response to

the treatment group); otherwise, no switch. For pseudo sample k obtained this way, obtain

the pseudo effect estimator

D′k ≡

1M

M∑

m=1

wmsm(ym1 − ym2)

where only wm’s are random. The p-value of D is then

1K

K∑

k=1

1[D′k > D] if the H0-rejection region is in the upper tail,

where K is a number much smaller than 2M , say K = 1000. In principle, a pseudo sample

should not be repeated: ‘sampling from the 2M possibilities without replacement’ is required.

In practice, the chance of a pseudo sample being repeated in K (= 1000) pseudo samples is

negligible and can be ignored. The proof is available from the authors upon request.

So far we showed that the exact p-value can be approximated by simulating K pseudo

samples. A further simplification is possible by applying the central limit theorem (CLT) to

wm’s. Observe

E(D′k) = 0 and V (D′

k) = E(D′2k ) =

1M2

M∑

m=1

E{w2ms2

m(ym1−ym2)2} =∑M

m=1(ym1 − ym2)2

M2.

When M is large, the normally approximated p-value of D is

P (D′k > D) = P{ D′

k

{∑Mm=1(ym1 − ym2)2/M2}1/2

>D

{∑Mm=1(ym1 − ym2)2/M2}1/2

}

' P{N(0, 1) >D

{∑Mm=1(ym1 − ym2)2/M2}1/2

},

if the H0-rejection region is in the upper tail. Using this normal approximation, one can skip

31

the computation time required of simulating wm’s and getting D′k, k = 1, . . . ,K.

As is well known, we can obtain a confidence interval (CI) by “inverting” the test proce-

dure (see for example, Lehmann and Romano, 2005). For instance, suppose that the treat-

ment effect is the same for all pairs: the effect increases the treated response by a constant,

say β. In this case, replace ym1 with ym1−β when sm = 1 or ym2 with ym2−β when sm = −1

to obtain

Dβ ≡ 1M

M∑

m=1

sm(ym1 − smβ − ym2)

and restore the no-effect situation. Define accordingly

D′β ≡

1M

M∑

m=1

wmsm(ym1 − smβ − ym2)

to observe

E(D′β) = 0 and V (D′

β) = E(D′2β ) =

∑Mm=1(ym1 − smβ − ym2)2

M2.

Now conduct level-α tests with

Dβ

{∑Mm=1(ym1 − smβ − ym2)2/M2}1/2

(6)

as β varies. The collection of β values that are not rejected is the (1 − α)100% confidence

interval for β.

REFERENCES

Aakvik, A., 2001. Bounding a matching estimator: the case of a Norwegian training program.

Oxford Bulletin of Economics and Statistics 63, 115–143.

Abadie, A., Imbens, G. W., 2006. Large sample properties of matching estimators for average

treatment effects. Econometrica 74 (1), 235–267.

Altonji, J. G., Elder, T. E., Taber, C. R., 2005. Selection on observed and unobserved vari-

ables: assessing the effectiveness of Catholic schools. Journal of Political Economy 113,

151–184.

32

Anderson, J. E., 1979. A theoretical foundation for the gravity equation. American Economic

Review 69 (1), 106–16.

Anderson, J. E., van Wincoop, E., 2003. Gravity with gravitas: A solution to the border

puzzle. American Economic Review 93 (1), 170–92.

Bagwell, K., Staiger, R. W., 1999. An economic theory of GATT. American Economic Review

89 (1), 215–248.

Bagwell, K., Staiger, R. W., 2001. Reciprocity, non-discrimination and preferential agree-

ments in the multilateral trading system. European Journal of Political Economy 17, 281–

325.

Baier, S. L., Bergstrand, J. H., 2004. Economic determinants of free trade agreements. Journal

of International Economics 64 (1), 29–63.

Baier, S. L., Bergstrand, J. H., 2007a. Bonus vetus OLS: A simple method for approximating

international trade-cost effects using the gravity equation, working paper.

Baier, S. L., Bergstrand, J. H., 2007b. Do free trade agreements actually increase members’

international trade? Journal of International Economics 71, 72–95.

Baier, S. L., Bergstrand, J. H., 2008. Estimating the effects of free trade agreements on trade

flows using matching econometrics, working paper.

Bergstrand, J. H., 1985. The gravity equation in international trade: Some microeconomic

foundations and empirical evidence. Review of Economics and Statistics 67 (3), 474–481.

Broda, C., ao, N. L., Weinstein, D. E., 2006. Optimal tariffs: The evidence. American Eco-

nomic Review, forthcoming.

Deardorff, A. V., 1998. Determinants of bilateral trade: Does gravity work in a neoclassical

world? In: Frankel, J. A. (Ed.), The Regionalization of the World Economy. University of

Chicago Press, Chicago, pp. 7–22.

Ernst, M. D., 2004. Permutation methods: A basis for exact inference. Statistical Science

19 (4), 676–685.

33

Felbermayr, G., Kohler, W., 2007. Does WTO membership make a difference at the extensive

margin of world trade? CESifo Working Paper No. 1898.

Frankel, J. A., 1997. Regional Trading Blocs. Institute for International Economics, Wash-

ington, DC.

Good, P. I., 2004. Permutation, Parametric, and Bootstrap Tests of Hypotheses, 3rd Edition.

Springer Series in Statistics. Springer.

Heckman, J., Ichimura, H., Smith, J., Todd, P., 1998. Characterizing selection bias using

experimental data. Econometrica 66 (5), 1017–1098.

Heckman, J. J., Ichimura, H., Todd, P. E., 1997. Matching as an econometric evaluation

estimator: Evidence from evaluating a job training programme. The Review of Economic

Studies 64 (4), 605–654.

Helpman, E., Melitz, M., Rubinstein, Y., 2007. Estimating trade flows: Trading partners and

trading volumes. Quarterly Journal of Economics, forthcoming.

Hodges, J., Lehmann, E., 1963. Estimates of location based on rank tests. Annals of Mathe-

matical Statistics 34, 598–611.

Hollander, M., Wolfe, D. A., 1999. Nonparametric statistical methods, 2nd Edition. Wiley.

Hujer, R., Caliendo, M., Thomsen, S. L., 2004. New evidence on the effects of job creation

schemes in Germany – a matching approach with threefold heterogeneity. Research in

Economics 58, 257–302.

Imbens, G. W., 2003. Sensitivity to exogeneity assumptions in program evaluation. American

Economic Review (Papers and Proceedings) 93, 126–132.

Imbens, G. W., 2004. Nonparametric estimation of average treatment effects under exogene-

ity: A review. The Review of Economics and Statistics 86 (1), 4–29.

Imbens, G. W., Rosenbaum, P. R., 2005. Robust, accurate confidence intervals with a weak

instrument: quarter of birth and education. Journal of the Royal Statistical Society (Series

A) 168, 109–126.

34

Johnson, H. G., 1953–1954. Optimum tariffs and retaliation. Review of Economic Studies

21 (2), 142–153.

Lechner, M., 2000. An evaluation of public-sector-sponsored continuous vocational training

programs in east germany. The Journal of Human Resources 35 (2), 347–375.

Lee, M.-J., 2005. Micro-econometrics for policy, program, and treatment effects. Oxford Uni-

versity Press.

Lee, M.-J., Hakkinen, U., Rosenqvist, G., 2007. Finding the best treatment under heavy

censoring and hidden bias. Journal of the Royal Statistical Society (Series A) 170, 133–

147.

Lehmann, E. L., Romano, J. P., 2005. Testing statistical hypotheses. Springer.

Lu, B., Zanutto, E., Hornik, R., Rosenbaum, P. R., 2001. Matching with doses in an observa-

tional study of a media campaign against drug abuse. Journal of the American Statistical

Association 96 (456), 1245–1253.

Martiz, J., 1995. Distribution-free statistical methods, 2nd Edition. Chapman&Hall.

McCallum, J., 1995. National borders matter: Canada-U.S. regional trade patterns. American

Economic Review 85 (3), 615–623.

Persson, T., 2001. Currency unions and trade: how large is the treatment effect? Economic

Policy 33, 435–448.

Pesarin, F., 2001. Multivariate Permutation Tests: With Applications in Biostatistics. Wiley.

Rose, A. K., 2001. Currency unions and trade: the effect is large. Economic Policy 33, 449–

461.

Rose, A. K., 2004. Do we really know that the WTO increases trade? American Economic

Review 94, 98–114.

Rose, A. K., 2006. The effect of membership in the GATT/WTO on trade: Where do we

stand? In: Drabek, Z. (Ed.), The WTO and Economic Welfare. forthcoming.

Rosenbaum, P. R., 2002. Observational studies, 2nd Edition. Springer.

35

Rubin, D. B., Thomas, N., 2000. Combining propensity score matching with additional ad-

justments for prognostic covariates. Journal of the American Statistical Association 95,

573–585.

Staiger, R. W., Tabellini, G., 1987. Discretionary trade policy and excessive protection.

American Economic Review 77 (5), 823–837.

Staiger, R. W., Tabellini, G., 1989. Rules and discretion in trade policy. European Economic

Review 33 (6), 1265–1277.

Staiger, R. W., Tabellini, G., 1999. Do GATT rules help governments make domestic com-

mitments? Economics & Politics 11 (2), 109–144.

Subramanian, A., Wei, S.-J., 2007. The WTO promotes trade, strongly but unevenly. Journal

of International Economics 72 (1), 151–175.

Tomz, M., Goldstein, J., Rivers, D., 2007. Do we really know that the WTO increases trade?

comment. American Economic Review 97 (5), 2005–2018.

Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometrics 1, 80–83.

36

Table 1: Rose (2004) data set – unrestricted matchingpermutation test signed-rank test sensitivity analysis

one-sided test two-sided testcaliper effect p-value 95% CI effect p-value 95% CI Γ∗ as in Γ∗ as in

‘Both in GATT/WTO’ treatment effecton the treated (M1 = 114, 750):100% 1.328 0.000 [1.307, 1.349] 1.332 0.000 [1.312, 1.351] 2.434 R+ 2.428 R+

80% 1.075 0.000 [1.052, 1.098] 1.075 0.000 [1.053, 1.096] 2.086 R+ 2.081 R+

60% 0.836 0.000 [0.810, 0.862] 0.835 0.000 [0.810, 0.859] 1.780 R+ 1.775 R+

40% 0.553 0.000 [0.522, 0.584] 0.535 0.000 [0.507, 0.563] 1.472 R+ 1.467 R+

on the untreated (M0 = 21, 037):100% 0.337 0.000 [0.296, 0.379] 0.303 0.000 [0.266, 0.342] 1.250 R+ 1.243 R+

80% 0.239 0.000 [0.192, 0.286] 0.200 0.000 [0.157, 0.241] 1.144 R+ 1.138 R+

60% 0.185 0.000 [0.131, 0.239] 0.138 0.000 [0.090, 0.187] 1.084 R+ 1.077 R+

40% 0.304 0.000 [0.239, 0.368] 0.243 0.000 [0.184, 0.301] 1.177 R+ 1.167 R+

on all (M1 + M0 = 135, 787):100% 1.175 0.000 [1.156, 1.193] 1.161 0.000 [1.143, 1.179] 2.209 R+ 2.205 R+

80% 0.899 0.000 [0.878, 0.919] 0.883 0.000 [0.863, 0.902] 1.858 R+ 1.854 R+

60% 0.636 0.000 [0.613, 0.659] 0.619 0.000 [0.597, 0.640] 1.559 R+ 1.555 R+

40% 0.428 0.000 [0.400, 0.455] 0.399 0.000 [0.374, 0.424] 1.342 R+ 1.338 R+

‘One in GATT/WTO’ treatment effecton the treated (M1 = 98, 810):100% 0.767 0.000 [0.746, 0.789] 0.773 0.000 [0.753, 0.792] 1.759 R+ 1.755 R+

80% 0.564 0.000 [0.540, 0.588] 0.568 0.000 [0.547, 0.589] 1.525 R+ 1.521 R+

60% 0.422 0.000 [0.396, 0.449] 0.428 0.000 [0.405, 0.451] 1.397 R+ 1.393 R+

40% 0.326 0.000 [0.296, 0.357] 0.325 0.000 [0.298, 0.351] 1.294 R+ 1.289 R+

on the untreated (M0 = 21, 037):100% 0.030 0.068 [-0.009, 0.069] 0.034 0.022 [0.000, 0.068] 1.006 R+ 1.001 R+

80% 0.092 0.000 [0.048, 0.135] 0.089 0.000 [0.052, 0.126] 1.057 R+ 1.051 R+

60% 0.078 0.001 [0.028, 0.129] 0.084 0.000 [0.041, 0.127] 1.046 R+ 1.039 R+

40% 0.138 0.000 [0.076, 0.201] 0.149 0.000 [0.096, 0.203] 1.102 R+ 1.094 R+

on all (M1 + M0 = 119, 847):100% 0.638 0.000 [0.619, 0.657] 0.632 0.000 [0.615, 0.649] 1.610 R+ 1.607 R+

80% 0.443 0.000 [0.422, 0.464] 0.437 0.000 [0.418, 0.455] 1.401 R+ 1.397 R+

60% 0.324 0.000 [0.301, 0.347] 0.321 0.000 [0.301, 0.340] 1.297 R+ 1.293 R+

40% 0.225 0.000 [0.198, 0.253] 0.220 0.000 [0.197, 0.243] 1.194 R+ 1.190 R+

GSP treatment effecton the treated (M1 = 54, 285):100% 0.851 0.000 [0.831, 0.871] 0.792 0.000 [0.774, 0.811] 2.277 R+ 2.269 R+

80% 0.757 0.000 [0.736, 0.778] 0.696 0.000 [0.676, 0.716] 2.125 R+ 2.117 R+

60% 0.693 0.000 [0.668, 0.717] 0.627 0.000 [0.604, 0.649] 1.998 R+ 1.990 R+

40% 0.665 0.000 [0.635, 0.696] 0.581 0.000 [0.553, 0.608] 1.879 R+ 1.869 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status; nofurther restriction is imposed. The number of matched pairs obtained is indicated by M1 for the effect on the treated,and M0 for the effect on the untreated.2. The data set of Rose (2004) is used; the set of conditioning covariates x used in matching are as explained in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

37

Table 2: Rose (2004) data set – matching within dyadpermutation test signed-rank test sensitivity analysis



80% 0.760 0.000 [0.727, 0.792] 0.814 0.000 [0.785, 0.844] 2.559 R+ 2.543 R+

60% 0.833 0.000 [0.797, 0.870] 0.876 0.000 [0.843, 0.910] 2.792 R+ 2.771 R+

40% 0.796 0.000 [0.750, 0.843] 0.816 0.000 [0.773, 0.858] 2.526 R+ 2.503 R+


80% 1.117 0.000 [1.067, 1.167] 1.161 0.000 [1.115, 1.207] 3.475 R+ 3.440 R+

60% 0.989 0.000 [0.931, 1.048] 1.019 0.000 [0.967, 1.071] 3.017 R+ 2.983 R+

40% 0.847 0.000 [0.777, 0.917] 0.840 0.000 [0.781, 0.899] 2.635 R+ 2.600 R+

on all (M1 + M0 = 29, 270):100% 1.058 0.000 [1.033, 1.082] 1.110 0.000 [1.087, 1.132] 3.515 R+ 3.496 R+

80% 0.895 0.000 [0.868, 0.923] 0.943 0.000 [0.918, 0.967] 2.927 R+ 2.911 R+

60% 0.935 0.000 [0.903, 0.967] 0.969 0.000 [0.940, 0.998] 3.021 R+ 3.002 R+

40% 0.873 0.000 [0.833, 0.913] 0.888 0.000 [0.852, 0.923] 2.700 R+ 2.679 R+


80% 0.403 0.000 [0.374, 0.432] 0.470 0.000 [0.444, 0.495] 1.782 R+ 1.772 R+

60% 0.371 0.000 [0.335, 0.407] 0.460 0.000 [0.428, 0.491] 1.666 R+ 1.656 R+

40% 0.314 0.000 [0.269, 0.359] 0.391 0.000 [0.351, 0.430] 1.520 R+ 1.508 R+


80% 0.463 0.000 [0.423, 0.503] 0.525 0.000 [0.492, 0.559] 1.818 R+ 1.805 R+

60% 0.386 0.000 [0.339, 0.432] 0.430 0.000 [0.390, 0.469] 1.610 R+ 1.597 R+

40% 0.317 0.000 [0.262, 0.372] 0.352 0.000 [0.306, 0.398] 1.479 R+ 1.465 R+

on all (M1 + M0 = 38, 645):100% 0.509 0.000 [0.488, 0.529] 0.574 0.000 [0.556, 0.592] 2.023 R+ 2.016 R+

80% 0.428 0.000 [0.404, 0.452] 0.492 0.000 [0.472, 0.513] 1.816 R+ 1.808 R+

60% 0.403 0.000 [0.374, 0.432] 0.478 0.000 [0.453, 0.503] 1.706 R+ 1.698 R+

40% 0.291 0.000 [0.256, 0.326] 0.339 0.000 [0.310, 0.368] 1.469 R+ 1.460 R+


80% 0.492 0.000 [0.479, 0.506] 0.489 0.000 [0.478, 0.501] 2.504 R+ 2.494 R+

60% 0.379 0.000 [0.363, 0.395] 0.371 0.000 [0.357, 0.385] 1.945 R+ 1.937 R+

40% 0.271 0.000 [0.250, 0.292] 0.259 0.000 [0.241, 0.277] 1.536 R+ 1.528 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status andfrom the same dyad; observations without a match are discarded. The number of matched pairs obtained is indicatedby M1 for the effect on the treated, and M0 for the effect on the untreated.2. The data set of Rose (2004) is used; the set of conditioning covariates x used in matching are as explained in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

38

Table 3: Rose (2004) data set – matching within yearpermutation test signed-rank test sensitivity analysis



80% 1.075 0.000 [1.052, 1.098] 1.075 0.000 [1.053, 1.096] 2.086 R+ 2.081 R+

60% 0.836 0.000 [0.810, 0.862] 0.835 0.000 [0.810, 0.859] 1.780 R+ 1.775 R+

40% 0.553 0.000 [0.522, 0.584] 0.535 0.000 [0.507, 0.563] 1.472 R+ 1.467 R+


80% 0.239 0.000 [0.192, 0.286] 0.200 0.000 [0.157, 0.241] 1.144 R+ 1.138 R+

60% 0.185 0.000 [0.131, 0.239] 0.138 0.000 [0.090, 0.187] 1.084 R+ 1.077 R+

40% 0.304 0.000 [0.239, 0.368] 0.243 0.000 [0.184, 0.301] 1.177 R+ 1.167 R+

on all (M1 + M0 = 135, 787):100% 1.176 0.000 [1.157, 1.194] 1.164 0.000 [1.146, 1.181] 2.209 R+ 2.205 R+

80% 0.899 0.000 [0.878, 0.919] 0.883 0.000 [0.863, 0.902] 1.858 R+ 1.854 R+

60% 0.636 0.000 [0.613, 0.659] 0.619 0.000 [0.597, 0.640] 1.559 R+ 1.555 R+

40% 0.428 0.000 [0.400, 0.455] 0.399 0.000 [0.374, 0.424] 1.342 R+ 1.338 R+


80% 0.564 0.000 [0.540, 0.588] 0.568 0.000 [0.547, 0.589] 1.525 R+ 1.521 R+

60% 0.422 0.000 [0.396, 0.449] 0.428 0.000 [0.405, 0.451] 1.397 R+ 1.393 R+

40% 0.326 0.000 [0.296, 0.357] 0.325 0.000 [0.298, 0.351] 1.294 R+ 1.289 R+


80% 0.092 0.000 [0.048, 0.135] 0.089 0.000 [0.052, 0.126] 1.057 R+ 1.051 R+

60% 0.078 0.001 [0.028, 0.129] 0.084 0.000 [0.041, 0.127] 1.046 R+ 1.039 R+

40% 0.138 0.000 [0.076, 0.201] 0.149 0.000 [0.096, 0.203] 1.102 R+ 1.094 R+

on all (M1 + M0 = 119, 847):100% 0.633 0.000 [0.614, 0.652] 0.628 0.000 [0.611, 0.645] 1.605 R+ 1.601 R+

80% 0.443 0.000 [0.422, 0.464] 0.437 0.000 [0.418, 0.455] 1.401 R+ 1.397 R+

60% 0.324 0.000 [0.301, 0.347] 0.321 0.000 [0.301, 0.340] 1.297 R+ 1.293 R+

40% 0.225 0.000 [0.198, 0.253] 0.220 0.000 [0.197, 0.243] 1.194 R+ 1.190 R+


80% 0.757 0.000 [0.736, 0.778] 0.696 0.000 [0.676, 0.716] 2.125 R+ 2.117 R+

60% 0.693 0.000 [0.668, 0.717] 0.627 0.000 [0.604, 0.649] 1.998 R+ 1.990 R+

40% 0.665 0.000 [0.635, 0.696] 0.581 0.000 [0.553, 0.608] 1.879 R+ 1.869 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status andfrom the same year; observations without a match are discarded. The number of matched pairs obtained is indicatedby M1 for the effect on the treated, and M0 for the effect on the untreated.2. The data set of Rose (2004) is used; the set of conditioning covariates x used in matching are as explained in Section 3,with the year dummies dropped from the list.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

39

Table 4: Rose (2004) data set – matching within periodpermutation test signed-rank test sensitivity analysis



80% 1.075 0.000 [1.052, 1.098] 1.075 0.000 [1.053, 1.096] 2.086 R+ 2.081 R+

60% 0.836 0.000 [0.810, 0.862] 0.835 0.000 [0.810, 0.859] 1.780 R+ 1.775 R+

40% 0.553 0.000 [0.522, 0.584] 0.535 0.000 [0.507, 0.563] 1.472 R+ 1.467 R+


80% 0.239 0.000 [0.192, 0.286] 0.200 0.000 [0.157, 0.241] 1.144 R+ 1.138 R+

60% 0.185 0.000 [0.131, 0.239] 0.138 0.000 [0.090, 0.187] 1.084 R+ 1.077 R+

40% 0.304 0.000 [0.239, 0.368] 0.243 0.000 [0.184, 0.301] 1.177 R+ 1.167 R+

on all (M1 + M0 = 135, 787):100% 1.177 0.000 [1.158, 1.196] 1.165 0.000 [1.147, 1.183] 2.213 R+ 2.208 R+

80% 0.899 0.000 [0.878, 0.919] 0.883 0.000 [0.863, 0.902] 1.858 R+ 1.854 R+

60% 0.636 0.000 [0.613, 0.659] 0.619 0.000 [0.597, 0.640] 1.559 R+ 1.555 R+

40% 0.428 0.000 [0.400, 0.455] 0.399 0.000 [0.374, 0.424] 1.342 R+ 1.338 R+


80% 0.564 0.000 [0.540, 0.588] 0.568 0.000 [0.547, 0.589] 1.525 R+ 1.521 R+

60% 0.422 0.000 [0.396, 0.449] 0.428 0.000 [0.405, 0.451] 1.397 R+ 1.393 R+

40% 0.326 0.000 [0.296, 0.357] 0.325 0.000 [0.298, 0.351] 1.294 R+ 1.289 R+


80% 0.092 0.000 [0.048, 0.135] 0.089 0.000 [0.052, 0.126] 1.057 R+ 1.051 R+

60% 0.078 0.001 [0.028, 0.129] 0.084 0.000 [0.041, 0.127] 1.046 R+ 1.039 R+

40% 0.138 0.000 [0.076, 0.201] 0.149 0.000 [0.096, 0.203] 1.102 R+ 1.094 R+

on all (M1 + M0 = 119, 847):100% 0.634 0.000 [0.615, 0.653] 0.629 0.000 [0.612, 0.646] 1.606 R+ 1.603 R+

80% 0.443 0.000 [0.422, 0.464] 0.437 0.000 [0.418, 0.455] 1.401 R+ 1.397 R+

60% 0.324 0.000 [0.301, 0.347] 0.321 0.000 [0.301, 0.340] 1.297 R+ 1.293 R+

40% 0.225 0.000 [0.198, 0.253] 0.220 0.000 [0.197, 0.243] 1.194 R+ 1.190 R+


80% 0.757 0.000 [0.736, 0.778] 0.696 0.000 [0.676, 0.716] 2.125 R+ 2.117 R+

60% 0.693 0.000 [0.668, 0.717] 0.627 0.000 [0.604, 0.649] 1.998 R+ 1.990 R+

40% 0.665 0.000 [0.635, 0.696] 0.581 0.000 [0.553, 0.608] 1.879 R+ 1.869 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment statusand from the same period, where the periods are: 1948 (Before Annecy round), 1949-1951 (Annecy to Torquay round),1952-1956 (Torquay to Geneva round), 1957-1961 (Geneva to Dillon round), 1962-1967 (Dillon to Kennedy round), 1968-1979 (Kennedy to Tokyo round), 1980-1994 (Tokyo to Uruguay round), and 1995-(After Uruguay round). Observationswithout a match are discarded. The number of matched pairs obtained is indicated by M1 for the effect on the treated,and M0 for the effect on the untreated.2. The data set of Rose (2004) is used; the set of conditioning covariates x used in matching are as explained in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

40

Table 5: Rose (2004) data set – matching within relative development stagepermutation test signed-rank test sensitivity analysis



80% 0.778 0.000 [0.753, 0.802] 0.734 0.000 [0.711, 0.757] 1.605 R+ 1.601 R+

60% 0.541 0.000 [0.514, 0.569] 0.504 0.000 [0.478, 0.530] 1.389 R+ 1.385 R+

40% 0.393 0.000 [0.359, 0.427] 0.364 0.000 [0.333, 0.395] 1.260 R+ 1.256 R+


80% 0.175 0.000 [0.128, 0.222] 0.131 0.000 [0.089, 0.173] 1.083 R+ 1.077 R+

60% 0.101 0.000 [0.047, 0.155] 0.059 0.008 [0.010, 0.107] 1.015 R+ 1.009 R+

40% 0.077 0.011 [0.011, 0.142] 0.036 0.110 [-0.021, 0.095] 1.011 R− 1.019 R−on all (M1 + M0 = 133, 972):100% 0.997 0.000 [0.977, 1.016] 0.955 0.000 [0.936, 0.973] 1.883 R+ 1.880 R+

80% 0.662 0.000 [0.640, 0.684] 0.612 0.000 [0.592, 0.633] 1.507 R+ 1.504 R+

60% 0.442 0.000 [0.418, 0.467] 0.402 0.000 [0.379, 0.424] 1.313 R+ 1.309 R+

40% 0.247 0.000 [0.217, 0.276] 0.208 0.000 [0.182, 0.235] 1.147 R+ 1.143 R+


80% 0.476 0.000 [0.452, 0.500] 0.460 0.000 [0.438, 0.481] 1.394 R+ 1.391 R+

60% 0.342 0.000 [0.315, 0.370] 0.324 0.000 [0.300, 0.347] 1.267 R+ 1.263 R+

40% 0.242 0.000 [0.211, 0.274] 0.239 0.000 [0.212, 0.266] 1.202 R+ 1.197 R+


80% 0.063 0.002 [0.020, 0.105] 0.051 0.003 [0.014, 0.087] 1.020 R+ 1.014 R+

60% 0.034 0.084 [-0.014, 0.083] 0.028 0.092 [-0.012, 0.070] 1.007 R− 1.013 R−40% 0.062 0.023 [0.001, 0.122] 0.048 0.037 [-0.003, 0.100] 1.003 R+ 1.004 R−on all (M1 + M0 = 119, 376):100% 0.544 0.000 [0.524, 0.563] 0.512 0.000 [0.495, 0.530] 1.455 R+ 1.452 R+

80% 0.391 0.000 [0.369, 0.412] 0.369 0.000 [0.350, 0.388] 1.320 R+ 1.316 R+

60% 0.215 0.000 [0.192, 0.239] 0.200 0.000 [0.179, 0.219] 1.164 R+ 1.161 R+

40% 0.175 0.000 [0.148, 0.202] 0.167 0.000 [0.144, 0.190] 1.140 R+ 1.137 R+


80% 0.588 0.000 [0.567, 0.608] 0.551 0.000 [0.531, 0.570] 1.813 R+ 1.807 R+

60% 0.507 0.000 [0.485, 0.530] 0.474 0.000 [0.452, 0.496] 1.706 R+ 1.699 R+

40% 0.410 0.000 [0.382, 0.437] 0.378 0.000 [0.352, 0.404] 1.538 R+ 1.530 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status andfrom the same relative development stage, where the combinations of relative development stages are: low-income/low-income dyads, low-income/middle-income dyads, low-income/high-income dyads, middle-income/middle-income dyads,middle-income/high-income dyads, and high-income/high-income dyads. Observations without a match are discarded.The number of matched pairs obtained is indicated by M1 for the effect on the treated, and M0 for the effect on theuntreated.2. The data set of Rose (2004) is used; the set of conditioning covariates x used in matching are as explained in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

41

Table 6: Tomz et al. (2007) data set – unrestricted matchingpermutation test signed-rank test sensitivity analysis


‘Both participating in GATT/WTO’ treatment effecton the treated (M1 = 152, 986):100% 1.418 0.000 [1.399, 1.438] 1.318 0.000 [1.300, 1.335] 2.431 R+ 2.426 R+

80% 1.260 0.000 [1.240, 1.281] 1.184 0.000 [1.165, 1.202] 2.289 R+ 2.284 R+

60% 1.089 0.000 [1.066, 1.113] 1.027 0.000 [1.006, 1.048] 2.063 R+ 2.058 R+

40% 0.762 0.000 [0.736, 0.789] 0.731 0.000 [0.707, 0.755] 1.712 R+ 1.706 R+


80% 0.343 0.000 [0.281, 0.406] 0.334 0.000 [0.276, 0.391] 1.290 R+ 1.280 R+

60% 0.361 0.000 [0.288, 0.434] 0.340 0.000 [0.274, 0.407] 1.288 R+ 1.275 R+

40% 0.404 0.000 [0.312, 0.496] 0.381 0.000 [0.299, 0.464] 1.316 R+ 1.301 R+

on all (M1 + M0 = 162, 689):100% 1.357 0.000 [1.339, 1.376] 1.256 0.000 [1.239, 1.272] 2.356 R+ 2.351 R+

80% 1.184 0.000 [1.165, 1.204] 1.107 0.000 [1.089, 1.125] 2.194 R+ 2.189 R+

60% 1.002 0.000 [0.980, 1.024] 0.938 0.000 [0.918, 0.957] 1.961 R+ 1.956 R+

40% 0.666 0.000 [0.641, 0.691] 0.640 0.000 [0.617, 0.663] 1.623 R+ 1.618 R+

‘One participating in GATT/WTO’ treatment effecton the treated (M1 = 71, 908):100% 0.818 0.000 [0.793, 0.844] 0.771 0.000 [0.749, 0.793] 1.782 R+ 1.777 R+

80% 0.631 0.000 [0.603, 0.658] 0.599 0.000 [0.575, 0.622] 1.585 R+ 1.580 R+

60% 0.444 0.000 [0.414, 0.473] 0.443 0.000 [0.418, 0.469] 1.428 R+ 1.423 R+

40% 0.304 0.000 [0.270, 0.338] 0.319 0.000 [0.289, 0.349] 1.300 R+ 1.295 R+

on the untreated (M0 = 9, 703):100% -0.040 0.078 [-0.096, 0.015] 0.014 0.278 [-0.031, 0.061] 1.025 R− 1.033 R−80% 0.006 0.419 [-0.054, 0.066] 0.026 0.157 [-0.023, 0.077] 1.017 R− 1.025 R−60% 0.028 0.215 [-0.041, 0.097] 0.055 0.037 [-0.004, 0.114] 1.004 R+ 1.005 R−40% 0.135 0.001 [0.049, 0.220] 0.171 0.000 [0.096, 0.246] 1.110 R+ 1.097 R+

on all (M1 + M0 = 81, 611):100% 0.716 0.000 [0.693, 0.740] 0.671 0.000 [0.651, 0.692] 1.672 R+ 1.668 R+

80% 0.523 0.000 [0.499, 0.548] 0.501 0.000 [0.480, 0.523] 1.488 R+ 1.484 R+

60% 0.349 0.000 [0.322, 0.375] 0.350 0.000 [0.328, 0.374] 1.339 R+ 1.335 R+

40% 0.206 0.000 [0.175, 0.236] 0.218 0.000 [0.191, 0.245] 1.197 R+ 1.192 R+


80% 0.726 0.000 [0.705, 0.747] 0.665 0.000 [0.646, 0.685] 2.072 R+ 2.065 R+

60% 0.667 0.000 [0.643, 0.691] 0.601 0.000 [0.579, 0.624] 1.952 R+ 1.944 R+

40% 0.621 0.000 [0.591, 0.651] 0.536 0.000 [0.508, 0.564] 1.791 R+ 1.782 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status; nofurther restriction is imposed. The number of matched pairs obtained is indicated by M1 for the effect on the treated,and M0 for the effect on the untreated.2. The data set of Tomz et al. (2007) is used; the set of conditioning covariates x used in matching are the same asthose listed in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

42

Table 7: Tomz et al. (2007) data set – matching within dyadpermutation test signed-rank test sensitivity analysis



80% 1.513 0.000 [1.467, 1.559] 1.520 0.000 [1.475, 1.565] 6.783 R+ 6.689 R+

60% 1.285 0.000 [1.232, 1.338] 1.281 0.000 [1.230, 1.333] 5.041 R+ 4.969 R+

40% 1.361 0.000 [1.294, 1.428] 1.355 0.000 [1.290, 1.420] 5.228 R+ 5.134 R+


80% 1.561 0.000 [1.496, 1.626] 1.547 0.000 [1.482, 1.613] 6.646 R+ 6.518 R+

60% 1.334 0.000 [1.259, 1.408] 1.291 0.000 [1.217, 1.365] 5.028 R+ 4.927 R+

40% 1.060 0.000 [0.971, 1.150] 0.974 0.000 [0.890, 1.059] 3.607 R+ 3.527 R+

on all (M1 + M0 = 12, 149):100% 1.618 0.000 [1.585, 1.652] 1.622 0.000 [1.589, 1.655] 8.036 R+ 7.950 R+

80% 1.536 0.000 [1.498, 1.574] 1.542 0.000 [1.504, 1.579] 6.760 R+ 6.684 R+

60% 1.417 0.000 [1.373, 1.460] 1.404 0.000 [1.361, 1.446] 5.945 R+ 5.872 R+

40% 1.315 0.000 [1.261, 1.369] 1.291 0.000 [1.238, 1.344] 5.065 R+ 4.992 R+


80% 0.716 0.000 [0.675, 0.757] 0.730 0.000 [0.692, 0.768] 2.413 R+ 2.393 R+

60% 0.738 0.000 [0.687, 0.789] 0.765 0.000 [0.718, 0.813] 2.301 R+ 2.279 R+

40% 0.546 0.000 [0.483, 0.608] 0.573 0.000 [0.516, 0.629] 1.860 R+ 1.840 R+


80% 0.633 0.000 [0.574, 0.692] 0.645 0.000 [0.591, 0.699] 2.014 R+ 1.993 R+

60% 0.473 0.000 [0.404, 0.542] 0.469 0.000 [0.407, 0.532] 1.624 R+ 1.604 R+

40% 0.396 0.000 [0.311, 0.482] 0.392 0.000 [0.317, 0.466] 1.477 R+ 1.456 R+

on all (M1 + M0 = 18, 185):100% 0.817 0.000 [0.787, 0.846] 0.836 0.000 [0.808, 0.863] 2.741 R+ 2.725 R+

80% 0.719 0.000 [0.685, 0.753] 0.736 0.000 [0.705, 0.767] 2.386 R+ 2.370 R+

60% 0.643 0.000 [0.602, 0.684] 0.660 0.000 [0.623, 0.698] 2.074 R+ 2.059 R+

40% 0.411 0.000 [0.360, 0.461] 0.415 0.000 [0.371, 0.460] 1.579 R+ 1.565 R+


80% 0.480 0.000 [0.467, 0.494] 0.482 0.000 [0.470, 0.494] 2.416 R+ 2.407 R+

60% 0.375 0.000 [0.358, 0.391] 0.371 0.000 [0.357, 0.386] 1.901 R+ 1.893 R+

40% 0.265 0.000 [0.243, 0.287] 0.257 0.000 [0.238, 0.275] 1.502 R+ 1.494 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status andfrom the same dyad; observations without a match are discarded. The number of matched pairs obtained is indicatedby M1 for the effect on the treated, and M0 for the effect on the untreated.2. The data set of Tomz et al. (2007) is used; the set of conditioning covariates x used in matching are the same asthose listed in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

43

Table 8: Tomz et al. (2007) data set – matching within yearpermutation test signed-rank test sensitivity analysis



80% 1.260 0.000 [1.240, 1.281] 1.184 0.000 [1.165, 1.202] 2.289 R+ 2.284 R+

60% 1.089 0.000 [1.066, 1.113] 1.027 0.000 [1.006, 1.048] 2.063 R+ 2.058 R+

40% 0.762 0.000 [0.736, 0.789] 0.731 0.000 [0.707, 0.755] 1.712 R+ 1.706 R+


80% 0.343 0.000 [0.281, 0.406] 0.334 0.000 [0.276, 0.391] 1.290 R+ 1.280 R+

60% 0.361 0.000 [0.288, 0.434] 0.340 0.000 [0.274, 0.407] 1.288 R+ 1.275 R+

40% 0.404 0.000 [0.312, 0.496] 0.381 0.000 [0.299, 0.464] 1.316 R+ 1.301 R+

on all (M1 + M0 = 162, 689):100% 1.365 0.000 [1.347, 1.384] 1.265 0.000 [1.249, 1.282] 2.368 R+ 2.363 R+

80% 1.184 0.000 [1.165, 1.204] 1.107 0.000 [1.089, 1.125] 2.194 R+ 2.189 R+

60% 1.002 0.000 [0.980, 1.024] 0.938 0.000 [0.918, 0.957] 1.961 R+ 1.956 R+

40% 0.666 0.000 [0.641, 0.691] 0.640 0.000 [0.617, 0.663] 1.623 R+ 1.618 R+


80% 0.631 0.000 [0.603, 0.658] 0.599 0.000 [0.575, 0.622] 1.585 R+ 1.580 R+

60% 0.444 0.000 [0.414, 0.473] 0.443 0.000 [0.418, 0.469] 1.428 R+ 1.423 R+

40% 0.304 0.000 [0.270, 0.338] 0.319 0.000 [0.289, 0.349] 1.300 R+ 1.295 R+


on all (M1 + M0 = 81, 611):100% 0.720 0.000 [0.696, 0.743] 0.675 0.000 [0.654, 0.695] 1.676 R+ 1.672 R+

80% 0.523 0.000 [0.499, 0.548] 0.501 0.000 [0.480, 0.523] 1.488 R+ 1.484 R+

60% 0.349 0.000 [0.322, 0.375] 0.350 0.000 [0.328, 0.374] 1.339 R+ 1.335 R+

40% 0.206 0.000 [0.175, 0.236] 0.218 0.000 [0.191, 0.245] 1.197 R+ 1.192 R+


80% 0.726 0.000 [0.705, 0.747] 0.665 0.000 [0.646, 0.685] 2.072 R+ 2.065 R+

60% 0.667 0.000 [0.643, 0.691] 0.601 0.000 [0.579, 0.624] 1.952 R+ 1.944 R+

40% 0.621 0.000 [0.591, 0.651] 0.536 0.000 [0.508, 0.564] 1.791 R+ 1.782 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status andfrom the same year; observations without a match are discarded. The number of matched pairs obtained is indicatedby M1 for the effect on the treated, and M0 for the effect on the untreated.2. The data set of Tomz et al. (2007) is used; the set of conditioning covariates x used in matching are the same asthose listed in Section 3, with the year dummies dropped from the list.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

44

Table 9: Tomz et al. (2007) data set – matching within periodpermutation test signed-rank test sensitivity analysis



80% 1.260 0.000 [1.240, 1.281] 1.184 0.000 [1.165, 1.202] 2.289 R+ 2.284 R+

60% 1.089 0.000 [1.066, 1.113] 1.027 0.000 [1.006, 1.048] 2.063 R+ 2.058 R+

40% 0.762 0.000 [0.736, 0.789] 0.731 0.000 [0.707, 0.755] 1.712 R+ 1.706 R+


80% 0.343 0.000 [0.281, 0.406] 0.334 0.000 [0.276, 0.391] 1.290 R+ 1.280 R+

60% 0.361 0.000 [0.288, 0.434] 0.340 0.000 [0.274, 0.407] 1.288 R+ 1.275 R+

40% 0.404 0.000 [0.312, 0.496] 0.381 0.000 [0.299, 0.464] 1.316 R+ 1.301 R+

on all (M1 + M0 = 162, 689):100% 1.365 0.000 [1.346, 1.383] 1.264 0.000 [1.248, 1.281] 2.367 R+ 2.362 R+

80% 1.184 0.000 [1.165, 1.204] 1.107 0.000 [1.089, 1.125] 2.194 R+ 2.189 R+

60% 1.002 0.000 [0.980, 1.024] 0.938 0.000 [0.918, 0.957] 1.961 R+ 1.956 R+

40% 0.666 0.000 [0.641, 0.691] 0.640 0.000 [0.617, 0.663] 1.623 R+ 1.618 R+


80% 0.631 0.000 [0.603, 0.658] 0.599 0.000 [0.575, 0.622] 1.585 R+ 1.580 R+

60% 0.444 0.000 [0.414, 0.473] 0.443 0.000 [0.418, 0.469] 1.428 R+ 1.423 R+

40% 0.304 0.000 [0.270, 0.338] 0.319 0.000 [0.289, 0.349] 1.300 R+ 1.295 R+


on all (M1 + M0 = 81, 611):100% 0.718 0.000 [0.694, 0.741] 0.672 0.000 [0.652, 0.693] 1.673 R+ 1.669 R+

80% 0.523 0.000 [0.499, 0.548] 0.501 0.000 [0.480, 0.523] 1.488 R+ 1.484 R+

60% 0.349 0.000 [0.322, 0.375] 0.350 0.000 [0.328, 0.374] 1.339 R+ 1.335 R+

40% 0.206 0.000 [0.175, 0.236] 0.218 0.000 [0.191, 0.245] 1.197 R+ 1.192 R+


80% 0.726 0.000 [0.705, 0.747] 0.665 0.000 [0.646, 0.685] 2.072 R+ 2.065 R+

60% 0.667 0.000 [0.643, 0.691] 0.601 0.000 [0.579, 0.624] 1.952 R+ 1.944 R+

40% 0.621 0.000 [0.591, 0.651] 0.536 0.000 [0.508, 0.564] 1.791 R+ 1.782 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment statusand from the same period, where the periods are: 1948 (Before Annecy round), 1949-1951 (Annecy to Torquay round),1952-1956 (Torquay to Geneva round), 1957-1961 (Geneva to Dillon round), 1962-1967 (Dillon to Kennedy round), 1968-1979 (Kennedy to Tokyo round), 1980-1994 (Tokyo to Uruguay round), and 1995-(After Uruguay round). Observationswithout a match are discarded. The number of matched pairs obtained is indicated by M1 for the effect on the treated,and M0 for the effect on the untreated.2. The data set of Tomz et al. (2007) is used; the set of conditioning covariates x used in matching are the same asthose listed in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

45

Table 10: Tomz et al. (2007) data set – matching within relative development stagepermutation test signed-rank test sensitivity analysis



80% 0.710 0.000 [0.690, 0.729] 0.704 0.000 [0.686, 0.722] 1.629 R+ 1.626 R+

60% 0.515 0.000 [0.491, 0.539] 0.486 0.000 [0.465, 0.507] 1.385 R+ 1.382 R+

40% 0.461 0.000 [0.432, 0.491] 0.425 0.000 [0.399, 0.451] 1.328 R+ 1.324 R+


80% 0.169 0.000 [0.107, 0.231] 0.155 0.000 [0.099, 0.212] 1.102 R+ 1.093 R+

60% 0.168 0.000 [0.095, 0.241] 0.149 0.000 [0.085, 0.214] 1.090 R+ 1.079 R+

40% 0.200 0.000 [0.110, 0.289] 0.162 0.000 [0.081, 0.243] 1.088 R+ 1.075 R+

on all (M1 + M0 = 162, 689):100% 1.012 0.000 [0.995, 1.028] 1.024 0.000 [1.008, 1.040] 2.032 R+ 2.028 R+

80% 0.656 0.000 [0.637, 0.675] 0.646 0.000 [0.628, 0.663] 1.574 R+ 1.571 R+

60% 0.467 0.000 [0.444, 0.490] 0.434 0.000 [0.414, 0.454] 1.344 R+ 1.341 R+

40% 0.402 0.000 [0.374, 0.430] 0.365 0.000 [0.342, 0.390] 1.287 R+ 1.284 R+


80% 0.278 0.000 [0.251, 0.306] 0.300 0.000 [0.276, 0.324] 1.248 R+ 1.244 R+

60% 0.290 0.000 [0.259, 0.320] 0.292 0.000 [0.265, 0.318] 1.246 R+ 1.241 R+

40% 0.172 0.000 [0.137, 0.207] 0.194 0.000 [0.163, 0.224] 1.159 R+ 1.154 R+


80% 0.099 0.000 [0.042, 0.157] 0.081 0.001 [0.030, 0.131] 1.040 R+ 1.032 R+

60% 0.028 0.205 [-0.039, 0.096] 0.020 0.250 [-0.037, 0.080] 1.030 R− 1.040 R−40% 0.103 0.008 [0.019, 0.187] 0.085 0.013 [0.010, 0.159] 1.022 R+ 1.010 R+

on all (M1 + M0 = 81, 611):100% 0.420 0.000 [0.398, 0.441] 0.443 0.000 [0.423, 0.462] 1.416 R+ 1.412 R+

80% 0.242 0.000 [0.217, 0.267] 0.258 0.000 [0.236, 0.280] 1.216 R+ 1.213 R+

60% 0.238 0.000 [0.211, 0.266] 0.246 0.000 [0.222, 0.269] 1.214 R+ 1.209 R+

40% 0.172 0.000 [0.141, 0.203] 0.179 0.000 [0.153, 0.206] 1.157 R+ 1.152 R+


80% 0.569 0.000 [0.549, 0.590] 0.529 0.000 [0.510, 0.549] 1.793 R+ 1.786 R+

60% 0.489 0.000 [0.466, 0.511] 0.458 0.000 [0.437, 0.480] 1.686 R+ 1.679 R+

40% 0.401 0.000 [0.373, 0.428] 0.366 0.000 [0.341, 0.392] 1.517 R+ 1.510 R+

Note:1. The pool of potential matches for an observation are restricted to observations with the opposite treatment status andfrom the same relative development stage, where the combinations of relative development stages are: low-income/low-income dyads, low-income/middle-income dyads, low-income/high-income dyads, middle-income/middle-income dyads,middle-income/high-income dyads, and high-income/high-income dyads. Observations without a match are discarded.The number of matched pairs obtained is indicated by M1 for the effect on the treated, and M0 for the effect on theuntreated.2. The data set of Tomz et al. (2007) is used; the set of conditioning covariates x used in matching are the same asthose listed in Section 3.3. The caliper is set such that only 100%, 80%, 60%, or 40% of matched pairs are qualified for the estimation of thetreatment effect. For example, in the case of 60%, matched pairs with a distance (in terms of x) exceeding the upper60 percentile of all matched pairs are discarded.4. In the column ‘permutation test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Dstatistic; the p-value is obtained for the observed D statistic using the permutation test; the CI is obtained by invertingthe permutation test procedure.5. In the column ‘signed-rank test,’ the ‘effect’ sub-column reports the treatment effect estimate based on the Hodgesand Lehmann (1963) estimator; the p-value is obtained for the observed R statistic using the signed-rank test; the CIis obtained by inverting the signed-rank test procedure.6. In the column ‘sensitivity analysis,’ the sensitivity analysis is conducted for the above signed-rank test based on asignificance level of α = 0.05 in a one-sided or two-sided test. R+ or R− (as a function of the odds ratio Γ of takingthe treatment) indicates the relevant distribution in calculating the critical bound Γ∗ at which the conclusion of thesigned-rank test reverses.

46