ISE235 08 Supp 6 SPC AppSemi Statistical Methods for Semiconductor Manufcturing Encyclopedia EE 1999


STATISTICAL METHODS FOR SEMICONDUCTOR MANUFACTURING

DUANE S. BONING
Massachusetts Institute of Technology, Cambridge, MA

JERRY STEFANI AND STEPHANIE W. BUTLER
Texas Instruments, Dallas, TX

Semiconductor manufacturing increasingly depends on the use of statistical methods in order to develop and explore new technologies, characterize and optimize existing processes, make decisions about evolutionary changes to improve yields and product quality, and monitor and maintain well-functioning processes. Each of these activities revolves around data: Statistical methods are essential in the planning of experiments to efficiently gather relevant data, the construction and evaluation of models based on data, and decision-making using these models.

THE NATURE OF DATA IN SEMICONDUCTOR MANUFACTURING

An overview of the variety of data utilized in semiconductor manufacturing helps to set the context for application of the statistical methods to be discussed here. Most data in a typical semiconductor fabrication plant (or fab) consists of in-line and end-of-line data used for process control and material disposition. In-line data includes measurements made on wafer characteristics throughout the process (performed on test or monitor wafers, as well as product wafers), such as thicknesses of thin films after deposition, etch, or polish; critical dimensions (such as polysilicon linewidths) after lithography and etch steps; particle counts; sheet resistivities; etc. In addition to in-line wafer measurements, a large amount of data is collected on process and equipment parameters. For example, in plasma etch processes time series of the intensities of plasma emission are often gathered for use in detecting the endpoint (completion) of etch, and these endpoint traces (as well as numerous other parameters) are recorded. At the end-of-line, volumes of test data are taken on completed wafers. This includes both electrical test (or etest) data reporting parametric information about transistor and other device performance on die and scribe line structures, and sort or yield tests to identify which die on the wafer function or perform at various specifications (e.g., binning by clock speed). Such data, complemented by additional sources (e.g., fab facility environmental data, consumables or vendor data, offline equipment data), are all crucial not only for day-to-day monitoring and control of the fab, but also to investigate improvements, characterize the effects of uncontrolled changes (e.g., a change in wafer supply parameters), and to redesign or optimize process steps.

For correct application of statistical methods, one must recognize the idiosyncrasies of such measurements arising in semiconductor manufacturing. In particular, the usual assumptions in statistical modeling of independence and identicality of measurements often do not hold, or only hold under careful handling or aggregation of the data. Consider the measurement of channel length (polysilicon critical dimension or CD) for a particular size transistor. First, the position of the transistor within the chip may affect the channel length: Nominally identical structures may indirectly be affected by pattern dependencies or other unintentional factors, giving rise to within-die variation. If one wishes to compare CD across multiple die on the wafer (e.g., to measure die-to-die variation), one must thus be careful to measure the same structure on each die to be compared. This same theme of "blocking" against extraneous factors to study the variation of concern is repeated at multiple length and time scales. For example, we might erroneously assume that every die on the wafer is identical, so that any random sample of, say, five die across the wafer can be averaged to give an accurate value for CD. Because of systematic wafer-level variation (e.g., due to large scale equipment nonuniformities), however, each measurement of CD is likely to also be a function of the position of the die on the wafer. The average of these five positions can still be used as an aggregate indicator for the process, but care must be taken that the same five positions are always measured, and this average value cannot be taken as indicative of any one measurement.

Continuing with this example, one might also take five-die averages of CD for every wafer in a 25 wafer lot. Again, one must be careful to recognize that systematic wafer-to-wafer variation sources may exist. For example, in batch polysilicon deposition processes, gas flow and temperature patterns within the furnace tube will give rise to systematic wafer position dependencies; thus the 25 different five-die averages are not independent. Indeed, if one constructs a control chart under the assumption of independence, the estimate of five-die average variance will be erroneously inflated (it will also include the systematic within-lot variation), and the chart will have control limits set too far apart to detect important control excursions. If one is careful to consistently measure die averages on the same wafers (e.g., the 6th, 12th, and 18th wafers) in each lot and aggregate these as a lot average, one can then construct a historical distribution of this aggregate observation (where an "observation" thus consists of several measurements) and correctly design a control chart. Great care must be taken to account for or guard against these and other systematic sources of variation which may give rise to lot-to-lot variation; these include tool dependencies, consumables or tooling (e.g., the specific reticle used in photolithography), as well as other unidentified time-dependencies.

    Encyclopedia of Electrical Engineering, J. G. Webster, Ed., Wiley, to appear, Feb. 1999.


Formal statistical assessment is an essential complement to engineering knowledge of known and suspected causal effects and systematic sources of variation. Engineering knowledge is crucial to specify data collection plans that ensure that statistical conclusions are defensible. If data is collected but care has not been taken to make fair comparisons, then the results will not be trusted no matter what statistical method is employed or amount of data collected. At the same time, correct application of statistical methods is also crucial to correct interpretation of the results. Consider a simple scenario: The deposition area in a production fab has been running a standard process recipe for several months, and has monitored defect counts on each wafer. An average of 12 defects per 8 in. wafer has been observed over that time. The deposition engineer believes that a change in the gasket (from a different supplier) on the deposition vacuum chamber can reduce the number of defects. The engineer makes the change and runs two lots of 25 wafers each with the change (observing an average of 10 defects per wafer), followed by two additional lots with the original gasket type (observing an average of 11 defects per wafer). Should the change be made permanently? A difference in output has been observed, but a key question remains: Is the change statistically significant? That is to say, considering the data collected and the system's characteristics, has a change really occurred or might the same results be explained by chance?
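One simple way to frame this question numerically is sketched below. This is not the article's own procedure; it assumes per-wafer defect counts are Poisson (so the variance of each 50-wafer average equals its mean divided by 50) and uses a normal approximation for the difference of the two averages.

```python
import math

# Hypothetical gasket comparison: 50 wafers with the new gasket averaged
# 10 defects/wafer; 50 wafers with the original gasket averaged 11.
n_new, mean_new = 50, 10.0
n_old, mean_old = 50, 11.0

# For Poisson counts the variance equals the mean, so each sample average
# has variance (mean / n); the difference of averages is roughly normal.
se_diff = math.sqrt(mean_new / n_new + mean_old / n_old)
z = (mean_old - mean_new) / se_diff

# Two-sided p-value from the unit normal tail.
p_value = math.erfc(abs(z) / math.sqrt(2))

significant = p_value < 0.05  # at the conventional 5% level
```

With these numbers the observed difference is not statistically significant at the 5% level, so chance alone could plausibly explain it; the hypothesis-testing machinery discussed below makes this reasoning precise.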

    Overview of Statistical Methods

Different statistical methods are used in semiconductor manufacturing to understand and answer questions such as that posed above and others, depending on the particular problem. Table 1 summarizes the most common methods and for what purpose they are used, and it serves as a brief outline of this article. The key engineering tasks involved in semiconductor manufacturing drawing on these methods are summarized in Table 2. In all cases, the issues of sampling plans and significance of the findings must be considered, and all sections will periodically address these issues. To highlight these concepts, note that in Table 1 the words "different," "same," "good," and "improve" are mentioned. These words tie together two critical issues: significance and engineering importance. When applying statistics to the data, one first determines if a statistically significant difference exists; otherwise the data (or experimental effects) are assumed to be the same. Thus, what is really tested is whether there is a difference that is big enough to find statistically. Engineering needs and principles determine whether that difference is "good," whether the difference actually matters, or whether the cost of switching to a different process will be offset by the estimated improvements. The size of the difference that can be statistically seen is determined by the sampling plans and the statistical method used. Consequently, engineering needs must enter into the design of the experiments (sampling plan) so that the statistical test will be able to see a difference of the appropriate size. Statistical methods provide the means to determine sampling and significance to meet the needs of the engineer and manager.

In the first section, we focus on the basic underlying issues of statistical distributions, paying particular attention to those distributions typically used to model aspects of semiconductor manufacturing, including the indispensable Gaussian distribution as well as binomial and Poisson distributions (which are key to modeling of defect and yield related effects). An example use of basic distributions is to estimate the interval of oxide thickness in which the engineer is confident that 99% of wafers will reside; based on this interval, the engineer could then decide if the process is meeting specifications and define limits or tolerances for chip design and performance modeling.
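A minimal sketch of such an interval calculation is shown below. The numbers are hypothetical, and it assumes oxide thickness is normal with known mean and standard deviation.

```python
from statistics import NormalDist

# Hypothetical oxide thickness population: mean 100, standard deviation 3.16.
thickness = NormalDist(mu=100, sigma=3.16)

# Central interval expected to contain 99% of wafer measurements.
lo = thickness.inv_cdf(0.005)
hi = thickness.inv_cdf(0.995)

# The engineer would compare (lo, hi) against the specification limits.
coverage = thickness.cdf(hi) - thickness.cdf(lo)  # 0.99 by construction
```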

In the second section, we review the fundamental tool of statistical inference, the hypothesis test, as summarized in Table 1. The hypothesis test is crucial in detecting

Table 1: Summary of Statistical Methods Typically Used in Semiconductor Manufacturing

Topic | Statistical Method | Purpose
1 | Statistical distributions | Basic material for statistical tests. Used to characterize a population based upon a sample.
2 | Hypothesis testing | Decide whether data under investigation indicates that elements of concern are the "same" or "different."
3 | Experimental design and analysis of variance | Determine significance of factors and models; decompose observed variation into constituent elements.
4 | Response surface modeling | Understand relationships, determine process margin, and optimize process.
5 | Categorical modeling | Use when result or response is discrete (such as "very rough," "rough," or "smooth"). Understand relationships, determine process margin, and optimize process.
6 | Statistical process control | Determine if system is operating as expected.


differences in a process. Examples include: determining if the critical dimensions produced by two machines are different; deciding if adding a clean step will decrease the variance of the critical dimension etch bias; determining if no appreciable increase in particles will occur if the interval between machine cleans is extended by 10,000 wafers; or deciding if increasing the target doping level will improve a device's threshold voltage.

In the third section, we expand upon hypothesis testing to consider the fundamentals of experimental design and analysis of variance, including the issue of sampling required to achieve the desired degree of confidence in the existence of an effect or difference, as well as accounting for the risk in not detecting a difference. Extensions beyond single factor experiments to the design of experiments which screen for effects due to several factors or their interactions are then discussed, and we describe the assessment of such experiments using formal analysis of variance methods. Examples of experimental design to enable decision-making abound in semiconductor manufacturing. For example, one might need to decide if adding an extra film or switching deposition methods will improve reliability, or decide which of three gas distribution plates (each with different hole patterns) provides the most uniform etch process.

In the fourth section, we examine the construction of response surface or regression models of responses as a function of one or more continuous factors. Of particular importance are methods to assess the goodness of fit and error in the model, which are essential to appropriate use of regression models in optimization or decision-making. Examples here include determining the optimal values for temperature and pressure to produce wafers with no more than 2% nonuniformity in gate oxide thickness, or determining if typical variations in plasma power will cause out-of-specification materials to be produced.

Finally, we note that statistical process control (SPC) for monitoring the normal or expected behavior of a process is a critical statistical method (1,2). The fundamentals of statistical distributions and hypothesis testing discussed here bear directly on SPC; further details on statistical process monitoring and process optimization can be found in SEMICONDUCTOR FACTORY CONTROL AND OPTIMIZATION.

    STATISTICAL DISTRIBUTIONS

Semiconductor technology development and manufacturing are often concerned with both continuous parameters (e.g., thin-film thicknesses, electrical performance parameters of transistors) and discrete parameters (e.g., defect counts and yield). In this section, we begin with a brief review of the fundamental probability distributions typically encountered in semiconductor manufacturing, as well as sampling distributions which arise when one calculates statistics based on multiple measurements (3). An understanding of these distributions is crucial to understanding hypothesis testing, analysis of variance, and other inferencing and statistical analysis methods discussed in later sections.

Descriptive Statistics

Descriptive statistics are often used to concisely present collections of data. Such descriptive statistics are based entirely (and only) on the available empirical data, and they do not assume any underlying probability model. Such descriptions include histograms (plots of relative frequency versus measured values or value ranges in some parameter), as well as calculation of the mean:

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i    (1)

(where n is the number of values observed), calculation of the median (the value in the middle of the data, with an equal number of observations below and above), and calculation of data percentiles. Descriptive statistics also include the sample variance and sample standard deviation:

    s_x^2 = \mathrm{Sample\ Var}\{x\} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2    (2)

    s_x = \mathrm{Sample\ std.\ dev.}\{x\} = \sqrt{s_x^2}    (3)

A drawback to such descriptive statistics is that they involve the study of observed data only and enable us to draw conclusions which relate only to that specific data. Powerful statistical methods, on the other hand, have come to be based instead on probability theory; this allows us to relate observations to some underlying probability model and thus make inferences about the population (the theoretical set of all possible observations) as well as the sample (those observations we have in hand). It is the use of these models that gives computed statistics (such as the mean) explanatory power.

Table 2: Engineering Tasks in Semiconductor Manufacturing and Statistical Methods Employed

Engineering Task | Statistical Method
Process and equipment control (e.g., CDs, thickness, deposition/removal rates, endpoint time, etest and sort data, yield) | Control chart for \bar{x} (for wafer or lot mean) and s (for within-wafer uniformity) based on the distribution of each parameter. Control limits and control rules are based on the distribution and time patterns in the parameter.
Equipment matching | ANOVA on \bar{x} and/or s over sets of equipment, fabs, or vendors.
Yield/performance modeling and prediction | Regression of end-of-line measurement as response against in-line measurements as factors.
Debug | ANOVA, categorical modeling, correlation to find most influential factors to explain the difference between good versus bad material.
Feasibility of process improvement | One-sample comparisons on limited data; single lot factorial experiment design; response surface models over a wider range of settings looking for opportunities for optimization, or to establish process window over a narrower range of settings.
Validation of process improvement | Two-sample comparison run over time to ensure a change is robust and works over multiple tools, and no subtle defects are incurred by change.
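These descriptive quantities map directly onto Python's standard `statistics` module, as the sketch below shows on hypothetical thickness data; `variance` and `stdev` use the n - 1 divisor of Eqs. (2) and (3).

```python
import statistics

# Hypothetical film-thickness measurements from nine wafers.
data = [101.2, 99.8, 100.5, 98.9, 100.1, 101.0, 99.5, 100.3, 99.7]

mean = statistics.fmean(data)     # Eq. (1)
median = statistics.median(data)  # middle of the sorted data
s2 = statistics.variance(data)    # Eq. (2), divisor n - 1
s = statistics.stdev(data)        # Eq. (3), square root of s2
```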

    Probability Model

Perhaps the simplest probability model of relevance in semiconductor manufacturing is the Bernoulli distribution. A Bernoulli trial is an experiment with two discrete outcomes: success or failure. We can model the a priori probability (based on historical data or theoretical knowledge) of a success simply as p. For example, we may have aggregate historical data that tells us that line yield is 95% (i.e., that 95% of product wafers inserted in the fab successfully emerge at the end of the line intact). We make the leap from this descriptive information to an assumption of an underlying probability model: we suggest that the probability of any one wafer making it through the line is equal to 0.95. Based on that probability model, we can predict an outcome for a new wafer which has not yet been processed and was not part of the original set of observations. Of course, the use of such probability models involves assumptions -- for example, that the fab and all factors affecting line yield are essentially the same for the new wafer as for those used in constructing the probability model. Indeed, we often find that each wafer is not independent of other wafers, and such assumptions must be examined carefully.

Normal (Gaussian) Distribution. In addition to discrete probability distributions, continuous distributions also play a crucial role in semiconductor manufacturing. Quite often, one is interested in the probability density function (or pdf) for some parametric value. The most important continuous distribution (in large part due to the central limit theorem) is the Gaussian or normal distribution. We can write that a random variable x is distributed as a normal distribution with mean \mu and variance \sigma^2 as x \sim N(\mu, \sigma^2). The probability density function for x is given by

    f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}    (4)

which is also often discussed in unit normal form through the normalization z = (x - \mu)/\sigma, so that z \sim N(0, 1):

    f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}    (5)

Given a continuous probability density function, one can talk about the probability of finding a value in some range. Consider the oxide thickness at the center of a wafer after rapid thermal oxidation in a single wafer system. If this oxide thickness is normally distributed with \mu = 100 and \sigma^2 = 10 (or a standard deviation of 3.16), then the probability of any one such measurement x falling between 105 and 120 can be determined as

    \Pr(105 \le x \le 120) = \int_{105}^{120} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
                           = \Pr(z_l \le z \le z_u) = \Phi(z_u) - \Phi(z_l)
                           = \Phi(6.325) - \Phi(1.581) = 0.0569    (6)

where \Phi(z) is the cumulative distribution function for the unit normal, which is available via tables or statistical analysis packages.
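This calculation can be checked numerically; the sketch below uses Python's standard library in place of a \Phi table.

```python
from statistics import NormalDist
import math

mu, sigma2 = 100, 10
sigma = math.sqrt(sigma2)  # about 3.162

# Pr(105 <= x <= 120) directly from the normal cdf...
p = NormalDist(mu, sigma).cdf(120) - NormalDist(mu, sigma).cdf(105)

# ...which equals Phi(z_u) - Phi(z_l) after normalizing to the unit normal.
z_l = (105 - mu) / sigma   # about 1.581
z_u = (120 - mu) / sigma   # about 6.325
p_unit = NormalDist().cdf(z_u) - NormalDist().cdf(z_l)
```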

We now briefly summarize other common discrete probability mass functions (pmfs) and continuous probability density functions (pdfs) that arise in semiconductor manufacturing, and then we turn to sampling distributions that are also important.

Binomial Distribution. Very often we are interested in the number of successes in repeated Bernoulli trials (that is, repeated succeed-or-fail trials). If x is the number of successes in n trials, then x is distributed as a binomial distribution, x \sim B(n, p), where p is the probability of each individual success. The pmf is given by



    f(x; p, n) = \binom{n}{x} p^x (1 - p)^{n-x}    (7)

where the binomial coefficient \binom{n}{x} is

    \binom{n}{x} = \frac{n!}{x!\,(n-x)!}    (8)

For example, if one is starting a 25-wafer lot in the fab above, one may wish to know the probability that some number x (x being between 0 and 25) of those wafers will survive. For the line yield model of p = 95%, these probabilities are shown in Fig. 1. When n is very large (much larger than 25), the binomial distribution is well approximated by a Gaussian distribution.
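The Fig. 1 probabilities can be reproduced directly from Eqs. (7) and (8); in the sketch below, `math.comb` supplies the binomial coefficient.

```python
import math

n, p = 25, 0.95  # 25-wafer lot, 95% line yield

def binom_pmf(x: int, n: int, p: float) -> float:
    """Eq. (7): probability that exactly x of n wafers survive."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

p_all_survive = pmf[25]  # probability the whole lot survives, about 0.28
total = sum(pmf)         # the pmf sums to 1 over x = 0..25
```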

A key assumption in using the Bernoulli distribution is that each wafer is independent of the other wafers in the lot. One often finds that many lots complete processing with all wafers intact, and other lots suffer multiple damaged wafers. While an aggregate line yield of 95% may result, independence does not hold, and alternative approaches such as formation of empirical distributions should be used.

Poisson Distribution. A third discrete distribution is highly relevant to semiconductor manufacturing. An approximation to the binomial distribution that applies when n is large and p is small is the Poisson distribution:

    f(x; \lambda) = \frac{e^{-\lambda} \lambda^x}{x!}    (9)

for integer x = 0, 1, 2, \ldots and \lambda = np.

For example, one can examine the number of chips that fail on the first day of operation: The probability p of failure for any one chip is (hopefully) exceedingly small, but one tests a very large number n of chips, so that the observed mean number of failed chips \mu = np is Poisson distributed. An even more common application of the Poisson model is in defect modeling. For example, if defects are Poisson-distributed with a mean defect count of \lambda = 3 particles per 200 mm wafer, one can ask questions about the probability of observing x [e.g., Eq. (9)] defects on a sample wafer. In this case, f(9; 3) = e^{-3} 3^9 / 9! = 0.0027, or less than 0.3% of the time would we expect to observe exactly 9 defects. Similarly, the probability that 9 or more defects are observed is 1 - \sum_{x=0}^{8} f(x; 3) = 0.0038.
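These Poisson figures are easy to verify with a short sketch applying Eq. (9) with \lambda = 3:

```python
import math

lam = 3.0  # mean defect count per wafer

def poisson_pmf(x: int, lam: float) -> float:
    """Eq. (9): probability of observing exactly x defects."""
    return math.exp(-lam) * lam**x / math.factorial(x)

p_exactly_9 = poisson_pmf(9, lam)                             # about 0.0027
p_9_or_more = 1 - sum(poisson_pmf(x, lam) for x in range(9))  # about 0.0038
```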

In the case of defect modeling, several other distributions have historically been used, including the exponential, hypergeometric, modified Poisson, and negative binomial. Substantial additional work has been reported in yield modeling to account for clustering, and to understand the relationship between defect models and yield (e.g., see Ref. 4).

    Population Versus Sample Statistics

    We now have the beginnings of a statistical inference theory.

Before proceeding with formal hypothesis testing in the next section, we first note that the earlier descriptive statistics of mean and variance take on new interpretations in the probabilistic framework. The mean is the expectation (or first moment) over the distribution:

    \mu_x = E\{x\} = \int_{-\infty}^{\infty} x f(x)\,dx = \sum_{i=1}^{n} x_i \Pr(x_i)    (10)

for continuous pdfs and discrete pmfs, respectively. Similarly, the variance is the expectation of the squared deviation from the mean (or the second central moment) over the distribution:

    \sigma_x^2 = \mathrm{Var}\{x\} = \int_{-\infty}^{\infty} (x - E\{x\})^2 f(x)\,dx = \sum_{i=1}^{n} (x_i - E\{x_i\})^2 \Pr(x_i)    (11)

Further definitions from probability theory are also highly useful, including the covariance:

    \sigma_{xy}^2 = \mathrm{Cov}\{x, y\} = E\{(x - E\{x\})(y - E\{y\})\} = E\{xy\} - E\{x\}E\{y\}    (12)

where x and y are each random variables with their own probability distributions, as well as the related correlation coefficient:

    \rho_{xy} = \mathrm{Corr}\{x, y\} = \frac{\mathrm{Cov}\{x, y\}}{\sqrt{\mathrm{Var}\{x\}\,\mathrm{Var}\{y\}}} = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y}    (13)

The above definitions for the mean, variance, covariance, and correlation all relate to the underlying or

Figure 1. Probabilities for number of wafers surviving from a lot of 25 wafers, assuming a line yield of 90%, calculated using the binomial distribution.



assumed population. When one only has a sample (that is, a finite number of values drawn from some population), one calculates the corresponding sample statistics. These are no longer descriptive of only the sample we have; rather, these statistics are now estimates of parameters in a probability model. Corresponding to the population parameters above, the sample mean \bar{x} is given by Eq. (1), the sample variance s_x^2 by Eq. (2), and the sample standard deviation s_x by Eq. (3); the sample covariance is given by

    s_{xy}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})    (14)

and the sample correlation coefficient is given by

    r_{xy} = \frac{s_{xy}}{s_x s_y}    (15)
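As a quick sketch of Eqs. (14) and (15) on a toy sample (the paired measurements are hypothetical):

```python
import math

# Hypothetical paired in-line measurements on five wafers.
x = [98.0, 101.0, 100.0, 102.0, 99.0]
y = [4.1, 5.0, 4.8, 5.2, 4.4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Eq. (14): sample covariance.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations via Eqs. (2)-(3).
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Eq. (15): sample correlation coefficient, always between -1 and 1.
r_xy = s_xy / (s_x * s_y)
```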

    Sampling Distributions

Sampling is the act of making inferences about populations based on some number of observations. Random sampling is especially important and desirable, where each observation is independent and identically distributed. A statistic is a function of sample data which contains no further unknowns (e.g., the sample mean can be calculated from the observations and has no further unknowns). It is important to note that a sample statistic is itself a random variable and has a sampling distribution which is usually different from the underlying population distribution. In order to reason about the likelihood of observing a particular statistic (e.g., the mean of five measurements), one must be able to construct the underlying sampling distribution.

Sampling distributions are also intimately bound up with estimation of population distribution parameters. For example, suppose we know that the thickness of gate oxide (at the center of the wafer) is normally distributed: T_i \sim N(\mu, \sigma^2) = N(100, 10). We sample five random wafers and compute the mean oxide thickness \bar{T} = \frac{1}{5}(T_1 + T_2 + \cdots + T_5). We now have two key questions: (1) What is the distribution of \bar{T}? (2) What is the probability that a \le \bar{T} \le b?

In this case, given the expression for \bar{T} above, we can use the fact that the variance of a scaled random variable ax is simply a^2 \mathrm{Var}\{x\}, and the variance of a sum of independent random variables is the sum of the variances:

    \bar{T} \sim N(\mu, \sigma^2/n), \quad \mu_{\bar{T}} = \mu_T = \mu    (16)

where \mu_{\bar{T}} = \mu by the definition of the mean. Thus, when we want to reason about the likelihood of observing values of \bar{T} (that is, averages of five sample measurements) lying within particular ranges, we must be sure to use the distribution for \bar{T} rather than that for the underlying distribution T. Thus, in this case the probability of finding a \bar{T} value between 105 and 120 is

    \Pr(105 \le \bar{T} \le 120) = \Pr\left(\frac{105 - \mu}{\sigma/\sqrt{n}} \le \frac{\bar{T} - \mu}{\sigma/\sqrt{n}} \le \frac{120 - \mu}{\sigma/\sqrt{n}}\right)
                                 = \Pr(z_l \le z \le z_u) = \Phi(14.142) - \Phi(3.536) = 0.0002    (17)

which is relatively unlikely. Compare this to the result from Eq. (6) for the probability 0.0569 of observing a single value (rather than a five-sample average) in the range 105 to 120.
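Numerically, the only change from the single-measurement calculation of Eq. (6) is the standard deviation \sigma/\sqrt{n}, as this sketch shows:

```python
from statistics import NormalDist
import math

mu, sigma2, n = 100, 10, 5
se = math.sqrt(sigma2 / n)  # sigma / sqrt(n) = sqrt(2)

# Eq. (17): probability that a five-wafer average lands in [105, 120].
p_avg = NormalDist(mu, se).cdf(120) - NormalDist(mu, se).cdf(105)

# Eq. (6): probability for a single measurement, for comparison.
sigma = math.sqrt(sigma2)
p_single = NormalDist(mu, sigma).cdf(120) - NormalDist(mu, sigma).cdf(105)
```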

Note that the same analysis would apply if T_i was itself formed as an average of underlying oxide thickness measurements (e.g., from a nine point pattern across the wafer), if that average and each of the five wafers used in calculating the sample average are independent and also distributed as N(100, 10).

Chi-Square Distribution and Variance Estimates. Several other vitally important distributions arise in sampling, and are essential to making statistical inferences in experimental design or regression. The first of these is the chi-square distribution. If x_i \sim N(0, 1) for i = 1, 2, \ldots, n and y = x_1^2 + x_2^2 + \cdots + x_n^2, then y is distributed as chi-square with n degrees of freedom, written as y \sim \chi_n^2. While formulas for the probability density function for \chi_n^2 exist, they are almost never used directly and are again instead tabulated or available via statistical packages. The typical use of the \chi^2 distribution is for finding the distribution of the variance when the mean is known.

Suppose we know that x_i \sim N(\mu, \sigma^2). As discussed previously, we know that the mean over our n observations is distributed as \bar{x} \sim N(\mu, \sigma^2/n). How is the sample variance s^2 over our n observations distributed? We note that each deviation (x_i - \bar{x}) is normally distributed; thus if we normalize our sample variance s^2 by \sigma^2 we have a chi-square distribution:



    \frac{(n-1)\, s^2}{\sigma^2} \sim \chi_{n-1}^2    (18)

where one degree of freedom is used in calculation of \bar{x}. Thus, the sample variance for n observations drawn from N(\mu, \sigma^2) is distributed as chi-square as shown in Eq. (18). For our oxide thickness example where T_i \sim N(100, 10) and we form five-wafer sample averages as above, then 4 s^2 / \sigma^2 \sim \chi_4^2, and 95% of the time the observed sample variance will be less than 1.78 \sigma^2. A control chart could thus be constructed to detect when unusually large five-wafer sample variances occur.
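A small Monte Carlo sketch illustrates Eq. (18) for the five-wafer case. The simulation parameters are arbitrary; with n = 5, the normalized quantity 4 s^2 / \sigma^2 should behave like a \chi_4^2 variable, whose expected value is 4.

```python
import random
import statistics

random.seed(0)
mu, sigma2, n = 100, 10, 5
sigma = sigma2 ** 0.5

# Draw many five-wafer samples and form (n-1) * s^2 / sigma^2 for each.
normalized = []
for _ in range(20000):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2 = statistics.variance(sample)  # sample variance, Eq. (2)
    normalized.append((n - 1) * s2 / sigma2)

# The average should be near the chi-square mean, n - 1 = 4.
mean_normalized = statistics.fmean(normalized)
```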

Student t Distribution. The Student t distribution is another important sampling distribution. The typical use is when we want to find the distribution of the sample mean when the true standard deviation is not known. Consider

    \frac{\bar{x} - \mu}{s / \sqrt{n}} \sim t_{n-1}    (19)

In the above, we have used the definition of the Student t distribution: if z is a random variable, z \sim N(0, 1), then z / \sqrt{y/k} is distributed as a Student t with k degrees of freedom, or t_k, if y is a random variable distributed as \chi_k^2. As discussed previously, the normalized sample variance is chi-square-distributed, so that our definition does indeed apply. We thus find that the normalized sample mean is distributed as a Student t with n-1 degrees of freedom when we do not know the true standard deviation and must estimate it based on the sample as well. We note that as n \to \infty, the Student t approaches a unit normal distribution N(0, 1).
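The heavier tails of the Student t for small n can be seen in a quick simulation sketch (the parameters are illustrative): the Eq. (19) statistic exceeds the unit-normal \pm 1.96 bounds noticeably more than 5% of the time.

```python
import random
import statistics
import math

random.seed(1)
mu, sigma, n = 0.0, 1.0, 5

trials = 20000
exceed = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    s = statistics.stdev(sample)               # estimated std. dev.
    t = (x_bar - mu) / (s / math.sqrt(n))      # Eq. (19) statistic
    if abs(t) > 1.96:                          # unit-normal 5% bounds
        exceed += 1

# Well above 0.05, because t with 4 dof has heavier tails than N(0, 1).
tail_fraction = exceed / trials
```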

F Distribution and Ratios of Variances. The last sampling distribution we wish to discuss here is the F distribution. We shall see that the F distribution is crucial in analysis of variance (ANOVA) and experimental design in determining the significance of effects, because the F distribution is concerned with the probability density function for the ratio of variances. If $y_1 \sim \chi^2_u$ (that is, $y_1$ is a random variable distributed as chi-square with $u$ degrees of freedom) and similarly $y_2 \sim \chi^2_v$, then the random variable $Y = (y_1/u)/(y_2/v) \sim F_{u,v}$ (that is, $Y$ is distributed as F with $u$ and $v$ degrees of freedom). The typical use of the F distribution is to compare the spread of two distributions. For example, suppose that we have two samples $x_1, x_2, \ldots, x_n$ and $w_1, w_2, \ldots, w_m$, where $x_i \sim N(\mu_x, \sigma_x^2)$ and $w_i \sim N(\mu_w, \sigma_w^2)$. Then

$$\frac{s_x^2/\sigma_x^2}{s_w^2/\sigma_w^2} \sim F_{n-1,\,m-1} \qquad (20)$$
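Equation (20) can be illustrated by simulation. With equal true variances, the ratio of two sample variances follows $F_{n-1,\,m-1}$, whose mean is $v/(v-2)$ for $v > 2$ denominator degrees of freedom (sample sizes below are illustrative):

```python
# Monte Carlo illustration of Eq. (20): for two normal samples with equal
# variances, the ratio of sample variances follows F(n-1, m-1).
# Check: E[F(u, v)] = v / (v - 2) for v > 2.
import random
from statistics import variance

random.seed(1)
n, m, trials = 10, 21, 20000          # u = 9, v = 20 degrees of freedom
ratios = []
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    w = [random.gauss(0.0, 1.0) for _ in range(m)]
    ratios.append(variance(x) / variance(w))

mean_ratio = sum(ratios) / trials
print(f"mean variance ratio ~= {mean_ratio:.3f}, theory {20 / 18:.3f}")
```

Note that the mean exceeds 1 even when the variances are identical; this is why significance is judged against F-distribution percentiles rather than against 1 directly.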

    Point and Interval Estimation

The above population and sampling distributions form the basis for the statistical inferences we wish to draw in many semiconductor manufacturing examples. One important use of the sampling distributions is to estimate population parameters based on some number of observations. A point estimate gives a single best estimated value for a population parameter. For example, the sample mean $\bar{x}$ is an estimate for the population mean $\mu$. Good point estimates are representative or unbiased (that is, the expected value of the estimate should be the true value), as well as minimum variance (that is, we desire the estimator with the smallest variance in that estimate). Often we restrict ourselves to linear estimators; for example, the best linear unbiased estimator (BLUE) for various parameters is typically used.

Many times we would like to determine a confidence interval for estimates of population parameters; that is, we want to know how likely it is that an estimate $\hat{\mu}$ is within some particular range of $\mu$. Asked another way, to a desired probability, where will $\mu$ actually lie given an estimate $\hat{\mu}$? Such interval estimation is considered next; these intervals are used in later sections to discuss hypothesis testing.

First, let us consider the confidence interval for estimation of the mean, when we know the true variance of the process. Given that we make $n$ independent and identically distributed samples from the population, we can calculate the sample mean as in Eq. (1). As discussed earlier, we know that the sample mean is normally distributed as shown in Eq. (16). We can thus determine the probability that an observed $\bar{x}$ is larger than $\mu$ by a given amount:

$$z = \frac{\bar{x}-\mu}{\sigma/\sqrt{n}}, \qquad \Pr(z > z_\alpha) = \alpha = 1 - \Phi(z_\alpha) \qquad (21)$$

where $z_\alpha$ is the alpha percentage point for the normalized variable $z$ (that is, $z_\alpha$ measures how many standard deviations greater than $\mu$ we must be in order for the


integrated probability density to the right of this value to equal $\alpha$). As shown in Fig. 2, we are usually interested in asking the question the other way around: to a given probability [$100(1-\alpha)\%$, e.g., 95%], in what range will the true mean lie given an observed $\bar{x}$?

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}, \qquad \text{i.e.,}\quad \mu = \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \qquad (22)$$

For our oxide thickness example, we can now answer an important question. If we calculate a five-wafer average, how far away from the average of 100 must the value be for us to decide that the process has changed away from the mean (given a process variance of 10 Å$^2$)? In this case, we must define the confidence in saying that the process has changed; this is equivalent to the probability of observing the deviation by chance. With 95% confidence, we can declare that the process is different than the mean of 100 if we observe $\bar{T} < 97.228$ or $\bar{T} > 102.772$:

$$\alpha = 2\Pr\!\left(z \geq z_{\alpha/2} = \frac{\Delta T}{\sigma/\sqrt{n}}\right) = 2\Pr\!\left(z \geq 1.95996\right) \;\Rightarrow\; \Delta T = 1.95996\,\frac{\sigma}{\sqrt{n}} = 2.7718 \qquad (23)$$
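The decision limits of Eq. (23) reduce to a two-line calculation:

```python
# Two-sided 95% decision limits for a five-wafer oxide thickness average,
# following Eqs. (22)-(23): mu = 100, sigma^2 = 10, n = 5.
import math

mu, var, n = 100.0, 10.0, 5
z_975 = 1.95996                      # 97.5th percentile of the unit normal
half_width = z_975 * math.sqrt(var / n)

lo, hi = mu - half_width, mu + half_width
print(f"declare a change if the average falls outside ({lo:.3f}, {hi:.3f})")
```

Running this reproduces the limits 97.228 and 102.772 quoted in the text.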

A similar result occurs when the true variance is not known. The $100(1-\alpha)\%$ confidence interval in this case is determined using the appropriate Student t sampling distribution for the sample mean:

$$\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \leq \mu \leq \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}, \qquad \text{i.e.,}\quad \mu = \bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \qquad (24)$$

In some cases, we may also desire a confidence interval on the estimate of variance:

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \leq \sigma^2 \leq \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}} \qquad (25)$$

For example, consider a new etch process which we want to characterize. We calculate the average etch rate (based on 21-point spatial pre- and post-thickness measurements across the wafer) for each of four wafers we process. The wafer averages are 502, 511, 515, and 504 Å/min. The sample mean is $\bar{x} = 508$ Å/min, and the sample standard deviation is $s = 6.1$ Å/min. The 95% confidence intervals for the true mean and standard deviation are $\mu = 508 \pm 9.6$ Å/min and $3.4 \leq \sigma \leq 22.6$ Å/min. Increasing the number of observations (wafers) can improve our estimates and intervals for the mean and variance, as discussed in the next section.
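The etch-rate intervals follow directly from Eqs. (24) and (25); the t and chi-square percentiles below are taken from standard tables for 3 degrees of freedom:

```python
# 95% confidence intervals for the four-wafer etch-rate example.
import math
from statistics import mean, stdev

rates = [502, 511, 515, 504]           # wafer-average etch rates, A/min
n = len(rates)
xbar, s = mean(rates), stdev(rates)

t_975_3 = 3.182                        # t percentile, 3 d.o.f.
chi2_975_3, chi2_025_3 = 9.348, 0.216  # chi-square percentiles, 3 d.o.f.

half = t_975_3 * s / math.sqrt(n)
sigma_lo = math.sqrt((n - 1) * s**2 / chi2_975_3)
sigma_hi = math.sqrt((n - 1) * s**2 / chi2_025_3)
print(f"mean: {xbar:.0f} +/- {half:.1f}; sigma in [{sigma_lo:.1f}, {sigma_hi:.1f}]")
```

Note how wide the interval on $\sigma$ is with only four wafers: the chi-square distribution is strongly skewed at low degrees of freedom.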

    Many other cases (e.g., one-sided confidence intervals)can also be determined based on manipulation of theappropriate sampling distributions, or through consultationwith more extensive texts (5).

    HYPOTHESIS TESTING

Given an underlying probability distribution, it now becomes possible to answer some simple, but very important, questions about any particular observation. In this section, we formalize the decision-making earlier applied to our oxide thickness example. Suppose as before that we know that oxide thickness is normally distributed, with a mean of 100 Å and a standard deviation of 3.162 Å. We may know this based on a very large number of previous historical measurements, so that we can well approximate the true population of oxide thicknesses out of a particular furnace with these two distribution parameters. We suspect something just changed in the equipment, and we want to determine if there has been an impact on oxide thickness. We take a new observation (i.e., run a new wafer and form our oxide thickness value as usual, perhaps as the average of nine measurements at fixed positions across the wafer). The key question is: What is the probability that we would get this observation if the process has not changed, versus the probability of getting this observation if the process has indeed changed?

We are conducting a hypothesis test. Based on the observation, we want to test the hypothesis (label this $H_1$) that the underlying distribution mean has increased from $\mu_0$ by some amount $\delta$ to $\mu_1$. The null hypothesis $H_0$ is that nothing has changed and the true mean is still $\mu_0$. We are looking for evidence to convince us that $H_1$ is true.

    We can plot the probability density function for each ofthe two hypotheses under the assumption that the variancehas not changed, as shown in Fig. 3. Suppose now that we

Figure 2. Probability density function for the sample mean. The sample mean is unbiased (the expected value of the sample mean is the true mean $\mu$). The shaded portion in the tail captures the probability that the sample mean is greater than a given distance from the true mean.


observe the value $x_i$. Intuitively, if the value of $x_i$ is closer to $\mu_0$ than to $\mu_1$, we will more than likely believe that the value comes from the $H_0$ distribution than the $H_1$ distribution. Under a maximum likelihood approach, we can compare the probability density functions:

$$f_1(x_i) > f_0(x_i) \;\Rightarrow\; \text{reject } H_0 \qquad (26)$$

that is, if $f_1(x_i)$ is greater than $f_0(x_i)$ for our observation, we reject the null hypothesis $H_0$ (that is, we accept the alternative hypothesis $H_1$). If we have prior belief (or other knowledge) affecting the a priori probabilities of $H_0$ and $H_1$, these can also be used to scale the distributions $f_0$ and $f_1$ to determine a posteriori probabilities for $H_0$ and $H_1$. Similarly, we can define the acceptance region as the set of values of $x$ for which we accept each respective hypothesis. In Fig. 3, we have the rule: accept $H_1$ if $x_i \geq x^*$, and accept $H_0$ if $x_i < x^*$.

The maximum likelihood approach provides a powerful but somewhat complicated means for testing hypotheses. In typical situations, a simpler approach suffices. We select a confidence $1-\alpha$ with which we must detect a difference, and we pick a decision point based on that confidence. For two-sided detection with a unit normal distribution, for example, we select $z > z_{\alpha/2}$ and $z < -z_{\alpha/2}$ as the regions for declaring that unusual behavior (i.e., a shift) has occurred.

Consider our historical data on oxide thickness, where $T_i \sim N(100, 10)$. To be 95% confident that a single new observation indicates a change in mean, we must observe $T < 93.8$ or $T > 106.2$. On the other hand, if we form five-wafer averages, then we must observe $\bar{T} < 97.228$ or $\bar{T} > 102.772$ as in Eq. (23) to convince us (to 95% confidence) that our process has changed.

    Alpha and Beta Risk (Type I and Type II Errors)

The hypothesis test gives a clear, unambiguous procedure for making a decision based on the distributions and assumptions outlined above. Unfortunately, there may be a substantial probability of making the wrong decision. In the maximum likelihood example of Fig. 3, for the single observation as drawn we accept the alternative hypothesis $H_1$. However, examining the distribution corresponding to $H_0$, we see that a nonzero probability exists that the observation actually belongs to $H_0$. Two types of errors are of concern:

$$\alpha = \Pr(\text{Type I error}) = \Pr(\text{reject } H_0 \mid H_0 \text{ is true}) = \int_{x^*}^{\infty} f_0(x)\,dx$$
$$\beta = \Pr(\text{Type II error}) = \Pr(\text{accept } H_0 \mid H_1 \text{ is true}) = \int_{-\infty}^{x^*} f_1(x)\,dx \qquad (27)$$

We note that the Type I error $\alpha$ (or probability of a false alarm) is based entirely on our decision rule and does not depend on the size of the shift we are seeking to detect. The Type II error $\beta$ (or probability of missing a real shift), on the other hand, depends strongly on the size of the shift. This Type II error can be evaluated for the distributions and decision rules above; for the normal distribution of Fig. 3, we find

$$\beta = \Phi\!\left(z_{\alpha/2} - \frac{\delta}{\sigma}\right) - \Phi\!\left(-z_{\alpha/2} - \frac{\delta}{\sigma}\right) \qquad (28)$$

where $\delta$ is the shift we wish to detect.

The power of a statistical test is defined as

$$\text{Power} = 1 - \beta = \Pr(\text{reject } H_0 \mid H_0 \text{ is false}) \qquad (29)$$

that is, the power of the test is the probability of correctly rejecting $H_0$. Thus the power depends on the shift we wish to detect as well as the level of false alarms ($\alpha$, or Type I error) we are willing to endure. Power curves are often plotted as $1-\beta$ versus the normalized shift to detect, $d = \delta/\sigma$, for a given fixed $\alpha$ (e.g., $\alpha = 0.05$), and as a function of sampling size.
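Equations (28) and (29) are easy to evaluate directly using the normal CDF (written here with `math.erf`), which is how curves like those in Fig. 4 are generated:

```python
# Type II error for detecting a mean shift of d standard deviations with an
# n-sample average and a two-sided test at alpha = 0.05 (Eqs. 28-29).
import math

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def beta_risk(d: float, n: int, z_half_alpha: float = 1.95996) -> float:
    shift = d * math.sqrt(n)   # shift in units of the sample-mean std dev
    return normal_cdf(z_half_alpha - shift) - normal_cdf(-z_half_alpha - shift)

# 0.5-sigma shift, five-wafer averages vs. 30-wafer averages:
for n in (5, 30):
    b = beta_risk(0.5, n)
    print(f"n = {n:2d}: beta = {b:.3f}, power = {1 - b:.3f}")
```

For a 0.5$\sigma$ shift, five-wafer averages miss the shift about 80% of the time, while 30-wafer averages bring $\beta$ down to roughly 0.2, matching the sampling-plan example that follows.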

Hypothesis Testing and Sampling Plans. In the previous discussion, we described a hypothesis test for detection of a shift in mean of a normal distribution, based on a single observation. In realistic situations, we can often make multiple observations and improve our ability to make a correct decision. If we take $n$ observations, then the sample

Figure 3. Distributions of $x$ under the null hypothesis $H_0$, and under the hypothesis $H_1$ that a positive shift $\delta$ has occurred such that $\mu_1 = \mu_0 + \delta$.


mean is normally distributed, but with a reduced variance $\sigma^2/n$, as illustrated by our five-wafer average of oxide thickness. It now becomes possible to pick an $\alpha$ risk associated with a decision on the sampling distribution that is acceptable, and then determine a sample size $n$ in order to achieve the desired level of $\beta$ risk. In the first step, we still select $z_{\alpha/2}$ based on the risk of false alarm $\alpha$, but now this determines the actual unnormalized decision point using $x^* = \mu + z_{\alpha/2}(\sigma/\sqrt{n})$. We finally pick $n$ (which determines $x^*$ as just defined) based on the Type II error, which also depends on the size of the normalized shift $d = \delta/\sigma$ to be detected and the sample size $n$. Graphs and tables are indispensable in selecting the sample size for a given $d$ and $\beta$; for example, Fig. 4 shows the Type II error associated with sampling from a unit normal distribution, for a fixed Type I error of 0.05, and as a function of sampling size $n$ and shift to be detected $d$.

Consider once more our wafer average oxide thickness $\bar{T}$. Suppose that we want to detect a 1.58 Å (or 0.5$\sigma$) shift from the 100 Å mean. We are willing to accept a false alarm 5% of the time ($\alpha = 0.05$) but can accept missing the shift 20% of the time ($\beta = 0.20$). From Fig. 4, we need to take $n = 30$ samples. Our decision rule is to declare that a shift has occurred if our 30-sample average $\bar{T} < 98.9$ or $\bar{T} > 101.1$. If it is even more important to detect the 0.5$\sigma$ shift in mean (e.g., with probability $\beta = 0.10$ of missing the shift), then from Fig. 4 we now need $n = 40$ samples, and our decision rule is to declare a shift if the 40-sample average $\bar{T} < 99.0$ or $\bar{T} > 101.0$.

Control Chart Application. The concepts of hypothesis testing, together with issues of Type I and Type II error as well as sample size determination, have one of their most common applications in the design and use of control charts. For example, the $\bar{x}$ control chart can be used to detect when a significant change from normal operation occurs. The assumption here is that when the underlying process is operating under control, the fundamental process population is distributed normally as $x_i \sim N(\mu, \sigma^2)$. We periodically draw a sample of $n$ observations and calculate the average of those observations ($\bar{x}$). In the $\bar{x}$ control chart, we essentially perform a continuous hypothesis test, where we set the upper and lower control limits (UCLs and LCLs) such that $\bar{x}$ falls outside these control limits with probability $\alpha$ when the process is truly under control (e.g., we usually select $\pm 3\sigma$ limits to give a 0.27% chance of false alarm). We would then choose the sample size $n$ so as to have a particular power (that is, a probability of actually detecting a shift of a given size) as previously described. The control limits (CLs) would then be set at

$$\text{CL} = \mu \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \qquad (30)$$

Note that care must be taken to check key assumptions in setting sample sizes and control limits. In particular, one should verify that each of the $n$ observations to be utilized in forming the sample average (or the sample standard deviation in an s chart) is independent and identically distributed. If, for example, we aggregate or take the average of $n$ successive wafers, we must be sure that there are no systematic within-lot sources of variation. If such additional variation sources do exist, then the estimate of variance formed during process characterization (e.g., in our etch example) will be inflated by accidental inclusion of this systematic contribution, and any control limits one sets will be wider than appropriate to detect real changes in the underlying random distribution. If systematic variation does indeed exist, then a new statistic should be formed that blocks against that variation (e.g., by specifying which wafers should be measured out of each lot), and the control chart based on the distribution of that aggregate statistic.
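A minimal $\bar{x}$ chart for the oxide thickness process, using the conventional $\pm 3\sigma$ limits of Eq. (30) with five-wafer samples, is just:

```python
# Shewhart x-bar chart limits for the oxide thickness process:
# mu = 100, sigma = sqrt(10), five-wafer samples, +/- 3-sigma limits.
import math

mu, sigma, n = 100.0, math.sqrt(10.0), 5
ucl = mu + 3 * sigma / math.sqrt(n)
lcl = mu - 3 * sigma / math.sqrt(n)
print(f"LCL = {lcl:.2f}, center = {mu:.1f}, UCL = {ucl:.2f}")

def out_of_control(xbar: float) -> bool:
    """Flag a five-wafer average falling outside the control limits."""
    return xbar < lcl or xbar > ucl

print(out_of_control(102.0), out_of_control(105.0))
```

With these limits (95.76 and 104.24), an average of 102.0 is left alone while 105.0 triggers an alarm; the 0.27% false-alarm rate comes from the $\pm 3$ choice of $z_{\alpha/2}$.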

    EXPERIMENTAL DESIGN AND ANOVA

    In this section, we consider application and further

    development of the basic statistical methods alreadydiscussed to the problem of designing experiments toinvestigate particular effects, and to aid in the construction ofmodels for use in process understanding and optimization.First, we should recognize that the hypothesis testingmethods are precisely those needed to determine if a newtreatment induces a significant effect in comparison to aprocess with a known distribution. These are known as one-sample tests. In this section, we next consider two-sample

Figure 4. Probability of Type II error ($\beta$) for a unit normal process with Type I error fixed at $\alpha = 0.05$, as a function of sample size $n$ (curves for $n$ = 2, 3, 5, 10, 20, 30, and 40) and shift delta (equal to the normalized shift $d = \delta/\sigma$) to be detected.


tests to compare two treatments in an effort to detect a treatment effect. We will then extend to the analysis of experiments in which many treatments are to be compared (k-sample tests), and we present the classic tool for studying the results --- ANOVA (6).

    Comparison of Treatments: Two-Sample Tests

Consider an example where a new process B is to be compared against the process of record (POR), process A. In the simplest case, we have enough historical information on process A that we assume values for the yield mean $\mu_A$ and standard deviation $\sigma$. If we gather a sample of 10 wafers for the new process, we can perform a simple one-sample hypothesis test using the 10-wafer sampling distribution and methods already discussed, assuming that both process A and B share the same variance.

Consider now the situation where we want to compare two new processes. We will fabricate 10 wafers with process A and another 10 wafers with process B, and then we will measure the yield for each wafer after processing. In order to block against possible time trends, we alternate between process A and process B on a wafer-by-wafer basis. We are seeking to test the hypothesis that process B is better than process A, that is $H_1$: $\mu_B > \mu_A$, as opposed to the null hypothesis $H_0$: $\mu_B = \mu_A$. Several approaches can be used (5).

In the first approach, we assume that in each process A and B we are random sampling from an underlying normal distribution, with (1) an unknown mean for each process and (2) a constant known value for the population standard deviation of $\sigma$. Then $n_A = 10$, $n_B = 10$, $\text{Var}\{\bar{y}_A\} = \sigma^2/n_A$, and $\text{Var}\{\bar{y}_B\} = \sigma^2/n_B$. We can then construct the sampling distribution for the difference in means as

$$\text{Var}\{\bar{y}_B - \bar{y}_A\} = \frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B} = \sigma^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right) \qquad (31)$$

$$\sigma_{B-A} = \sigma\sqrt{\frac{1}{n_A} + \frac{1}{n_B}} \qquad (32)$$

Even if the original process is moderately non-normal, the distribution of the difference in sample means will be approximately normal by the central limit theorem, so we can normalize as

$$z_0 = \frac{(\bar{y}_B - \bar{y}_A) - (\mu_B - \mu_A)}{\sigma\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}} \qquad (33)$$

allowing us to examine the probability of observing the mean difference $\bar{y}_B - \bar{y}_A$ based on the unit normal distribution, $\Pr(z > z_0)$.

The disadvantage of the above method is that it depends on knowing the population standard deviation $\sigma$. If such information is indeed available, using it will certainly improve the ability to detect a difference. In the second approach, we assume that our 10-wafer samples are again drawn by random sampling from an underlying normal population, but in this case we do not assume that we know a priori what the population variance is. In this case, we must also build an internal estimate of the variance. First, we estimate the individual variances:

$$s_A^2 = \frac{1}{n_A-1}\sum_{i=1}^{n_A}(y_{Ai} - \bar{y}_A)^2 \qquad (34)$$

and similarly for $s_B^2$. The pooled variance is then

$$s^2 = \frac{(n_A-1)s_A^2 + (n_B-1)s_B^2}{n_A + n_B - 2} \qquad (35)$$

Once we have an estimate for the population variance, we can perform our t test using this pooled estimate:

$$t_0 = \frac{(\bar{y}_B - \bar{y}_A) - (\mu_B - \mu_A)}{s\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}} \qquad (36)$$

One must be careful in assuming that processes A and B share a common variance; in many cases this is not true and more sophisticated analysis is needed (5).
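Equations (34)-(36) translate into a few lines of code. The wafer yield values below are purely illustrative (they are not from the article), and under $H_0$ the statistic would be compared against $t$ percentiles with $n_A + n_B - 2$ degrees of freedom:

```python
# Pooled two-sample t statistic of Eqs. (34)-(36), hypothetical yield data.
import math
from statistics import mean, variance

def pooled_t(a: list, b: list) -> float:
    na, nb = len(a), len(b)
    s2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / math.sqrt(s2 * (1 / na + 1 / nb))

yield_a = [78.0, 81.0, 79.5, 80.0, 77.5]   # process A wafer yields (%)
yield_b = [82.0, 83.5, 81.0, 84.0, 82.5]   # process B wafer yields (%)
t0 = pooled_t(yield_a, yield_b)
print(f"t0 = {t0:.2f} on {len(yield_a) + len(yield_b) - 2} d.o.f.")
```

Here $t_0 \approx 4.06$ on 8 degrees of freedom, well beyond $t_{0.05,8} = 1.86$, so for these illustrative numbers we would conclude process B yields higher.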

    Comparing Several Treatment Means via ANOVA

In many cases, we are interested in comparing several treatments simultaneously. We can generalize the approach discussed above to examine if the observed differences in treatment means are indeed significant, or could have occurred by chance (through random sampling of the same underlying population).

A picture helps explain what we are seeking to accomplish. As shown in Fig. 5, the population distribution for each treatment is shown; the mean can differ because the treatment is shifted, while we assume that the population variance $\sigma^2$ is fixed. The sampling distribution for each treatment, on the other hand, may also be different if a different number of samples is drawn from each treatment (an unbalanced experiment), or because the treatment is in fact shifted. In most of the analyses that follow, we will assume balanced experiments; analysis of the unbalanced case can also be performed (7). It remains important to recognize that particular sample values from the sampling distributions are what we measure. In essence, we must compare the variance between groups (a measure of the potential shift between treatments) with the variance within each group (a measure of the sampling variance). Only if the shift is large compared to the sampling


    variance are we confident that a true effect is in place. Anappropriate sample size must therefore be chosen usingmethods previously described such that the experiment ispowerful enough to detect differences between thetreatments that are of engineering importance. In thefollowing, we discuss the basic methods used to analyze theresults of such an experiment (8).

First, we need an estimate of the within-group variation. Here we again assume that each group is normally distributed and shares a common variance $\sigma^2$. Then we can form the sum of squared deviations within the tth group as

$$SS_t = \sum_{j=1}^{n_t}(y_{tj} - \bar{y}_t)^2 \qquad (37)$$

where $n_t$ is the number of observations or samples in group $t$. The estimate of the sample variance for each group, also referred to as the mean square, is thus

$$s_t^2 = \frac{SS_t}{\nu_t} = \frac{SS_t}{n_t - 1} \qquad (38)$$

where $\nu_t$ is the degrees of freedom in treatment $t$. We can generalize the pooling procedure used earlier to estimate a common shared variance across $k$ treatments as

$$s_R^2 = \frac{\nu_1 s_1^2 + \nu_2 s_2^2 + \cdots + \nu_k s_k^2}{\nu_1 + \nu_2 + \cdots + \nu_k} = \frac{SS_R}{N-k} = \frac{SS_R}{\nu_R} \qquad (39)$$

where $s_R^2$ is defined as the within-treatment or within-group mean square and $N$ is the total number of measurements, $N = n_1 + n_2 + \cdots + n_k$, or simply $N = nk$ if all treatments consist of the same number of samples, $n$ (so that $\nu_1 = \nu_2 = \cdots = \nu_k$).

In the second step, we want an estimate of between-group variation. We will ultimately be testing the hypothesis $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$. The estimate of the between-group variance is

$$s_T^2 = \frac{\displaystyle\sum_{t=1}^{k} n_t(\bar{y}_t - \bar{y})^2}{k-1} = \frac{SS_T}{\nu_T} \qquad (40)$$

where $s_T^2$ is defined as the between-treatment mean square, $\nu_T = k - 1$, and $\bar{y}$ is the overall mean.

We are now in position to ask our key question: Are the treatments different? If they are indeed different, then the between-group variance will be larger than the within-group variance. If the treatments are the same, then the between-group variance should be the same as the within-group variance. If in fact the treatments are different, we thus find that

$$s_T^2 \text{ estimates } \sigma^2 + \frac{\displaystyle\sum_{t=1}^{k} n_t\tau_t^2}{k-1} \qquad (41)$$

where $\tau_t = \mu_t - \mu$ is treatment t's effect. That is, $s_T^2$ is inflated by some factor related to the difference between treatments. We can perform a formal statistical test for treatment significance. Specifically, we should consider the evidence to be strong for the treatments being different if the ratio $s_T^2/s_R^2$ is significantly larger than 1. Under our assumptions, this should be evaluated using the F distribution, since $s_T^2/s_R^2 \sim F_{k-1,\,N-k}$ under the null hypothesis.

We can also express the total variation (total deviation sum of squares $SS_D$ from the grand mean $\bar{y}$) observed in the data as

$$SS_D = \sum_{t=1}^{k}\sum_{i=1}^{n_t}(y_{ti} - \bar{y})^2, \qquad s_D^2 = \frac{SS_D}{\nu_D} = \frac{SS_D}{N-1} \qquad (42)$$

where $s_D^2$ is recognized as the variance in the data.
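The whole one-way decomposition of Eqs. (37)-(42) fits in a short script. The three treatment groups below are illustrative data, not from the article; note the check that $SS_D = SS_T + SS_R$ holds exactly:

```python
# One-way (single-factor) ANOVA decomposition, Eqs. (37)-(42).
from statistics import mean

groups = {
    "A": [10.0, 11.0, 9.5, 10.5],
    "B": [12.0, 12.5, 11.5, 13.0],
    "C": [10.2, 10.8, 9.8, 10.4],
}

k = len(groups)
N = sum(len(v) for v in groups.values())
grand = mean(x for v in groups.values() for x in v)

ss_t = sum(len(v) * (mean(v) - grand) ** 2 for v in groups.values())  # between
ss_r = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)    # within
s2_t, s2_r = ss_t / (k - 1), ss_r / (N - k)
f_ratio = s2_t / s2_r
print(f"F({k - 1}, {N - k}) = {f_ratio:.2f}")
```

For these numbers the F ratio is about 15.5 on (2, 9) degrees of freedom, far above $F_{0.05}(2,9) = 4.26$, so the treatment means would be judged significantly different.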

Analysis of variance results are usually expressed in tabular form, such as shown in Table 3. In addition to compactly summarizing the sum of squares due to various components and the degrees of freedom, the appropriate F ratio is shown. The last column of the table usually contains the probability of observing the stated F ratio under the null hypothesis. The alternative hypothesis --- that one or more

Figure 5. Pictorial representation of multiple-treatment experimental analysis. The underlying population for each treatment may be different, while the variance for each treatment population is assumed to be constant. The dark circles indicate the sample values drawn from these populations (note that the number of samples may not be the same in each treatment).


treatments have a mean different than the others --- can be accepted with $100(1-p)\%$ confidence.

As summarized in Table 3, analysis of variance can also be pictured as a decomposition of the variance observed in the data. That is, we can express the total sum of squared deviations from the grand mean as $SS_D = SS_T + SS_R$, or the between-group sum of squares added to the within-group sum of squares. We have presented the results here around the grand mean. In some cases, an alternative presentation is used where a total sum of squares $SS = y_1^2 + y_2^2 + \cdots + y_N^2$ is used. In this case the total sum of squares also includes variation due to the average, $SS_A = N\bar{y}^2$, or $SS = SS_A + SS_D = SS_A + SS_T + SS_R$. In Table 3, $k$ is the number of treatments and $N$ is the total number of observations. The F-ratio value is $s_T^2/s_R^2$, which is compared against the F distribution for the degrees of freedom $\nu_T$ and $\nu_R$. The p value shown in such a table is $\Pr(F_{\nu_T,\,\nu_R} > s_T^2/s_R^2)$.

While often not explicitly stated, the above ANOVA assumes a mathematical model:

$$y_{ti} = \mu_t + \epsilon_{ti} = \mu + \tau_t + \epsilon_{ti} \qquad (43)$$

where $\mu_t$ are the treatment means and $\epsilon_{ti}$ are the residuals:

$$\epsilon_{ti} = y_{ti} - \hat{y}_{ti}, \qquad \epsilon_{ti} \sim N(0, \sigma^2) \qquad (44)$$

where $\hat{y}_{ti} = \bar{y}_t$ is the estimated treatment mean. It is critical that one check the resulting ANOVA model. First, the residuals $\epsilon_{ti}$ should be plotted against the time order in which the experiments were performed in an attempt to distinguish any time trends. While it is possible to randomize against such trends, we lose resolving power if the trend is large. Second, one should examine the distribution of the residuals. This is to check the assumption that the residuals are random [that is, independent and identically distributed (IID) and normally distributed with zero mean] and look for gross non-normality. This check should also include an examination of the residuals for each treatment group. Third, one should plot the residuals versus the estimates $\hat{y}_{ti}$ and be especially alert to dependencies on the size of the estimate (e.g., proportional versus absolute errors). Finally, one should also plot the residuals against any other variables of interest, such as environmental factors that may have been recorded. If unusual behavior is noted in any of these steps, additional measures should be taken to stabilize the variance (e.g., by considering transformations of the variables or by reexamining the experiment for other factors that may need to be either blocked against or otherwise included in the experiment).

Two-Way Analysis of Variance

Suppose we are seeking to determine if various treatments are important in determining an output effect, but we must conduct our experiment in such a way that another variable (which may also impact the output) must also vary. For example, suppose we want to study two treatments A and B but must conduct the experiments on five different process tools (tools 1-5). In this case, we must carefully design the experiment to block against the influence of the process tool factor. We now have an assumed model:

$$y_{ti} = \mu + \tau_t + \beta_i + \epsilon_{ti} \qquad (45)$$

where $\beta_i$ are the block effects. The total sum of squares $SS$ can now be decomposed as

$$SS = SS_A + SS_B + SS_T + SS_R \qquad (46)$$

with degrees of freedom

$$bk = 1 + (b-1) + (k-1) + (b-1)(k-1) \qquad (47)$$

where $b$ is the number of blocking groups, $k$ is the number of treatment groups, and $SS_B = k\sum_{i=1}^{b}(\bar{y}_i - \bar{y})^2$. As before, if the blocks or treatments do in fact include any mean shifts, then the corresponding mean sums of squares (estimates of the corresponding variances) will again be inflated beyond the population variance (assuming the number of samples at each treatment is equal):

$$s_B^2 \text{ estimates } \sigma^2 + \frac{k\displaystyle\sum_{i=1}^{b}\beta_i^2}{b-1}, \qquad s_T^2 \text{ estimates } \sigma^2 + \frac{\displaystyle\sum_{t=1}^{k} n_t\tau_t^2}{k-1} \qquad (48)$$

Table 3: Structure of the Analysis of Variance Table, for Single Factor (Treatment) Case

Source of Variation        | Sum of Squares          | Degrees of Freedom                  | Mean Square | F ratio       | Pr(F)
Between treatments         | $SS_T$                  | $\nu_T = k-1$                       | $s_T^2$     | $s_T^2/s_R^2$ | $p$
Within treatments          | $SS_R$                  | $\nu_R = N-k$                       | $s_R^2$     |               |
Total about the grand mean | $SS_D = SS_T + SS_R$    | $\nu_D = \nu_T + \nu_R = N-1$       | $s_D^2$     |               |


where we assume that $k$ samples are observed for each of the $b$ blocks. So again, we can now test the significance of these potentially inflated variances against the pooled estimate $s_R^2$ of the variance with the appropriate F test as summarized in Table 4. Note that in Table 4 we have shown the ANOVA table in the non-mean-centered or uncorrected format, so that $SS_A$ due to the average is shown. The table can be converted to the mean-corrected form as in Table 3 by removing the $SS_A$ row, and replacing the total $SS$ with the mean-corrected total $SS_D = SS - SS_A$ with $bk - 1$ degrees of freedom.
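The randomized-block decomposition of Eq. (46) can be carried out by hand for a small table. The yields below are illustrative only (two treatments across three tools, one observation per cell, rather than the five tools of the text's example):

```python
# Randomized-block decomposition, Eq. (46): SS = SS_A + SS_B + SS_T + SS_R
# for a k x b table with one observation per (treatment, tool) cell.
from statistics import mean

# rows: treatments (k = 2); columns: process tools (b = 3)
y = [
    [81.0, 79.0, 83.0],   # treatment A
    [85.0, 82.0, 88.0],   # treatment B
]
k, b = len(y), len(y[0])
flat = [v for row in y for v in row]
grand = mean(flat)

row_means = [mean(row) for row in y]
col_means = [mean(col) for col in zip(*y)]

ss = sum(v * v for v in flat)                        # uncorrected total
ss_a = k * b * grand ** 2                            # correction for the mean
ss_t = b * sum((m - grand) ** 2 for m in row_means)  # between treatments
ss_b = k * sum((m - grand) ** 2 for m in col_means)  # between blocks (tools)
ss_r = ss - ss_a - ss_t - ss_b                       # residual
print(f"SS_T = {ss_t:.2f}, SS_B = {ss_b:.2f}, SS_R = {ss_r:.2f}")
```

Each component is then divided by its degrees of freedom from Eq. (47) to form the mean squares that enter the F tests of Table 4.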

    Two-Way Factorial Designs

    While the above is expressed with the terminology of thesecond factor being considered a blocking factor, preciselythe same analysis pertains if two factors are simultaneouslyconsidered in the experiment. In this case, the blockinggroups are the different levels of one factor, and thetreatment groups are the levels of the other factor. Theassumed analysis of variance above is with the simpleadditive model (that is, assuming that there are nointeractions between the blocks and treatments, or betweenthe two factors). In the blocked experiment, the intent of theblocking factor was to isolate a known (or suspected) sourceof contamination in the data, so that the precision of theexperiment can be improved.

    We can remove two of these assumptions in ourexperiment if we so desire. First, we can treat both variablesas equally legitimate factors whose effects we wish toidentify or explore. Second, we can explicitly design theexperiment and perform the analysis to investigateinteractionsbetween the two factors. In this case, the modelbecomes

    (49)

    where is the effect that depends on both factors

    simultaneously. The output can also be expressed as

    (50)

    where is the overall grand mean, and are the main

    effects, and are the interaction effects. In this case, the

    subscripts are

    The resulting ANOVA table will be familiar; the one keyaddition is explicit consideration of the interaction sum of

    squares and mean square. The variance captured in thiscomponent can be compared to the within-group variance asbefore, and be used as a measure of significance for theinteractions, as shown in Table 5.

    Another metric often used to assess goodness of fit of

    a model is theR2. The fundamental question answered byR2

    is how much better does the model do than simply using thegrand average.

    (51)

    where is the variance or sum of square deviations

    around the grand average, and is the variance capturedby the model consisting of experimental factors ( and

    ) and interactions ( ). AnR2value near zero indicates

    that most of the variance is explained by residuals( ) rather than by the model terms ( ),

    while an R2 value near 1 indicates that the model sum ofsquare terms capture nearly all of the observed variation in

    Table 4: Structure of the Analysis of Variance Table,

    for the Case of a Treatment with a Blocking Factor

    Source of Variation Sum of Squares Degrees of Freedom MeanSquare

    Fratio Pr(F)

    Average (correction factor) 1

    Between blocks

    Between treatments

    Residuals

    Total

    SSA nk y2=

    SSB k yi y( )

    2

    i 1=

    b

    = B b 1= s

    B

    2sB

    2sR

    2 pB

    SST b yt y( )2

    t 1=

    k

    = T k 1= sT

    2sT

    2sR

    2 pT

    SSR R b 1( ) k 1( )= sR2

    SS N bk= = sD2

    sR2

    SSA

    SSA SS

    SSD N 1=

    yti j ti ti j+=

    ti

    ti

    t

    i

    ti

    + + +=

    y yt y( ) yi y( ) yti yt yi y+( )+ + +=

    t iti

    t 1 2 k, , ,= where k is the # levels of first factori 1 2 b, , ,= where b is the # levels of second factorj 1 2 m, , ,= where m is the # replicates at t, i factor

    R2 SSM

    SSD----------

    SST SSB SSI+ +( )SSD

    ---------------------------------------------= =

    SSD

    SSM

    SST

    SSB SSI

    SSR SSD SSM= SSM


the data. It is clear that a more sophisticated model with additional model terms will increase SS_M, and thus an apparent improvement in explanatory power may result from adding model terms. An alternative metric is the adjusted R², where a penalty is added for the use of degrees of freedom in the model:

    Adj. R² = 1 − (SS_R/ν_R)/(SS_D/ν_D) = 1 − s_R²/s_D² = 1 − (Mean Sq. of Residual)/(Mean Sq. of Total)    (52)

which is more easily interpreted as the fraction of the variance that is not explained by the residuals (s_R²/s_D²). An important issue, however, is that variance may appear to be explained when the model in fact does not fit the population. One should formally test for lack of fit (as described in the regression modeling section to follow) before reporting R², since R² is only a meaningful measure if there is no lack of fit.

Several mnemonics within the factorial design of experiments methodology facilitate the rapid or manual estimation of main effects, as well as interaction effects (8). Elements of the methodology include (a) assignment of high (+) and low (−) values for the variables, (b) coding of the experimental combinations in terms of these high and low levels, (c) randomization of the experimental runs (as always!), and (d) estimation of effects by attention to the experimental design table combinations.

For example, one might perform a 2³ full factorial design [where the superscript indicates the number of factors (three in this case), and the base indicates the number of levels for each factor (two in this case)], where the factors are labeled A, B, and C. The eight unique combinations of these factors can be summarized as in Table 6, with the resulting measured results added during the course of the experiment. The main effect of a factor can be estimated by taking the difference between the average of the (+) levels for that factor and the average of the (−) levels for that factor; for example, Effect_A = ȳ_{A+} − ȳ_{A−}, and similarly for the other main effects. Two-level interactions can be found in an analogous fashion,

    Interaction_AB = ½ (Effect_{A at B+} − Effect_{A at B−})

where one takes the difference between the one-factor averages at the high and low values of the second factor. Simple methods are also available in the full factorial case for estimation of factor effect sampling variances, when replicate runs have been performed. In the simple case above, where only a single run is performed at each experimental condition, there are no simple estimates of the underlying process or measurement variance, and so assessment of significance is not possible. If, however, one performs m_i replicates at the ith experimental condition, one can pool the individual estimates s_i² of variance at each of the experimental conditions to gain an overall variance estimate (8):

    s² = (ν_1 s_1² + ν_2 s_2² + ⋯ + ν_g s_g²)/(ν_1 + ν_2 + ⋯ + ν_g)    (53)

where ν_i = m_i − 1 are the degrees of freedom at condition i, and g is the total number of experimental conditions examined.

The sampling variance for an effect estimate can then be calculated; in our previous example we might perform two runs at each of the eight experimental points, so that each effect estimate is a difference between two averages of eight observations each, and

    Var{Effect_A} = Var{ȳ_{A+}} + Var{ȳ_{A−}} = s²/8 + s²/8 = s²/4    (54)

Table 5: Structure of the Analysis of Variance Table, for Two-factor Case with Interaction Between Factors

| Source of Variation        | Sum of Squares                     | Degrees of Freedom   | Mean Square | F ratio   | Pr(F) |
|----------------------------|------------------------------------|----------------------|-------------|-----------|-------|
| Between levels of factor 1 | SS_T = bm Σ_{t=1}^{k} (ȳ_t − ȳ)²   | ν_T = k − 1          | s_T²        | s_T²/s_E² | p_T   |
| Between levels of factor 2 | SS_B = km Σ_{i=1}^{b} (ȳ_i − ȳ)²   | ν_B = b − 1          | s_B²        | s_B²/s_E² | p_B   |
| Interaction                | SS_I                               | ν_I = (k − 1)(b − 1) | s_I²        | s_I²/s_E² | p_I   |
| Within groups (error)      | SS_E                               | ν_E = bk(m − 1)      | s_E²        |           |       |
| Total (mean corrected)     | SS_D                               | ν_D = bkm − 1        | s_D²        |           |       |
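The sum-of-squares decomposition and F tests summarized in the tables above can be sketched numerically. The following is a minimal illustration for the additive two-factor (one run per cell) case; the data are made up, and the use of numpy and scipy is an assumption, not something specified by the article:

```python
import numpy as np
from scipy import stats

# Made-up data: b = 4 levels of one factor (rows, "blocks") by
# k = 3 levels of the other (columns, "treatments"), one run per cell
y = np.array([[89., 88., 97.],
              [84., 77., 92.],
              [81., 87., 87.],
              [87., 92., 89.]])
b, k = y.shape
grand = y.mean()

SS_B = k * ((y.mean(axis=1) - grand) ** 2).sum()   # between blocks
SS_T = b * ((y.mean(axis=0) - grand) ** 2).sum()   # between treatments
SS_D = ((y - grand) ** 2).sum()                    # mean-corrected total
SS_R = SS_D - SS_B - SS_T                          # residual

# Mean squares, F ratio, and p value for the treatment effect
s2_R = SS_R / ((b - 1) * (k - 1))
s2_T = SS_T / (k - 1)
F_T = s2_T / s2_R
p_T = stats.f.sf(F_T, k - 1, (b - 1) * (k - 1))
```

A statistical package would report the same table directly; the point of the sketch is only to make the decomposition SS_D = SS_B + SS_T + SS_R concrete.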


These methods can be helpful for rapid estimation of experimental results and for building intuition about contrasts in experimental designs; however, statistical packages provide the added benefit of assisting not only in quantifying factor effects and interactions, but also in examining the significance of these effects and creating confidence intervals on the estimates of factor effects and interactions. Qualitative tools for examining effects include Pareto charts, to compare and identify the most important effects, and normal or half-normal probability plots, to determine which effects are significant.

In the discussion so far, the experimental factor levels are treated as either nominal or continuous parameters. An important issue is the estimation of the effect of a particular factor, and determination of the significance of any observed effect. Such results are often pictured graphically in a succinct fashion, as illustrated in Fig. 6 for a two-factor experiment, where two levels for each of factor A and factor B are examined.

In the full factorial case, interactions can also be explored, and the effects plots modified to show the effect of each factor on the output parameter of concern (yield in this case) at different levels of the other factor. Various cases may result; as shown in Fig. 7, no interaction may be observed, or a synergistic (or anti-synergistic) interaction may be present. These analyses are applicable, for example, when the factor levels represent discrete or nominal decisions to be made: perhaps level (+) for factor A is to perform a clean step while (−) is to omit the step, and level (+) for factor B is to use one chemical in the step while level (−) is to use a different chemical.

    Nested Variance Structures

While blocking factors may seem somewhat esoteric or indicative of an imprecise experiment, it is important to realize that blocking factors do in fact arise extremely frequently in semiconductor manufacturing. Indeed, most experiments or sets of data will actually be taken under situations where great care must be taken in the analysis of variance. In effect, such blocking factors arise due to nested variance structures in typical semiconductor manufacturing, which restrict the full randomization of an experimental design (9). For example, if one samples die from multiple wafers, it must be recognized that those die reside within different wafers; thus the wafer is itself a blocking factor and must be accounted for.

For example, consider multiple measurements of oxide film thickness across a wafer following oxide deposition. One might expect that measurements from the same wafer would be more similar to each other than those across multiple wafers; this would correspond to the case where the within-wafer uniformity is better than the wafer-to-wafer uniformity. On the other hand, one might also find that the measurements from the corresponding sites on each wafer (e.g., near the lower left edge of the wafer) are more similar to each other than are the different sites across the same wafer; this would correspond to the case where wafer-to-

Table 6: Full Factorial 2³ Experimental Design (with Coded Factor Levels)

| Experiment Condition Number | Factor A | Factor B | Factor C | Measured Result |
|-----------------------------|----------|----------|----------|-----------------|
| 1                           | −        | −        | −        |                 |
| 2                           | −        | −        | +        |                 |
| 3                           | −        | +        | −        |                 |
| 4                           | −        | +        | +        |                 |
| 5                           | +        | −        | −        |                 |
| 6                           | +        | −        | +        |                 |
| 7                           | +        | +        | −        |                 |
| 8                           | +        | +        | +        |                 |
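The effect-estimation arithmetic for a replicated design of this kind can be sketched as follows; the measurements are made up, and the use of numpy is an assumption on our part:

```python
import numpy as np
from itertools import product

# Coded 2^3 design in the row order of Table 6 (columns A, B, C)
design = np.array(list(product([-1, 1], repeat=3)))

# Two made-up replicate measurements at each of the 8 conditions
runs = np.array([[60., 62.], [72., 70.], [54., 52.], [68., 70.],
                 [52., 54.], [83., 81.], [45., 47.], [80., 78.]])
ybar = runs.mean(axis=1)              # average result at each condition

# Main effect of each factor: average at (+) minus average at (-),
# computed as a contrast against the coded column
effect = {f: design[:, j] @ ybar / 4 for j, f in enumerate("ABC")}

# Two-factor interaction, e.g. AB: contrast with the product of the columns
effect_AB = (design[:, 0] * design[:, 1]) @ ybar / 4

# Pooled variance from the replicates (Eq. 53; one d.o.f. per condition),
# and the sampling variance of an effect (Eq. 54): s^2/8 + s^2/8 = s^2/4
s2 = runs.var(axis=1, ddof=1).mean()
var_effect = s2 / 4
```

Each contrast divides by 4 because eight cell averages split into four at (+) and four at (−).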

Figure 6. Main effects plot for a two-factor, two-level experiment. The influence of Factor A on yield is larger than that of Factor B. Analysis of variance is required in order to determine if the observed results are significant (and not the result of chance variation).

Figure 7. Interaction plot. In case 1, no clear interaction is observed: Factor B does not appear to change the effect of factor A on the output. Rather, the effects from factor A and factor B appear to be additive. In case 2, the level of factor B strongly influences the response of yield to the high level of factor A.


wafer repeatability is very good, but within-wafer uniformity may be poor. In order to model the important aspects of the process and take the correct improvement actions, it will be important to be able to distinguish between such cases and clearly identify where the components of variation are coming from.

In this section, we consider the situation where we believe that multiple site measurements within the wafer can be treated as independent and identical samples. This is almost never the case in reality, and the values of within-wafer variance that result are not true measures of wafer variance, but rather only of the variation across those (typically fixed or preprogrammed) sites measured. In our analysis, we are most concerned that the wafer is acting as a blocking factor, as shown in Fig. 8. That is, we first consider the case where we find that the five measurements we take on the wafer are relatively similar, but the wafer-to-wafer average of these values varies dramatically. This two-level variance structure can be described as (9):

    y_ij = μ + W_i + M_j(i)    (55)
    W_i ~ N(0, σ_W²)   for i = 1, …, n_W
    M_j(i) ~ N(0, σ_M²)   for j = 1, …, n_M

where n_M is the number of measurements taken on each of n_W wafers, and W_i and M_j(i) are independent normal random variables drawn from the distributions of wafer-to-wafer variations and measurements taken within the ith wafer, respectively. In this case, the total variation in oxide thickness is composed of both within-wafer variance σ_M² and wafer-to-wafer variance σ_W²:

    σ_T² = σ_W² + σ_M²    (56)

Note, however, that in many cases (e.g., for control charting), one is not interested in the total range of all individual measurements; rather, one may desire to understand how the set of wafer means itself varies. That is, one is seeking to estimate

    σ_W̄² = σ_W² + σ_M²/n_M    (57)

where W̄ indicates averages over the measurements within any one wafer.

Substantial care must be taken in estimating these variances; in particular, one can directly estimate the measurement variance σ_M² and the wafer average variance σ_W̄², but must infer the wafer-level variance σ_W² using

    σ_W² = σ_W̄² − σ_M²/n_M    (58)

The within-wafer variance is most clearly understood as the average over the n_W available wafers of the variance s_i² within each of those i wafers:

    s_M² = (1/n_W) Σ_{i=1}^{n_W} s_i² = (1/n_W) Σ_{i=1}^{n_W} Σ_{j=1}^{n_M} (Y_ij − Ȳ_i)²/(n_M − 1)    (59)

where Ȳ_i indicates an average over the j index (that is, a within-wafer average):

    Ȳ_i = (1/n_M) Σ_{j=1}^{n_M} Y_ij    (60)

The overall variance in wafer averages can be estimated simply as

    s_W̄² = 1/(n_W − 1) Σ_{i=1}^{n_W} (Ȳ_i − Ȳ)²    (61)

where Ȳ is the grand mean over all measurements:

    Ȳ = 1/(n_W n_M) Σ_{i=1}^{n_W} Σ_{j=1}^{n_M} Y_ij    (62)

Thus, the wafer-level variance can finally be estimated as

    s_W² = s_W̄² − s_M²/n_M    (63)

The same approach can be used for more deeply nested variance structures (9,10). For example, a common structure occurring in semiconductor manufacturing is measurements within wafers, and wafers within lots. Confidence limits can also be established for these estimates of variance (11). The computation of such estimates becomes substantially complicated, however (especially if the data are unbalanced and have different numbers of measurements per sample at each nested level), and statistical software packages are the best option.

Figure 8. Nested variance structure. Oxide thickness variation consists of both within-wafer and wafer-to-wafer components.
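The estimators in Eqs. (59), (61), and (63) can be sketched on simulated data; the parameter values and the use of numpy are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_W, n_M = 50, 5                  # wafers, measurement sites per wafer
sigma_W, sigma_M = 10.0, 4.0      # true wafer-level / within-wafer sigmas

# Simulate y_ij = mu + W_i + M_j(i), as in Eq. (55)
y = 1030 + rng.normal(0, sigma_W, (n_W, 1)) + rng.normal(0, sigma_M, (n_W, n_M))

# Within-wafer variance: average of the per-wafer sample variances (Eq. 59)
s2_M = y.var(axis=1, ddof=1).mean()

# Variance of the wafer averages (Eq. 61)
s2_Wbar = y.mean(axis=1).var(ddof=1)

# Inferred wafer-level variance (Eq. 63)
s2_W = s2_Wbar - s2_M / n_M
```

With these settings the estimates should land near the true values σ_M² = 16 and σ_W² = 100, illustrating why s_W̄² alone (which mixes in σ_M²/n_M) is not a direct estimate of the wafer-level component.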


Several assumptions are made in the analysis of variance components for the nested structures above. Perhaps the most important is an assumption of random sampling within each level of nesting. For example, we assume that each measurement (within each wafer) is IID and a random sample from within the wafer. If the same measurement points are taken on each wafer, however, one is not in fact truly estimating the within-wafer variation, but rather the fixed-effect variance between these measurement points. Systematic spatial patterns (e.g., bowl, donut, or slanted plane patterns across the wafer) often occur due to gas flow, pressure, or other equipment nonuniformities. For this reason, it is common practice to use a spatially consistent five-point (or 21-point or 49-point) sampling scheme when making measurements within a wafer. An option which adds complexity but also adds precision is to model each of these sites separately (e.g., maintain left, right, top, bottom, and center points) and consider how these compare with other points within the wafer, as well as from wafer to wafer. Great care is required in such site modeling approaches, however, because one must account for the respective variances at multiple levels appropriately in order to avoid biased estimates (12-14).

Experimental designs that include nested variance sampling plans are also sometimes referred to as split-plot designs, in which a factorial design in fact has restrictions on randomization (7,15). Among the most common restrictions are those due to spatial factors, and spatial modeling likewise requires great care (16). Other constraints of the real world, such as hardware factors, may make complete randomization infeasible due to the time and cost of installing/removing hardware (e.g., in studying alternative gas distribution plates in a plasma reactor). Methods exist for handling such constraints (split-plot analysis), but the analysis cannot be done if the experimental sampling plan does not follow an appropriate split-plot design. Using split-plot analyses, however, we can resolve components of variation (in the case of sites within wafers) into residual, site, wafer, and wafer-site interactions, as well as the effects of the treatment under consideration. For these reasons, it can be expected that nested variance structures and split-plot designs will receive even greater future attention and application in semiconductor manufacturing.

    Progression of Experimental Designs

It is worth considering when and how various experimental design approaches might best be used. When confronted with a new problem which lacks thorough preexisting knowledge, the first step should be screening experiments, which seek to identify what the important variables are. At this stage, only crude predictions of experimental effects as discussed above are needed, but often a large number of candidate factors (often six or more) are of potential interest. By sacrificing accuracy and certainty in interpretation of the results (primarily by allowing interactions to confound with other interactions or even with main effects), one can often gain a great deal of initial knowledge at reasonable cost. In these cases, fractional factorial, Plackett-Burman, and other designs may be used.
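The screening idea above can be made concrete with a small sketch: generate a full 2³ design and take the half-fraction defined by the relation I = ABC (an illustrative choice of defining relation on our part):

```python
from itertools import product

# Full 2^3 factorial: all coded (+/-1) combinations of factors A, B, C
full = [dict(zip("ABC", levels)) for levels in product((-1, 1), repeat=3)]

# Half fraction 2^(3-1) with defining relation I = ABC: keep the runs where
# A*B*C = +1; each main effect is then confounded with a two-factor
# interaction (A with BC, B with AC, C with AB)
half = [run for run in full if run["A"] * run["B"] * run["C"] == 1]
```

The four retained runs cost half as much as the full design, at the price of exactly the confounding described in the text.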

Once a smaller number of effects have been identified, full factorial or fractional factorial designs are often utilized, together with simplified linear model construction and analysis of variance. In such cases, the sampling plan must again be carefully considered in order to ensure that sufficient data is taken to draw valid conclusions. It is often possible to test for model lack of fit, which may indicate that more thorough experiments are needed or that additional experimental design points should be added to the existing experimental data (e.g., to complete half fractions). The third phase is then undertaken, which involves experimental design with small numbers of factors (e.g., two to six) to support linear effects, interactions, and second-order (quadratic) model terms. These regression models will be considered in the next section.

A variety of sophisticated experimental design methods are available and applicable to particular problems. In addition to factorial and optimal design methods (8,10), robust design approaches (as popularized by Taguchi) are helpful, particularly when the goal is to aid in the optimization of the process (17,18).

    RESPONSE SURFACE METHODS

In the previous section, we considered the analysis of variance, first in the case of single treatments, and then in the case when blocking factors must also be considered. These were generalized to consideration of two-factor experiments, where the interaction between these factors can also be considered. If the factors can take on continuous values, the above analysis is still applicable. However, in these cases it is often more convenient and useful to consider or seek to model (within the range of factor levels considered) an entire response surface for the parameter of interest. Specifically, we wish to move from our factorial experiments with an assumed model of the form

    y_ij = η + A_i + B_j + ε_ij    (64)

where we can only predict results at the discrete prescribed i, j levels of factors A and B, toward a new model of the process of the form

    y = β_1 x^(1) + β_2 x^(2) + ⋯ + ε    (65)

where each x^(j) ∈ [x^(j)_min, x^(j)_max] is a particular factor of interest.

In this section, we briefly summarize the methods for estimation of the factor response coefficients β_j, as well as


for analysis of the significance of such effects based on experimental design data. We begin with a simple one-parameter model, and build complexity and capability from there.
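Before specializing to a single variable, note that a multifactor linear model of the form of Eq. (65) can be fit directly by least squares. A minimal numpy sketch, with made-up responses and an assumed two-factor model with an intercept term:

```python
import numpy as np

# Made-up responses at four coded (x1, x2) settings; first column is the
# intercept term, so the assumed model is y = b0 + b1*x1 + b2*x2
X = np.array([[1., -1., -1.],
              [1., -1.,  1.],
              [1.,  1., -1.],
              [1.,  1.,  1.]])
y = np.array([3., 5., 7., 9.])

# Least-squares estimate of the response coefficients
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

For this (orthogonal, coded) design the fit is exact, and each coefficient is simply half the corresponding factorial effect.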

    Single Variable Least-squares Regression

The standard approach used here is least-squares regression to estimate the coefficients in regression models. In the simple one-parameter case considered here, our actual response is modeled as

    y_i = β x_i + ε_i    (66)

where y_i indicates the ith measurement, taken at a value x_i for the explanatory variable x. The estimate for the output is thus simply ŷ_i = b x_i, where we expect some residual error e_i = y_i − ŷ_i. Least-squares regression finds the best fit of our model to the data, where "best" is that b which minimizes the sum of squared errors between the prediction and the n observed data values:

    SS_R = Σ_{i=1}^{n} (y_i − b x_i)²    (67)

It can easily be shown that for linear problems, a direct solution for b which gives the minimum SS_R is possible, and will occur when the vector of residuals e is normal to the vector of x values:

    x · e = Σ_i x_i (y_i − b x_i) = 0,  so that  b = (Σ_i x_i y_i)/(Σ_i x_i²)    (68)

An important issue is the estimate of the experimental error. If we assume that the model structure is adequate, we can form an estimate of σ² simply as

    s² = SS_R/(n − 1)    (69)

One may also be interested in the precision of the estimate b; that is, the variance in b:

    Var{b} = σ²/(Σ_i x_i²), estimated by s²/(Σ_i x_i²)    (70)

assuming that the residuals are independent and identically normally distributed with mean zero. More commonly, one refers to the standard error SE(b) = s/√(Σ_i x_i²), and writes b ± SE(b). In a similar fashion, a confidence interval for our estimate of β can be defined by noting that the standardized value for b should be t-distributed:

    (b − β)/SE(b) ~ t_{n−1}    (71)

where β is the true value for the coefficient, so that

    b − t_{α/2, n−1} SE(b) ≤ β ≤ b + t_{α/2, n−1} SE(b)    (72)

Regression results should also be framed in an analysis of variance framework. In the simple one-factor case, a simple ANOVA table might be as shown in Table 7. In this case, SS_M is the sum of squared values of the estimates ŷ_i, and s_M² is an estimate of the variance explained by the model, where our model is purely linear (no intercept term) as given in Eq. (66). In order to test significance, we must compare the ratio of this value to the residual variance s_R² using the appropriate F test. In the case of a single variable, we note that the F test degenerates into the t test (F_{1,ν} = t_ν²), and a t test can be used to evaluate the significance of the model coefficient.
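The quantities in Eqs. (66)-(72) can be computed directly; the following is a minimal sketch with made-up data, where numpy and scipy are assumptions on our part:

```python
import numpy as np
from scipy import stats

# Made-up (x, y) data for the one-parameter model y = beta*x + eps
x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

b = (x @ y) / (x @ x)                # least-squares estimate, as in Eq. (68)
resid = y - b * x
s2 = (resid @ resid) / (n - 1)       # error variance estimate, Eq. (69)
se_b = np.sqrt(s2 / (x @ x))         # standard error of b, from Eq. (70)

# 95% confidence interval for beta, as in Eq. (72)
t = stats.t.ppf(0.975, n - 1)
ci = (b - t * se_b, b + t * se_b)
```

For these data b is close to 2, and the interval quantifies how precisely the slope is pinned down by only five observations.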

In the analysis above, we have assumed that the values for x_i have been selected at random and are thus unlikely to be replicated. In many cases, it may be possible to repeat the experiment at particular x values, and doing so gives us the opportunity to decompose the residual error into two contributions. The residual sum of squares SS_R can be broken into a component due to lack of fit and a component due to pure