ISE235 08 Supp 6 SPC AppSemi: Statistical Methods for Semiconductor Manufacturing, Encyclopedia EE, 1999
STATISTICAL METHODS FOR SEMICONDUCTOR MANUFACTURING

DUANE S. BONING, Massachusetts Institute of Technology, Cambridge, MA
JERRY STEFANI AND STEPHANIE W. BUTLER, Texas Instruments, Dallas, TX
Semiconductor manufacturing increasingly depends on the use of statistical methods in order to develop and explore new technologies, characterize and optimize existing processes, make decisions about evolutionary changes to improve yields and product quality, and monitor and maintain well-functioning processes. Each of these activities revolves around data: Statistical methods are essential in the planning of experiments to efficiently gather relevant data, the construction and evaluation of models based on data, and decision-making using these models.
THE NATURE OF DATA IN SEMICONDUCTOR MANUFACTURING
An overview of the variety of data utilized in semiconductor manufacturing helps to set the context for application of the statistical methods to be discussed here. Most data in a typical semiconductor fabrication plant (or fab) consists of in-line and end-of-line data used for process control and material disposition. In-line data includes measurements made on wafer characteristics throughout the process (performed on test or monitor wafers, as well as product wafers), such as thicknesses of thin films after deposition, etch, or polish; critical dimensions (such as polysilicon linewidths) after lithography and etch steps; particle counts; sheet resistivities; etc. In addition to in-line wafer measurements, a large amount of data is collected on process and equipment parameters. For example, in plasma etch processes, time series of the intensities of plasma emission are often gathered for use in detecting the endpoint (completion) of etch, and these endpoint traces (as well as numerous other parameters) are recorded. At the end-of-line, volumes of test data are taken on completed wafers. This includes both electrical test (or etest) data reporting parametric information about transistor and other device performance on die and scribe line structures, and sort or yield tests to identify which die on the wafer function or perform at various specifications (e.g., binning by clock speed). Such data, complemented by additional sources (e.g., fab facility environmental data, consumables or vendor data, offline equipment data), are all crucial not only for day-to-day monitoring and control of the fab, but also to investigate improvements, characterize the effects of uncontrolled changes (e.g., a change in wafer supply parameters), and to redesign or optimize process steps.
For correct application of statistical methods, one must recognize the idiosyncrasies of such measurements arising in semiconductor manufacturing. In particular, the usual assumptions in statistical modeling of independence and identicality of measurements often do not hold, or only hold under careful handling or aggregation of the data. Consider the measurement of channel length (polysilicon critical dimension or CD) for a particular size transistor. First, the position of the transistor within the chip may affect the channel length: Nominally identical structures may indirectly be affected by pattern dependencies or other unintentional factors, giving rise to within-die variation. If one wishes to compare CD across multiple die on the wafer (e.g., to measure die-to-die variation), one must thus be careful to measure the same structure on each die to be compared. This same theme of "blocking" against extraneous factors to study the variation of concern is repeated at multiple length and time scales. For example, we might erroneously assume that every die on the wafer is identical, so that any random sample of, say, five die across the wafer can be averaged to give an accurate value for CD. Because of systematic wafer-level variation (e.g., due to large-scale equipment nonuniformities), however, each measurement of CD is likely to also be a function of the position of the die on the wafer. The average of these five positions can still be used as an aggregate indicator for the process, but care must be taken that the same five positions are always measured, and this average value cannot be taken as indicative of any one measurement.

Continuing with this example, one might also take five-die averages of CD for every wafer in a 25-wafer lot. Again, one must be careful to recognize that systematic wafer-to-wafer variation sources may exist. For example, in batch polysilicon deposition processes, gas flow and temperature patterns within the furnace tube will give rise to systematic wafer position dependencies; thus the 25 different five-die averages are not independent. Indeed, if one constructs a control chart under the assumption of independence, the estimate of five-die average variance will be erroneously inflated (it will also include the systematic within-lot variation), and the chart will have control limits set too far apart to detect important control excursions. If one is careful to consistently measure die averages on the same wafers (e.g., the 6th, 12th, and 18th wafers) in each lot and aggregate these as a lot average, one can then construct a historical distribution of this aggregate observation (where an observation thus consists of several measurements) and correctly design a control chart. Great care must be taken to account for or guard against these and other systematic sources of variation which may give rise to lot-to-lot variation; these include tool dependencies, consumables or tooling (e.g., the specific reticle used in photolithography), as well as other unidentified time dependencies.
Encyclopedia of Electrical Engineering, J. G. Webster, Ed., Wiley, to appear, Feb. 1999.
Formal statistical assessment is an essential complement to engineering knowledge of known and suspected causal effects and systematic sources of variation. Engineering knowledge is crucial to specify data collection plans that ensure that statistical conclusions are defensible. If data is collected but care has not been taken to make fair comparisons, then the results will not be trusted no matter what statistical method is employed or amount of data collected. At the same time, correct application of statistical methods is also crucial to correct interpretation of the results. Consider a simple scenario: The deposition area in a production fab has been running a standard process recipe for several months, and has monitored defect counts on each wafer. An average of 12 defects per 8 in. wafer has been observed over that time. The deposition engineer believes that a change in the gasket (from a different supplier) on the deposition vacuum chamber can reduce the number of defects. The engineer makes the change and runs two lots of 25 wafers each with the change (observing an average of 10 defects per wafer), followed by two additional lots with the original gasket type (observing an average of 11 defects per wafer). Should the change be made permanently? A difference in output has been observed, but a key question remains: Is the change statistically significant? That is to say, considering the data collected and the system's characteristics, has a change really occurred, or might the same results be explained by chance?
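One simple sketch of this question (not the article's own analysis) treats the per-wafer defect counts as Poisson, so that the variance of each sample rate equals its mean, and compares the two observed rates with an approximate z-statistic. The sample sizes and averages below are the hypothetical numbers from the scenario.

```python
import math

# Hypothetical averages from the gasket scenario (50 wafers per condition).
n_new, mean_new = 50, 10.0   # two lots with the new gasket
n_old, mean_old = 50, 11.0   # two follow-up lots with the original gasket

# For Poisson counts the variance equals the mean, so an approximate
# standard error for the difference of the two per-wafer rates is:
se = math.sqrt(mean_new / n_new + mean_old / n_old)
z = (mean_old - mean_new) / se

print(f"z = {z:.2f}")                         # 1.54
print("significant at 95%?", abs(z) > 1.96)   # False
```

Under this sketch the observed difference of one defect per wafer is not significant at the 95% level, which is exactly the kind of conclusion the formal methods below are designed to reach defensibly.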
Overview of Statistical Methods
Different statistical methods are used in semiconductor manufacturing to understand and answer questions such as that posed above, depending on the particular problem. Table 1 summarizes the most common methods and the purposes for which they are used, and it serves as a brief outline of this article. The key engineering tasks in semiconductor manufacturing drawing on these methods are summarized in Table 2. In all cases, the issues of sampling plans and significance of the findings must be considered, and all sections will periodically address these issues. To highlight these concepts, note that in Table 1 the words "different," "same," "good," and "improve" are mentioned. These words tie together two critical issues: statistical significance and engineering importance. When applying statistics to the data, one first determines if a statistically significant difference exists; otherwise the data (or experimental effects) are assumed to be the same. Thus, what is really tested is whether there is a difference that is big enough to find statistically. Engineering needs and principles determine whether that difference is good, whether the difference actually matters, or whether the cost of switching to a different process will be offset by the estimated improvements. The size of the difference that can be statistically seen is determined by the sampling plans and the statistical method used. Consequently, engineering needs must enter into the design of the experiments (sampling plan) so that the statistical test will be able to see a difference of the appropriate size. Statistical methods provide the means to determine sampling and significance to meet the needs of the engineer and manager.
In the first section, we focus on the basic underlying issues of statistical distributions, paying particular attention to those distributions typically used to model aspects of semiconductor manufacturing, including the indispensable Gaussian distribution as well as the binomial and Poisson distributions (which are key to modeling of defect- and yield-related effects). An example use of basic distributions is to estimate the interval of oxide thickness in which the engineer is confident that 99% of wafers will reside; based on this interval, the engineer could then decide if the process is meeting specifications and define limits or tolerances for chip design and performance modeling.

In the second section, we review the fundamental tool of statistical inference, the hypothesis test, as summarized in Table 1. The hypothesis test is crucial in detecting
Table 1: Summary of Statistical Methods Typically Used in Semiconductor Manufacturing

1. Statistical distributions -- Basic material for statistical tests. Used to characterize a population based upon a sample.
2. Hypothesis testing -- Decide whether the data under investigation indicate that the elements of concern are the same or different.
3. Experimental design and analysis of variance -- Determine significance of factors and models; decompose observed variation into constituent elements.
4. Response surface modeling -- Understand relationships, determine process margin, and optimize the process.
5. Categorical modeling -- Use when the result or response is discrete (such as "very rough," "rough," or "smooth"). Understand relationships, determine process margin, and optimize the process.
6. Statistical process control -- Determine if the system is operating as expected.
differences in a process. Examples include: determining if the critical dimensions produced by two machines are different; deciding if adding a clean step will decrease the variance of the critical dimension etch bias; determining if no appreciable increase in particles will occur if the interval between machine cleans is extended by 10,000 wafers; or deciding if increasing the target doping level will improve a device's threshold voltage.

In the third section, we expand upon hypothesis testing to consider the fundamentals of experimental design and analysis of variance, including the issue of sampling required to achieve the desired degree of confidence in the existence of an effect or difference, as well as accounting for the risk of not detecting a difference. Extensions beyond single-factor experiments to the design of experiments which screen for effects due to several factors or their interactions are then discussed, and we describe the assessment of such experiments using formal analysis of variance methods. Examples of experimental design to enable decision-making abound in semiconductor manufacturing. For example, one might need to decide if adding an extra film or switching deposition methods will improve reliability, or decide which of three gas distribution plates (each with a different hole pattern) provides the most uniform etch process.

In the fourth section, we examine the construction of response surface or regression models of responses as a function of one or more continuous factors. Of particular importance are methods to assess the goodness of fit and error in the model, which are essential to appropriate use of regression models in optimization or decision-making. Examples here include determining the optimal values for temperature and pressure to produce wafers with no more than 2% nonuniformity in gate oxide thickness, or determining if typical variations in plasma power will cause out-of-specification materials to be produced.
Finally, we note that statistical process control (SPC) for monitoring the normal or expected behavior of a process is a critical statistical method (1,2). The fundamentals of statistical distributions and hypothesis testing discussed here bear directly on SPC; further details on statistical process monitoring and process optimization can be found in SEMICONDUCTOR FACTORY CONTROL AND OPTIMIZATION.
STATISTICAL DISTRIBUTIONS
Semiconductor technology development and manufacturing are often concerned with both continuous parameters (e.g., thin-film thicknesses, electrical performance parameters of transistors) and discrete parameters (e.g., defect counts and yield). In this section, we begin with a brief review of the fundamental probability distributions typically encountered in semiconductor manufacturing, as well as the sampling distributions which arise when one calculates statistics based on multiple measurements (3). An understanding of these distributions is crucial to understanding hypothesis testing, analysis of variance, and the other inference and statistical analysis methods discussed in later sections.
Descriptive Statistics
Descriptive statistics are often used to concisely present collections of data. Such descriptive statistics are based entirely (and only) on the available empirical data, and they do not assume any underlying probability model. Such descriptions include histograms (plots of relative frequency versus measured values or value ranges in some
Table 2: Engineering Tasks in Semiconductor Manufacturing and Statistical Methods Employed

Process and equipment control (e.g., CDs, thickness, deposition/removal rates, endpoint time, etest and sort data, yield) -- Control chart for x-bar (for wafer or lot mean) and s (for within-wafer uniformity) based on the distribution of each parameter. Control limits and control rules are based on the distribution and time patterns in the parameter.

Equipment matching -- ANOVA on x-bar and/or s over sets of equipment, fabs, or vendors.

Yield/performance modeling and prediction -- Regression of end-of-line measurement as response against in-line measurements as factors.

Debug -- ANOVA, categorical modeling, correlation to find the most influential factors to explain the difference between good versus bad material.

Feasibility of process improvement -- One-sample comparisons on limited data; single lot factorial experiment design; response surface models over a wider range of settings looking for opportunities for optimization, or to establish a process window over a narrower range of settings.

Validation of process improvement -- Two-sample comparison run over time to ensure a change is robust and works over multiple tools, and no subtle defects are incurred by the change.
parameter), as well as calculation of the mean:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i   (1)

(where n is the number of values observed), calculation of the median (the value in the middle of the data, with an equal number of observations below and above), and calculation of data percentiles. Descriptive statistics also include the sample variance and sample standard deviation:

s_x^2 = \mathrm{Sample\ Var}\{x\} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2   (2)

s_x = \mathrm{Sample\ std.\ dev.}\{x\} = \sqrt{s_x^2}   (3)
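Equations (1)-(3), along with the median, are available directly in Python's standard `statistics` module; the data values below are purely illustrative.

```python
import statistics

# Illustrative measurements (hypothetical values, arbitrary units)
data = [1, 2, 3, 4, 10]

print(statistics.mean(data))      # 4      -- Eq. (1)
print(statistics.median(data))    # 3
print(statistics.variance(data))  # 12.5   -- Eq. (2), note the n-1 denominator
print(statistics.stdev(data))     # ~3.536 -- Eq. (3)
```

Note that `statistics.variance` uses the n-1 (sample) denominator of Eq. (2); `statistics.pvariance` would give the population form.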
A drawback to such descriptive statistics is that they involve the study of observed data only and enable us to draw conclusions which relate only to that specific data. Powerful statistical methods, on the other hand, have come to be based instead on probability theory; this allows us to relate observations to some underlying probability model and thus make inferences about the population (the theoretical set of all possible observations) as well as the sample (those observations we have in hand). It is the use of these models that gives computed statistics (such as the mean) explanatory power.
Probability Model
Perhaps the simplest probability model of relevance in semiconductor manufacturing is the Bernoulli distribution. A Bernoulli trial is an experiment with two discrete outcomes: success or failure. We can model the a priori probability (based on historical data or theoretical knowledge) of a success simply as p. For example, we may have aggregate historical data that tells us that line yield is 95% (i.e., that 95% of product wafers inserted in the fab successfully emerge at the end of the line intact). We make the leap from this descriptive information to an assumption of an underlying probability model: we suggest that the probability of any one wafer making it through the line is equal to 0.95. Based on that probability model, we can predict an outcome for a new wafer which has not yet been processed and was not part of the original set of observations. Of course, the use of such probability models involves assumptions -- for example, that the fab and all factors affecting line yield are essentially the same for the new wafer as for those used in constructing the probability model. Indeed, we often find that each wafer is not independent of other wafers, and such assumptions must be examined carefully.
Normal (Gaussian) Distribution. In addition to discrete probability distributions, continuous distributions also play a crucial role in semiconductor manufacturing. Quite often, one is interested in the probability density function (or pdf) for some parametric value. The most important continuous distribution (in large part due to the central limit theorem) is the Gaussian or normal distribution. We can write that a random variable x is distributed as a normal distribution with mean \mu and variance \sigma^2 as x \sim N(\mu, \sigma^2). The probability density function for x is given by

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}   (4)

which is also often discussed in unit normal form through the normalization z = (x-\mu)/\sigma, so that z \sim N(0, 1):

f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}   (5)
Given a continuous probability density function, one can talk about the probability of finding a value in some range. Consider the oxide thickness at the center of a wafer after rapid thermal oxidation in a single-wafer system. If this oxide thickness is normally distributed with \mu = 100 and \sigma^2 = 10 (or a standard deviation of 3.16), then the probability of any one such measurement x falling between 105 and 120 can be determined as

\Pr(105 \le x \le 120) = \int_{105}^{120} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
  = \Pr\left(\frac{105-\mu}{\sigma} \le z \le \frac{120-\mu}{\sigma}\right)
  = \Phi(6.325) - \Phi(1.581) = 0.0569   (6)

where \Phi(z) is the cumulative density function for the unit normal, which is available via tables or statistical analysis packages.
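The calculation of Eq. (6) can be reproduced without tables using `statistics.NormalDist` (Python 3.8+), which provides the cumulative distribution function \Phi directly:

```python
from statistics import NormalDist

# Oxide thickness model from the text: mean 100, variance 10
x = NormalDist(mu=100, sigma=10 ** 0.5)

# Pr(105 <= x <= 120), Eq. (6)
p = x.cdf(120) - x.cdf(105)
print(f"Pr(105 <= x <= 120) = {p:.4f}")   # 0.0569
```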
We now briefly summarize other common discrete probability mass functions (pmfs) and continuous probability density functions (pdfs) that arise in semiconductor manufacturing, and then we turn to sampling distributions that are also important.
Binomial Distribution. Very often we are interested in the number of successes in repeated Bernoulli trials (that is, repeated succeed-or-fail trials). If x is the number of successes in n trials, then x is distributed as a binomial distribution x \sim B(n, p), where p is the probability of each individual success. The pmf is given by
f(x; p, n) = \binom{n}{x} p^x (1-p)^{n-x}   (7)

where \binom{n}{x} is

\binom{n}{x} = \frac{n!}{x!(n-x)!}   (8)
For example, if one is starting a 25-wafer lot in the fab above, one may wish to know the probability that some number x (x being between 0 and 25) of those wafers will survive. For the line yield model of p = 95%, these probabilities are shown in Fig. 1. When n is very large (much larger than 25), the binomial distribution is well approximated by a Gaussian distribution.
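The probabilities plotted in Fig. 1 follow directly from Eqs. (7) and (8); a minimal sketch for the p = 95% case:

```python
from math import comb

n, p = 25, 0.95   # 25-wafer lot, line yield of 95%

def binom_pmf(x: int) -> float:
    """Eq. (7): probability that exactly x of the n wafers survive."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(f"P(all 25 survive)   = {binom_pmf(25):.3f}")   # 0.277
print(f"P(24 of 25 survive) = {binom_pmf(24):.3f}")   # 0.365

# Sanity check: the pmf sums to 1 over x = 0..25.
assert abs(sum(binom_pmf(x) for x in range(n + 1)) - 1.0) < 1e-9
```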
A key assumption in using the Bernoulli distribution is that each wafer is independent of the other wafers in the lot. One often finds that many lots complete processing with all wafers intact, while other lots suffer multiple damaged wafers. While an aggregate line yield of 95% may result, independence does not hold, and alternative approaches such as formation of empirical distributions should be used.
Poisson Distribution. A third discrete distribution is highly relevant to semiconductor manufacturing. An approximation to the binomial distribution that applies when n is large and p is small is the Poisson distribution:

f(x; \lambda) = \frac{e^{-\lambda}\lambda^x}{x!}   (9)

for integer x = 0, 1, 2, \ldots and \lambda = np.
For example, one can examine the number of chips that fail on the first day of operation: The probability p of failure for any one chip is (hopefully) exceedingly small, but one tests a very large number n of chips, so that the observed number of failed chips, with mean \lambda = np, is Poisson distributed. An even more common application of the Poisson model is in defect modeling. For example, if defects are Poisson-distributed with a mean defect count of \lambda = 3 particles per 200 mm wafer, one can ask questions about the probability of observing x [e.g., Eq. (9)] defects on a sample wafer. In this case, f(9; \lambda = 3) = e^{-3}(3^9)/9! = 0.0027, or less than 0.3% of the time would we expect to observe exactly 9 defects. Similarly, the probability that 9 or more defects are observed is 1 - \sum_{x=0}^{8} f(x; 3) = 0.0038.
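Both defect-count probabilities above can be checked numerically from Eq. (9):

```python
import math

def poisson_pmf(x: int, lam: float) -> float:
    """Eq. (9): Poisson probability of exactly x defects, mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 3.0   # mean of 3 defects per wafer, as in the text

print(f"P(exactly 9 defects) = {poisson_pmf(9, lam):.4f}")   # 0.0027
tail = 1 - sum(poisson_pmf(x, lam) for x in range(9))
print(f"P(9 or more defects) = {tail:.4f}")                  # 0.0038
```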
In the case of defect modeling, several other distributions have historically been used, including the exponential, hypergeometric, modified Poisson, and negative binomial. Substantial additional work has been reported in yield modeling to account for clustering, and to understand the relationship between defect models and yield (e.g., see Ref. 4).
Population Versus Sample Statistics
We now have the beginnings of a statistical inference theory.
Before proceeding with formal hypothesis testing in the next section, we first note that the earlier descriptive statistics of mean and variance take on new interpretations in the probabilistic framework. The mean is the expectation (or first moment) over the distribution:

\mu_x = E\{x\} = \int_{-\infty}^{\infty} x f(x)\, dx = \sum_{i=1}^{n} x_i \Pr(x_i)   (10)

for continuous pdfs and discrete pmfs, respectively. Similarly, the variance is the expectation of the squared deviation from the mean (or the second central moment) over the distribution:

\sigma_x^2 = \mathrm{Var}\{x\} = \int_{-\infty}^{\infty} (x - E\{x\})^2 f(x)\, dx = \sum_{i=1}^{n} (x_i - E\{x_i\})^2 \Pr(x_i)   (11)

Further definitions from probability theory are also highly useful, including the covariance:

\sigma_{xy}^2 = \mathrm{Cov}\{x, y\} = E\{(x - E\{x\})(y - E\{y\})\} = E\{xy\} - E\{x\}E\{y\}   (12)

where x and y are each random variables with their own probability distributions, as well as the related correlation coefficient:

\rho_{xy} = \mathrm{Corr}\{x, y\} = \frac{\mathrm{Cov}\{x, y\}}{\sqrt{\mathrm{Var}\{x\}\mathrm{Var}\{y\}}} = \frac{\sigma_{xy}^2}{\sigma_x \sigma_y}   (13)
Figure 1. Probabilities for number of wafers surviving from a lot of 25 wafers, assuming a line yield of 90%, calculated using the binomial distribution.

The above definitions for the mean, variance, covariance, and correlation all relate to the underlying or
assumed population. When one only has a sample (that is, a finite number of values drawn from some population), one calculates the corresponding sample statistics. These are no longer descriptive of only the sample we have; rather, these statistics are now estimates of parameters in a probability model. Corresponding to the population parameters above, the sample mean \bar{x} is given by Eq. (1), the sample variance s_x^2 by Eq. (2), and the sample standard deviation s_x by Eq. (3); the sample covariance is given by

s_{xy}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})   (14)

and the sample correlation coefficient is given by

r_{xy} = \frac{s_{xy}^2}{s_x s_y}   (15)
Sampling Distributions
Sampling is the act of making inferences about populations based on some number of observations. Random sampling is especially important and desirable, where each observation is independent and identically distributed. A statistic is a function of sample data which contains no further unknowns (e.g., the sample mean can be calculated from the observations and has no further unknowns). It is important to note that a sample statistic is itself a random variable and has a sampling distribution which is usually different than the underlying population distribution. In order to reason about the likelihood of observing a particular statistic (e.g., the mean of five measurements), one must be able to construct the underlying sampling distribution.

Sampling distributions are also intimately bound up
with estimation of population distribution parameters. For example, suppose we know that the thickness of gate oxide (at the center of the wafer) is normally distributed: T_i \sim N(\mu, \sigma^2) = N(100, 10). We sample five random wafers and compute the mean oxide thickness \bar{T} = \frac{1}{5}(T_1 + T_2 + \cdots + T_5). We now have two key questions: (1) What is the distribution of \bar{T}? (2) What is the probability that a \le \bar{T} \le b?

In this case, given the expression for \bar{T} above, we can use the fact that the variance of a scaled random variable ax is simply a^2 \mathrm{Var}\{x\}, and the variance of a sum of independent random variables is the sum of the variances:

\bar{T} \sim N(\mu, \sigma^2/n)   (16)

where \mu_{\bar{T}} = \mu_T = \mu by the definition of the mean. Thus, when we want to reason about the likelihood of observing values of \bar{T} (that is, averages of five sample measurements) lying within particular ranges, we must be sure to use the distribution for \bar{T} rather than that for the underlying distribution T. Thus, in this case the probability of finding a \bar{T} value between 105 and 120 is

\Pr(105 \le \bar{T} \le 120) = \Pr\left(\frac{105-\mu}{\sigma/\sqrt{n}} \le z \le \frac{120-\mu}{\sigma/\sqrt{n}}\right)
  = \Pr\left(\frac{105-100}{\sqrt{10/5}} \le z \le \frac{120-100}{\sqrt{10/5}}\right)
  = \Phi(14.142) - \Phi(3.536) = 0.0002   (17)

which is relatively unlikely. Compare this to the result from Eq. (6) for the probability 0.0569 of observing a single value (rather than a five-sample average) in the range 105 to 120.

Note that the same analysis would apply if T_i was itself formed as an average of underlying oxide thickness measurements (e.g., from a nine-point pattern across the wafer), if that average T_i and each of the five wafers used in calculating the sample average \bar{T} are independent and also distributed as N(100, 10).
Chi-Square Distribution and Variance Estimates. Several other vitally important distributions arise in sampling, and are essential to making statistical inferences in experimental design or regression. The first of these is the chi-square distribution. If x_i \sim N(0, 1) for i = 1, 2, \ldots, n and y = x_1^2 + x_2^2 + \cdots + x_n^2, then y is distributed as chi-square with n degrees of freedom, written as y \sim \chi^2_n. While formulas for the probability density function for \chi^2 exist, they are almost never used directly and are again instead tabulated or available via statistical packages. The typical use of the \chi^2 is for finding the distribution of the variance when the mean is known.

Suppose we know that x_i \sim N(\mu, \sigma^2). As discussed previously, we know that the mean over our n observations is distributed as \bar{x} \sim N(\mu, \sigma^2/n). How is the sample variance s^2 over our n observations distributed? We note that each (x_i - \bar{x}) is normally distributed; thus if we normalize our sample variance by \sigma^2 we have a chi-square distribution:
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}   (18)

where one degree of freedom is used in calculation of \bar{x}. Thus, the sample variance for n observations drawn from N(\mu, \sigma^2) is distributed as chi-square as shown in Eq. (18). For our oxide thickness example where T_i \sim N(100, 10) and we form five-wafer samples as above, the normalized sample variance 4 s_T^2 / 10 \sim \chi^2_4, and 95% of the time the observed sample variance will be less than 1.78\sigma^2. A control chart could thus be constructed to detect when unusually large five-wafer sample variances occur.
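A seeded Monte Carlo sketch can verify the claim of Eq. (18): for samples of n = 5, the normalized sample variance (n-1)s^2/\sigma^2 should behave as \chi^2_4, which has mean 4 and variance 8.

```python
import random
import statistics

# Monte Carlo check of Eq. (18): (n-1) s^2 / sigma^2 ~ chi-square, n-1 dof.
random.seed(0)
mu, sigma2, n = 100, 10, 5

samples = []
for _ in range(20000):
    wafers = [random.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    samples.append((n - 1) * statistics.variance(wafers) / sigma2)

# Chi-square with 4 degrees of freedom has mean 4 and variance 8.
print(f"mean     ~ {statistics.mean(samples):.2f}")      # ~4
print(f"variance ~ {statistics.variance(samples):.2f}")  # ~8
```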
Student t Distribution. The Student t distribution is another important sampling distribution. The typical use is when we want to find the distribution of the sample mean when the true standard deviation is not known. Consider x_i \sim N(\mu, \sigma^2) and

\frac{\bar{x}-\mu}{s/\sqrt{n}} = \frac{(\bar{x}-\mu)/(\sigma/\sqrt{n})}{\sqrt{s^2/\sigma^2}} \sim \frac{N(0,1)}{\sqrt{\frac{1}{n-1}\chi^2_{n-1}}} = t_{n-1}   (19)

In the above, we have used the definition of the Student t distribution: if y is a random variable distributed as \chi^2_k, then z/\sqrt{y/k} is distributed as a Student t with k degrees of freedom, or t_k, if z is a random variable distributed as N(0, 1). As discussed previously, the normalized sample variance is chi-square distributed, so our definition does indeed apply. We thus find that the normalized sample mean is distributed as a Student t with n-1 degrees of freedom when we do not know the true standard deviation and must estimate it based on the sample as well. We note that as k \to \infty, the Student t approaches a unit normal distribution N(0, 1).
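A seeded Monte Carlo sketch makes the heavier tails of Eq. (19) concrete: for n = 5, the normalized sample mean exceeds the t_4 97.5% point of 2.776 about 5% of the time (two-sided), whereas it exceeds the normal 97.5% point of 1.960 far more than 5% of the time.

```python
import random
import statistics

random.seed(1)
mu, sigma, n = 100, 10 ** 0.5, 5

exceed_t = exceed_z = 0
trials = 40000
for _ in range(trials):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    # Normalized sample mean of Eq. (19), using the *estimated* std. dev.
    t = (statistics.mean(x) - mu) / (statistics.stdev(x) / n ** 0.5)
    exceed_t += abs(t) > 2.776   # t_4 97.5% point
    exceed_z += abs(t) > 1.960   # normal 97.5% point

print(f"fraction beyond +/-2.776: {exceed_t / trials:.3f}")  # ~0.05
print(f"fraction beyond +/-1.960: {exceed_z / trials:.3f}")  # ~0.12, not 0.05
```

Using normal critical values with a small-sample estimated standard deviation would thus more than double the false-alarm rate, which is exactly why the t distribution is needed.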
F Distribution and Ratios of Variances. The last sampling distribution we wish to discuss here is the F distribution. We shall see that the F distribution is crucial in analysis of variance (ANOVA) and experimental design in determining the significance of effects, because the F distribution is concerned with the probability density function for the ratio of variances. If y_1 \sim \chi^2_u (that is, y_1 is a random variable distributed as chi-square with u degrees of freedom) and similarly y_2 \sim \chi^2_v, then the random variable Y = \frac{y_1/u}{y_2/v} \sim F_{u,v} (that is, Y is distributed as F with u and v degrees of freedom). The typical use of the F distribution is to compare the spread of two distributions. For example, suppose that we have two samples x_1, x_2, \ldots, x_n and w_1, w_2, \ldots, w_m, where x_i \sim N(\mu_x, \sigma_x^2) and w_i \sim N(\mu_w, \sigma_w^2). Then

\frac{s_x^2/\sigma_x^2}{s_w^2/\sigma_w^2} \sim F_{n-1, m-1}   (20)
Point and Interval Estimation
The above population and sampling distributions form the basis for the statistical inferences we wish to draw in many semiconductor manufacturing examples. One important use of the sampling distributions is to estimate population parameters based on some number of observations. A point estimate gives a single best estimated value for a population parameter. For example, the sample mean \bar{x} is an estimate for the population mean \mu. Good point estimates are representative or unbiased (that is, the expected value of the estimate should be the true value), as well as minimum variance (that is, we desire the estimator with the smallest variance in that estimate). Often we restrict ourselves to linear estimators; for example, the best linear unbiased estimator (BLUE) for various parameters is typically used.

Many times we would like to determine a confidence interval for estimates of population parameters; that is, we want to know how likely it is that \mu is within some particular range of \bar{x}. Asked another way, to a desired probability, where will \mu actually lie given an estimate \bar{x}? Such interval estimation is considered next; these intervals are used in later sections to discuss hypothesis testing.
First, let us consider the confidence interval for estimation of the mean, when we know the true variance of the process. Given that we make $n$ independent and identically distributed samples from the population, we can calculate the sample mean as in Eq. (1). As discussed earlier, we know that the sample mean is normally distributed as shown in Eq. (16). We can thus determine the probability that an observed $\bar{x}$ is larger than $\mu$ by a given amount:

$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}, \qquad \Pr(z \ge z_\alpha) = 1 - \Phi(z_\alpha) = \alpha \qquad (21)$$
where $z_\alpha$ is the alpha percentage point for the normalized variable $z$ (that is, $z_\alpha$ measures how many standard deviations greater than $\mu$ we must be in order for the
integrated probability density to the right of this value to equal $\alpha$). As shown in Fig. 2, we are usually interested in asking the question the other way around: to a given probability [$100(1-\alpha)\%$, e.g., 95%], in what range will the true mean lie given an observed $\bar{x}$?

$$\Pr\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha \qquad (22)$$

that is, $\mu = \bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$.
For our oxide thickness example, we can now answer an important question. If we calculate a five-wafer average, how far away from the average of 100 Å must the value be for us to decide that the process has changed away from the mean (given a process variance of 10 Å²)? In this case, we must define the confidence in saying that the process has changed; this is equivalent to the probability of observing the deviation by chance. With 95% confidence, we can declare that the process is different than the mean of 100 Å if we observe $\bar{T} < 97.228$ or $\bar{T} > 102.772$:

$$\alpha = 2\Pr\!\left(\frac{\bar{T} - \mu}{\sigma/\sqrt{n}} \ge z_{\alpha/2}\right) = 2\Pr\!\left(\frac{\bar{T} - \mu}{\sigma/\sqrt{n}} \ge 1.95996\right) = 0.05, \qquad |\bar{T} - \mu| \ge 1.95996\,\frac{\sigma}{\sqrt{n}} = 2.7718 \qquad (23)$$
A similar result occurs when the true variance is not known. The $100(1-\alpha)\%$ confidence interval in this case is determined using the appropriate Student t sampling distribution for the sample mean:

$$\Pr\left(\bar{x} - t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}\right) = 1 - \alpha, \qquad \mu = \bar{x} \pm t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}} \qquad (24)$$
In some cases, we may also desire a confidence interval on the estimate of variance:

$$\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}} \qquad (25)$$
For example, consider a new etch process which we want to characterize. We calculate the average etch rate (based on 21-point spatial pre- and post-thickness measurements across the wafer) for each of four wafers we process. The wafer averages are 502, 511, 515, and 504 Å/min. The sample mean is $\bar{x} = 508$ Å/min, and the sample standard deviation is $s = 6.1$ Å/min. The 95% confidence intervals for the true mean and std. dev. are $\mu = 508 \pm 9.6$ Å/min and $3.4 \le \sigma \le 22.6$ Å/min. Increasing the number of observations (wafers) can improve our estimates and intervals for the mean and variance, as discussed in the next section.
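As a sketch of these interval computations on the etch-rate data above (using only the Python standard library; the t and chi-square percentage points for 3 degrees of freedom are taken from standard tables rather than computed):

```python
from statistics import mean, stdev
from math import sqrt

rates = [502, 511, 515, 504]   # wafer-average etch rates (Angstrom/min)
n = len(rates)
xbar = mean(rates)             # sample mean
s = stdev(rates)               # sample standard deviation (n - 1 divisor)

t_975 = 3.182                  # t_{0.025, 3} from standard tables
chi2_hi = 9.348                # chi^2_{0.025, 3}
chi2_lo = 0.216                # chi^2_{0.975, 3}

half_width = t_975 * s / sqrt(n)            # Eq. (24): mean CI half-width
sigma_lo = sqrt((n - 1) * s**2 / chi2_hi)   # Eq. (25): lower bound on sigma
sigma_hi = sqrt((n - 1) * s**2 / chi2_lo)   # Eq. (25): upper bound on sigma

print(f"mean = {xbar:.0f} +/- {half_width:.1f} A/min")
print(f"std. dev. in [{sigma_lo:.1f}, {sigma_hi:.1f}] A/min")
```

Note how wide the interval on the standard deviation is for only four wafers; the chi-square interval shrinks slowly with added observations.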
Many other cases (e.g., one-sided confidence intervals)can also be determined based on manipulation of theappropriate sampling distributions, or through consultationwith more extensive texts (5).
HYPOTHESIS TESTING
Given an underlying probability distribution, it now becomes possible to answer some simple, but very important, questions about any particular observation. In this section, we formalize the decision-making earlier applied to our oxide thickness example. Suppose as before that we know that oxide thickness is normally distributed, with a mean of 100 Å and standard deviation of 3.162 Å. We may know this based on a very large number of previous historical measurements, so that we can well approximate the true population of oxide thicknesses out of a particular furnace with these two distribution parameters. We suspect something just changed in the equipment, and we want to determine if there has been an impact on oxide thickness. We take a new observation (i.e., run a new wafer and form our oxide thickness value as usual, perhaps as the average of nine measurements at fixed positions across the wafer). The key question is: What is the probability that we would get this observation if the process has not changed, versus the probability of getting this observation if the process has indeed changed?

We are conducting a hypothesis test. Based on the observation, we want to test the hypothesis (label this $H_1$) that the underlying distribution mean has increased from $\mu_0$ by some amount $\delta$ to $\mu_1$. The null hypothesis $H_0$ is that nothing has changed and the true mean is still $\mu_0$. We are looking for evidence to convince us that $H_1$ is true.
We can plot the probability density function for each ofthe two hypotheses under the assumption that the variancehas not changed, as shown in Fig. 3. Suppose now that we
Figure 2. Probability density function for the sample mean. The sample mean is unbiased (the expected value of the sample mean is the true mean $\mu$). The shaded portion in the tail captures the probability that the sample mean is greater than a given distance from the true mean.
observe the value $x_i$. Intuitively, if the value of $x_i$ is closer to $\mu_0$ than to $\mu_1$, we will more than likely believe that the value comes from the $H_0$ distribution than the $H_1$ distribution. Under a maximum likelihood approach, we can compare the probability density functions:

$$\text{accept } H_1 \text{ if } f_1(x_i) > f_0(x_i) \qquad (26)$$

that is, if $f_1(x_i)$ is greater than $f_0(x_i)$ for our observation, we reject the null hypothesis $H_0$ (that is, we accept the alternative hypothesis $H_1$). If we have prior belief (or other knowledge) affecting the a priori probabilities of $H_0$ and $H_1$, these can also be used to scale the distributions $f_0$ and $f_1$ to determine a posteriori probabilities for $H_0$ and $H_1$. Similarly, we can define the acceptance region as the set of values of $x_i$ for which we accept each respective hypothesis. In Fig. 3, we have the rule: accept $H_1$ if $x_i > x^*$, and accept $H_0$ if $x_i \le x^*$.
The maximum likelihood approach provides a powerful but somewhat complicated means for testing hypotheses. In typical situations, a simpler approach suffices. We select a confidence $1 - \alpha$ with which we must detect a difference, and we pick a decision point based on that confidence. For two-sided detection with a unit normal distribution, for example, we select $z > z_{\alpha/2}$ and $z < -z_{\alpha/2}$ as the regions for declaring that unusual behavior (i.e., a shift) has occurred.

Consider our historical data on oxide thickness, where $T_i \sim N(100, 10)$. To be 95% confident that a single new observation indicates a change in mean, we must observe $T < 93.8$ or $T > 106.2$. On the other hand, if we form five-wafer averages, then we must observe $\bar{T} < 97.228$ or $\bar{T} > 102.772$ as in Eq. (23) to convince us (to 95% confidence) that our process has changed.
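These decision thresholds can be reproduced with a short calculation (a sketch using the Python standard library, with the historical oxide-thickness distribution $N(100, 10)$ from the text):

```python
from statistics import NormalDist
from math import sqrt

mu, sigma = 100.0, sqrt(10)              # historical process: N(100, 10)
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided 95% point, about 1.96

for n in (1, 5):                         # single wafer vs. five-wafer average
    half = z * sigma / sqrt(n)
    print(f"n={n}: declare a shift if T < {mu - half:.3f} or T > {mu + half:.3f}")
```

The five-wafer average narrows the decision band by a factor of $\sqrt{5}$, which is exactly the benefit of sampling discussed in the next subsection.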
Alpha and Beta Risk (Type I and Type II Errors)
The hypothesis test gives a clear, unambiguous procedure for making a decision based on the distributions and assumptions outlined above. Unfortunately, there may be a substantial probability of making the wrong decision. In the maximum likelihood example of Fig. 3, for the single observation $x_i$ as drawn we accept the alternative hypothesis $H_1$. However, examining the distribution corresponding to $H_0$, we see that a nonzero probability exists that $x_i$ in fact belongs to $H_0$. Two types of errors are of concern:

$$\alpha = \Pr(\text{Type I error}) = \Pr(\text{reject } H_0 \mid H_0 \text{ is true}) = \int_{x^*}^{\infty} f_0(x)\,dx$$
$$\beta = \Pr(\text{Type II error}) = \Pr(\text{accept } H_0 \mid H_1 \text{ is true}) = \int_{-\infty}^{x^*} f_1(x)\,dx \qquad (27)$$

We note that the Type I error $\alpha$ (or probability of a false alarm) is based entirely on our decision rule and does not depend on the size of the shift we are seeking to detect. The Type II error $\beta$ (or probability of missing a real shift), on the other hand, depends strongly on the size of the shift. This Type II error can be evaluated for the distributions and decision rules above; for the normal distribution of Fig. 3, we find

$$\beta = \Phi(z_{\alpha/2} - \delta) - \Phi(-z_{\alpha/2} - \delta) \qquad (28)$$

where $\delta$ is the normalized shift we wish to detect.
The power of a statistical test is defined as

$$\text{Power} = 1 - \beta = \Pr(\text{reject } H_0 \mid H_0 \text{ is false}) \qquad (29)$$

that is, the power of the test is the probability of correctly rejecting $H_0$. Thus the power depends on the shift we wish to detect as well as the level of false alarms ($\alpha$, or Type I error) we are willing to endure. Power curves are often plotted of $1 - \beta$ versus the normalized shift to detect $d = \delta/\sigma$, for a given fixed $\alpha$ (e.g., $\alpha = 0.05$), and as a function of sampling size.
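The Type II error of Eq. (28) and the resulting power are straightforward to evaluate; the following sketch folds the sample size n into the normalized shift, since averaging n observations shrinks the standard deviation of the sampling distribution by $\sqrt{n}$:

```python
from statistics import NormalDist
from math import sqrt

Phi = NormalDist().cdf
z = NormalDist().inv_cdf(1 - 0.05 / 2)   # alpha = 0.05, two-sided

def beta(delta, n=1):
    """Type II error, Eq. (28), for normalized shift delta and sample size n."""
    d = delta * sqrt(n)                  # averaging n samples scales the shift
    return Phi(z - d) - Phi(-z - d)

for n in (2, 10, 30):
    print(f"n={n:2d}: beta = {beta(0.5, n):.3f}, power = {1 - beta(0.5, n):.3f}")
```

Sweeping this function over d and n reproduces the general shape of the power curves described above.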
Hypothesis Testing and Sampling Plans. In the previous discussion, we described a hypothesis test for detection of a shift in mean of a normal distribution, based on a single observation. In realistic situations, we can often make multiple observations and improve our ability to make a correct decision. If we take $n$ observations, then the sample
Figure 3. Distributions of $x$ under the null hypothesis $H_0$, and under the hypothesis $H_1$ that a positive shift $\delta$ has occurred such that $\mu_1 = \mu_0 + \delta$.
mean is normally distributed, but with a reduced variance $\sigma^2/n$, as illustrated by our five-wafer average of oxide thickness. It now becomes possible to pick an $\alpha$ risk associated with a decision on the sampling distribution that is acceptable, and then determine a sample size $n$ in order to achieve the desired level of $\beta$ risk. In the first step, we still select $z_{\alpha/2}$ based on the risk of false alarm $\alpha$, but now this determines the actual unnormalized decision point $x^*$ using $x^* = \mu + z_{\alpha/2}\,\sigma/\sqrt{n}$. We finally pick $n$ (which determines $\sigma/\sqrt{n}$ as just defined) based on the Type II error $\beta$, which also depends on the size of the normalized shift $d = \delta/\sigma$ to be detected and the sample size $n$. Graphs and tables are indispensable in selecting the sample size for a given $d$ and $\beta$; for example, Fig. 4 shows the Type II error associated with sampling from a unit normal distribution, for a fixed Type I error of 0.05, and as a function of sampling size $n$ and shift to be detected $d$.
Consider once more our wafer average oxide thickness $\bar{T}$. Suppose that we want to detect a 1.58 Å (or $0.5\sigma$) shift from the 100 Å mean. We are willing to accept a false alarm 5% of the time ($\alpha = 0.05$) but can accept missing the shift 20% of the time ($\beta = 0.20$). From Fig. 4, we need to take $n = 30$ samples. Our decision rule is to declare that a shift has occurred if our 30 sample average $\bar{T} < 98.9$ or $\bar{T} > 101.1$. If it is even more important to detect the $0.5\sigma$ shift in mean (e.g., with $\beta = 0.10$ probability of missing the shift), then from Fig. 4 we now need $n = 40$ samples, and our decision rule is to declare a shift if the 40 sample average $\bar{T} < 99.0$ or $\bar{T} > 101.0$.
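The chart-based sample sizes above can be cross-checked against the common closed-form approximation $n \approx ((z_{\alpha/2} + z_\beta)/d)^2$, which neglects the far tail of Eq. (28); it yields values close to (slightly above) the n = 30 and n = 40 read from Fig. 4:

```python
from statistics import NormalDist
from math import sqrt, ceil

nd = NormalDist()
sigma, d = sqrt(10), 0.5                 # process sigma; shift in sigma units
z_a = nd.inv_cdf(1 - 0.05 / 2)           # alpha = 0.05, two-sided

sizes = {}
for beta in (0.20, 0.10):
    z_b = nd.inv_cdf(1 - beta)
    n = ceil(((z_a + z_b) / d) ** 2)     # approximate required sample size
    sizes[beta] = n
    half = z_a * sigma / sqrt(n)
    print(f"beta = {beta}: n = {n}, decide a shift if |T - 100| > {half:.2f}")
```

The approximation errs on the conservative side relative to a graphical read of the power curves, which is usually acceptable when planning a sampling scheme.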
Control Chart Application. The concepts of hypothesis testing, together with issues of Type I and Type II error as well as sample size determination, have one of their most common applications in the design and use of control charts. For example, the $\bar{x}$ control chart can be used to detect when a significant change from normal operation occurs. The assumption here is that when the underlying process is operating under control, the fundamental process population is distributed normally as $x_i \sim N(\mu, \sigma^2)$. We periodically draw a sample of $n$ observations and calculate the average of those observations ($\bar{x}$). In the $\bar{x}$ control chart, we essentially perform a continuous hypothesis test, where we set the upper and lower control limits (UCL and LCL) such that $\bar{x}$ falls outside these control limits with probability $\alpha$ when the process is truly under control (e.g., we usually select $\pm 3\sigma$ limits to give a 0.27% chance of false alarm). We would then choose the sample size $n$ so as to have a particular power (that is, a probability of actually detecting a shift of a given size) as previously described. The control limits (CL) would then be set at

$$\text{CL} = \mu \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \qquad (30)$$
Note that care must be taken to check key assumptions in setting sample sizes and control limits. In particular, one should verify that each of the $n$ observations to be utilized in forming the sample average (or the sample standard deviation in an s chart) is independent and identically distributed. If, for example, we aggregate or take the average of $n$ successive wafers, we must be sure that there are no systematic within-lot sources of variation. If such additional variation sources do exist, then the estimate of variance formed during process characterization (e.g., in our etch example) will be inflated by accidental inclusion of this systematic contribution, and any control limits one sets will be wider than appropriate to detect real changes in the underlying random distribution. If systematic variation does indeed exist, then a new statistic should be formed that blocks against that variation (e.g., by specifying which wafers should be measured out of each lot), and the control chart based on the distribution of that aggregate statistic.
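A sketch of the $\bar{x}$ chart limits of Eq. (30), using the oxide-thickness numbers from this article (the conventional $\pm 3\sigma$ limits correspond to $\alpha = 0.0027$):

```python
from statistics import NormalDist
from math import sqrt

mu, sigma, n = 100.0, sqrt(10), 5        # in-control process and subgroup size
alpha = 0.0027                           # conventional 3-sigma false-alarm rate
z = NormalDist().inv_cdf(1 - alpha / 2)  # approximately 3.0

lcl = mu - z * sigma / sqrt(n)           # Eq. (30)
ucl = mu + z * sigma / sqrt(n)
print(f"LCL = {lcl:.2f}, UCL = {ucl:.2f}")
```

Widening n tightens the limits around the target mean, which is the chart-design tradeoff described above.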
EXPERIMENTAL DESIGN AND ANOVA
In this section, we consider application and further development of the basic statistical methods already discussed to the problem of designing experiments to investigate particular effects, and to aid in the construction of models for use in process understanding and optimization. First, we should recognize that the hypothesis testing methods are precisely those needed to determine if a new treatment induces a significant effect in comparison to a process with a known distribution. These are known as one-sample tests. In this section, we next consider two-sample
Figure 4. Probability of Type II error ($\beta$) for a unit normal process with Type I error fixed at $\alpha = 0.05$, as a function of sample size $n$ and shift delta (equal to the normalized shift $\delta/\sigma$) to be detected.
tests to compare two treatments in an effort to detect a treatment effect. We will then extend to the analysis of experiments in which many treatments are to be compared (k-sample tests), and we present the classic tool for studying the results --- ANOVA (6).
Comparison of Treatments: Two-Sample Tests
Consider an example where a new process B is to be compared against the process of record (POR), process A. In the simplest case, we have enough historical information on process A that we assume values for the yield mean $\mu_A$ and standard deviation $\sigma$. If we gather a sample of 10 wafers for the new process, we can perform a simple one-sample hypothesis test using the 10-wafer sampling distribution and methods already discussed, assuming that both process A and B share the same variance.
Consider now the situation where we want to compare two new processes. We will fabricate 10 wafers with process A and another 10 wafers with process B, and then we will measure the yield for each wafer after processing. In order to block against possible time trends, we alternate between process A and process B on a wafer by wafer basis. We are seeking to test the hypothesis that process B is better than process A, that is $H_1$: $\mu_B > \mu_A$, as opposed to the null hypothesis $H_0$: $\mu_B = \mu_A$. Several approaches can be used (5).
In the first approach, we assume that in each process A and B we are random sampling from an underlying normal distribution, with (1) an unknown mean for each process and (2) a constant known value for the population standard deviation of $\sigma$. Then $n_A = 10$, $n_B = 10$, $\mathrm{Var}\{\bar{y}_A\} = \sigma^2/n_A$, and $\mathrm{Var}\{\bar{y}_B\} = \sigma^2/n_B$. We can then construct the sampling distribution for the difference in means as

$$\mathrm{Var}\{\bar{y}_B - \bar{y}_A\} = \frac{\sigma^2}{n_A} + \frac{\sigma^2}{n_B} = \sigma^2\left(\frac{1}{n_A} + \frac{1}{n_B}\right) \qquad (31)$$

$$\sigma_{B-A} = \sigma\sqrt{\frac{1}{n_A} + \frac{1}{n_B}} \qquad (32)$$

Even if the original process is moderately non-normal, the distribution of the difference in sample means will be approximately normal by the central limit theorem, so we can normalize as

$$z_0 = \frac{(\bar{y}_B - \bar{y}_A) - (\mu_B - \mu_A)}{\sigma\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}} \qquad (33)$$

allowing us to examine the probability of observing the mean difference $\bar{y}_B - \bar{y}_A$ based on the unit normal distribution, $\Pr(z \ge z_0)$.
The disadvantage of the above method is that it depends on knowing the population standard deviation $\sigma$. If such information is indeed available, using it will certainly improve the ability to detect a difference. In the second approach, we assume that our 10 wafer samples are again drawn by random sampling on an underlying normal population, but in this case we do not assume that we know a priori what the population variance is. In this case, we must also build an internal estimate of the variance. First, we estimate $s_A^2$ from the individual deviations:

$$s_A^2 = \frac{1}{n_A - 1}\sum_{i=1}^{n_A}(y_{Ai} - \bar{y}_A)^2 \qquad (34)$$

and similarly for $s_B^2$. The pooled variance is then

$$s^2 = \frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2} \qquad (35)$$

Once we have an estimate for the population variance, we can perform our t test using this pooled estimate:

$$t_0 = \frac{(\bar{y}_B - \bar{y}_A) - (\mu_B - \mu_A)}{s\sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}} \qquad (36)$$
One must be careful in assuming that process A and B sharea common variance; in many cases this is not true and moresophisticated analysis is needed (5).
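A sketch of the pooled two-sample t test of Eqs. (34)-(36), on hypothetical per-wafer yield values (the numbers are illustrative, not from the text):

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical per-wafer yields (%) for processes A and B -- illustrative only
yA = [78, 81, 79, 83, 80, 77, 82, 79, 81, 80]
yB = [82, 84, 81, 85, 83, 80, 86, 83, 84, 82]
nA, nB = len(yA), len(yB)

# Pooled variance, Eq. (35), from the two sample variances, Eq. (34)
sp2 = ((nA - 1) * stdev(yA) ** 2 + (nB - 1) * stdev(yB) ** 2) / (nA + nB - 2)
# Test statistic, Eq. (36), under H0: mu_B = mu_A
t0 = (mean(yB) - mean(yA)) / (sqrt(sp2) * sqrt(1 / nA + 1 / nB))
print(f"pooled s = {sqrt(sp2):.3f}, t0 = {t0:.3f} on {nA + nB - 2} d.o.f.")
```

The computed t0 would be compared against the Student t percentage point with $n_A + n_B - 2 = 18$ degrees of freedom at the chosen $\alpha$.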
Comparing Several Treatment Means via ANOVA
In many cases, we are interested in comparing several treatments simultaneously. We can generalize the approach discussed above to examine if the observed differences in treatment means are indeed significant, or could have occurred by chance (through random sampling of the same underlying population).

A picture helps explain what we are seeking to accomplish. As shown in Fig. 5, the population distribution for each treatment is shown; the mean can differ because the treatment is shifted, while we assume that the population variance is fixed. The sampling distribution for each treatment, on the other hand, may also be different if a different number of samples is drawn from each treatment (an unbalanced experiment), or because the treatment is in fact shifted. In most of the analyses that follow, we will assume balanced experiments; analysis of the unbalanced case can also be performed (7). It remains important to recognize that particular sample values from the sampling distributions are what we measure. In essence, we must compare the variance between two groups (a measure of the potential shift between treatments) with the variance within each group (a measure of the sampling variance). Only if the shift is large compared to the sampling
variance are we confident that a true effect is in place. An appropriate sample size must therefore be chosen using methods previously described such that the experiment is powerful enough to detect differences between the treatments that are of engineering importance. In the following, we discuss the basic methods used to analyze the results of such an experiment (8).

First, we need an estimate of the within-group variation. Here we again assume that each group is normally distributed and shares a common variance $\sigma^2$. Then we can form the sum of squared deviations $SS_t$ within the $t$th group as

$$SS_t = \sum_{j=1}^{n_t}(y_{tj} - \bar{y}_t)^2 \qquad (37)$$

where $n_t$ is the number of observations or samples in group $t$. The estimate of the sample variance for each group, also referred to as the mean square, is thus

$$s_t^2 = \frac{SS_t}{\nu_t} = \frac{SS_t}{n_t - 1} \qquad (38)$$

where $\nu_t$ is the degrees of freedom in treatment $t$. We can generalize the pooling procedure used earlier to estimate a common shared variance across $k$ treatments as

$$s_R^2 = \frac{\nu_1 s_1^2 + \nu_2 s_2^2 + \cdots + \nu_k s_k^2}{\nu_1 + \nu_2 + \cdots + \nu_k} = \frac{SS_R}{N - k} = \frac{SS_R}{\nu_R} \qquad (39)$$

where $s_R^2$ is defined as the within-treatment or within-group mean square and $N$ is the total number of measurements, $N = n_1 + n_2 + \cdots + n_k$, or simply $N = nk$ if all treatments consist of the same number of samples, $n = n_1 = n_2 = \cdots = n_k$.
In the second step, we want an estimate of between-group variation. We will ultimately be testing the hypothesis $H_0$: $\mu_1 = \mu_2 = \cdots = \mu_k$. The estimate of the between-group variance is

$$s_T^2 = \frac{\displaystyle\sum_{t=1}^{k} n_t(\bar{y}_t - \bar{y})^2}{k - 1} = \frac{SS_T}{\nu_T} \qquad (40)$$

where $s_T^2$ is defined as the between-treatment mean square, $\nu_T = k - 1$, and $\bar{y}$ is the overall grand mean.

We are now in position to ask our key question: Are the treatments different? If they are indeed different, then the between-group variance will be larger than the within-group variance. If the treatments are the same, then the between-group variance should be the same as the within-group variance. If in fact the treatments are different, we thus find that

$$s_T^2 \text{ estimates } \sigma^2 + \frac{\displaystyle\sum_{t=1}^{k} n_t\tau_t^2}{k - 1} \qquad (41)$$

where $\tau_t = \mu_t - \mu$ is treatment $t$'s effect. That is, $s_T^2$ is inflated by some factor related to the difference between treatments. We can perform a formal statistical test for treatment significance. Specifically, we should consider the evidence to be strong for the treatments being different if the ratio $s_T^2/s_R^2$ is significantly larger than 1. Under our assumptions, this should be evaluated using the F distribution, since $s_T^2/s_R^2 \sim F_{k-1,\,N-k}$.
We can also express the total variation (total deviation sum of squares $SS_D$ from the grand mean $\bar{y}$) observed in the data as

$$SS_D = \sum_{t=1}^{k}\sum_{i=1}^{n_t}(y_{ti} - \bar{y})^2, \qquad s_D^2 = \frac{SS_D}{\nu_D} = \frac{SS_D}{N - 1} \qquad (42)$$

where $s_D^2$ is recognized as the variance in the data.
Analysis of variance results are usually expressed in tabular form, such as shown in Table 3. In addition to compactly summarizing the sum of squares due to various components and the degrees of freedom, the appropriate F ratio is shown. The last column of the table usually contains the probability of observing the stated F ratio under the null hypothesis. The alternative hypothesis --- that one or more
Figure 5. Pictorial representation of multiple treatment experimental analysis. The underlying population for each treatment may be different, while the variance for each treatment population is assumed to be constant. The dark circles indicate the sample values drawn from these populations (note that the number of samples may not be the same in each treatment).
treatments have a mean different than the others --- can be accepted with $100(1-p)\%$ confidence.
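A sketch of this one-way ANOVA computation, following Eqs. (37)-(41) on hypothetical data for k = 3 treatments (the samples are illustrative only):

```python
from statistics import mean

# Hypothetical wafer measurements for three treatments -- illustrative only
groups = {
    "A": [500, 505, 498, 503],
    "B": [510, 512, 507, 511],
    "C": [501, 499, 504, 500],
}
k = len(groups)
N = sum(len(g) for g in groups.values())
grand = mean(v for g in groups.values() for v in g)   # overall grand mean

# Between-treatment sum of squares, Eq. (40), and within (residual), Eq. (39)
ss_t = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
ss_r = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
s2_t = ss_t / (k - 1)          # between-treatment mean square
s2_r = ss_r / (N - k)          # within-treatment mean square
print(f"F = {s2_t / s2_r:.2f} on ({k - 1}, {N - k}) degrees of freedom")
```

The resulting F ratio would be compared against the $F_{k-1,\,N-k}$ percentage point at the chosen significance level to complete the table.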
As summarized in Table 3, analysis of variance can also be pictured as a decomposition of the variance observed in the data. That is, we can express the total sum of squared deviations from the grand mean as $SS_D = SS_T + SS_R$, or the between-group sum of squares added to the within-group sum of squares. We have presented the results here around the grand mean. In some cases, an alternative presentation is used where a total sum of squares $SS = y_1^2 + y_2^2 + \cdots + y_N^2$ is used. In this case the total sum of squares also includes variation due to the average, $SS_A = N\bar{y}^2$, so that $SS = SS_A + SS_D = SS_A + SS_T + SS_R$. In Table 3, $k$ is the number of treatments and $N$ is the total number of observations. The F ratio value is $s_T^2/s_R^2$, which is compared against the F ratio for the degrees of freedom $\nu_T$ and $\nu_R$. The $p$ value shown in such a table is $\Pr(F_{\nu_T,\nu_R} > s_T^2/s_R^2)$.
While often not explicitly stated, the above ANOVA assumes a mathematical model:

$$y_{ti} = \mu_t + \epsilon_{ti} = \mu + \tau_t + \epsilon_{ti} \qquad (43)$$

where $\mu_t$ are the treatment means, and $\epsilon_{ti}$ are the residuals:

$$\epsilon_{ti} = y_{ti} - \hat{y}_{ti}, \qquad \epsilon_{ti} \sim N(0, \sigma^2) \qquad (44)$$

where $\hat{y}_{ti} = \bar{y}_t$ is the estimated treatment mean. It is
critical that one check the resulting ANOVA model. First, the residuals should be plotted against the time order in which the experiments were performed in an attempt to distinguish any time trends. While it is possible to randomize against such trends, we lose resolving power if the trend is large. Second, one should examine the distribution of the residuals. This is to check the assumption that the residuals are random [that is, independent and identically distributed (IID) and normally distributed with zero mean] and look for gross non-normality. This check should also include an examination of the residuals for each treatment group. Third, one should plot the residuals versus the estimates $\hat{y}_{ti}$ and be especially alert to dependencies on the size of the estimate (e.g., proportional versus absolute errors). Finally, one should also plot the residuals against any other variables of interest, such as environmental factors that may have been recorded. If unusual behavior is noted in any of these steps, additional measures should be taken to stabilize the variance (e.g., by considering transformations of the variables or by reexamining the experiment for other factors that may need to be either blocked against or otherwise included in the experiment).
Two-Way Analysis of Variance
Suppose we are seeking to determine if various treatments are important in determining an output effect, but we must conduct our experiment in such a way that another variable (which may also impact the output) must also vary. For example, suppose we want to study two treatments A and B but must conduct the experiments on five different process tools (tools 1-5). In this case, we must carefully design the experiment to block against the influence of the process tool factor. We now have an assumed model:

$$y_{ti} = \mu + \tau_t + \beta_i + \epsilon_{ti} \qquad (45)$$

where $\beta_i$ are the block effects. The total sum of squares $SS$ can now be decomposed as

$$SS = SS_A + SS_B + SS_T + SS_R \qquad (46)$$

with degrees of freedom

$$bk = 1 + (b-1) + (k-1) + (b-1)(k-1) \qquad (47)$$

where $b$ is the number of blocking groups, $k$ is the number of treatment groups, and $N = bk$. As before, if the blocks or treatments do in fact include any mean shifts, then the corresponding mean sums of squares (estimates of the corresponding variances) will again be inflated beyond the population variance (assuming the number of samples at each treatment is equal):

$$s_B^2 \text{ estimates } \sigma^2 + \frac{k\displaystyle\sum_{i=1}^{b}\beta_i^2}{b - 1}, \qquad s_T^2 \text{ estimates } \sigma^2 + \frac{\displaystyle\sum_{t=1}^{k}n_t\tau_t^2}{k - 1} \qquad (48)$$
Table 3: Structure of the Analysis of Variance Table, for Single Factor (Treatment) Case

Source of Variation        | Sum of Squares     | Degrees of Freedom         | Mean Square | F ratio     | Pr(F)
Between treatments         | SS_T               | nu_T = k - 1               | s_T^2       | s_T^2/s_R^2 | p
Within treatments          | SS_R               | nu_R = N - k               | s_R^2       |             |
Total about the grand mean | SS_D = SS_T + SS_R | nu_D = nu_T + nu_R = N - 1 | s_D^2       |             |
where we assume that $k$ samples are observed for each of the $b$ blocks. So again, we can now test the significance of these potentially inflated variances against the pooled estimate $s_R^2$ of the variance with the appropriate F test as summarized in Table 4. Note that in Table 4 we have shown the ANOVA table in the non-mean-centered or uncorrected format, so that $SS_A$ due to the average is shown. The table can be converted to the mean corrected form as in Table 3 by removing the $SS_A$ row, and replacing the total $SS$ with the mean corrected total $SS_D = SS - SS_A$ with $N - 1$ degrees of freedom.
Two-Way Factorial Designs
While the above is expressed with the terminology of the second factor being considered a blocking factor, precisely the same analysis pertains if two factors are simultaneously considered in the experiment. In this case, the blocking groups are the different levels of one factor, and the treatment groups are the levels of the other factor. The assumed analysis of variance above is with the simple additive model (that is, assuming that there are no interactions between the blocks and treatments, or between the two factors). In the blocked experiment, the intent of the blocking factor was to isolate a known (or suspected) source of contamination in the data, so that the precision of the experiment can be improved.
We can remove two of these assumptions in our experiment if we so desire. First, we can treat both variables as equally legitimate factors whose effects we wish to identify or explore. Second, we can explicitly design the experiment and perform the analysis to investigate interactions between the two factors. In this case, the model becomes

$$y_{tij} = \mu_{ti} + \epsilon_{tij} \qquad (49)$$

where $\mu_{ti}$ is the effect that depends on both factors simultaneously. The output can also be expressed as

$$\mu_{ti} = \mu + \tau_t + \beta_i + \omega_{ti} \qquad (50)$$

where $\mu$ is the overall grand mean, $\tau_t$ and $\beta_i$ are the main effects, and $\omega_{ti}$ are the interaction effects. In this case, the subscripts are

t = 1, 2, ..., k, where k is the number of levels of the first factor
i = 1, 2, ..., b, where b is the number of levels of the second factor
j = 1, 2, ..., m, where m is the number of replicates at the (t, i) factor combination

The resulting ANOVA table will be familiar; the one key addition is explicit consideration of the interaction sum of squares and mean square. The variance captured in this component can be compared to the within-group variance as before, and be used as a measure of significance for the interactions, as shown in Table 5.
Another metric often used to assess goodness of fit of a model is the $R^2$. The fundamental question answered by $R^2$ is how much better the model does than simply using the grand average:

$$R^2 = \frac{SS_M}{SS_D} = \frac{SS_T + SS_B + SS_I}{SS_D} \qquad (51)$$

where $SS_D$ is the sum of squared deviations around the grand average, and $SS_M$ is the variation captured by the model consisting of experimental factors ($SS_T$ and $SS_B$) and interactions ($SS_I$). An $R^2$ value near zero indicates that most of the variance is explained by residuals ($SS_R = SS_D - SS_M$) rather than by the model terms ($SS_M$), while an $R^2$ value near 1 indicates that the model sum of square terms capture nearly all of the observed variation in
Table 4: Structure of the Analysis of Variance Table, for the Case of a Treatment with a Blocking Factor

Source of Variation         | Sum of Squares                         | Degrees of Freedom | Mean Square | F ratio     | Pr(F)
Average (correction factor) | SS_A = bk*ybar^2                       | 1                  |             |             |
Between blocks              | SS_B = k*sum_{i=1..b}(ybar_i - ybar)^2 | nu_B = b - 1       | s_B^2       | s_B^2/s_R^2 | p_B
Between treatments          | SS_T = b*sum_{t=1..k}(ybar_t - ybar)^2 | nu_T = k - 1       | s_T^2       | s_T^2/s_R^2 | p_T
Residuals                   | SS_R                                   | nu_R = (b-1)(k-1)  | s_R^2       |             |
Total                       | SS                                     | N = bk             |             |             |
the data. It is clear that a more sophisticated model with additional model terms will increase $SS_M$, and thus an apparent improvement in explanatory power may result from adding model terms. An alternative metric is the adjusted $R^2$, where a penalty is added for the use of degrees of freedom in the model:

$$\text{Adj. } R^2 = 1 - \frac{SS_R/\nu_R}{SS_D/\nu_D} = 1 - \frac{s_R^2}{s_D^2} = 1 - \frac{\text{Mean Sq. of Residual}}{\text{Mean Sq. of Total}} \qquad (52)$$

which is more easily interpreted as the fraction of the variance that is not explained by the residuals ($s_R^2$). An important issue, however, is that variance may appear to be explained when the model in fact does not fit the population. One should formally test for lack of fit (as described in the regression modeling section to follow) before reporting $R^2$, since $R^2$ is only a meaningful measure if there is no lack of fit.
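A minimal sketch of Eqs. (51) and (52), with hypothetical sums of squares (the numbers are illustrative only):

```python
# Hypothetical ANOVA sums of squares and degrees of freedom -- illustrative only
ss_d, nu_d = 250.0, 15   # total (mean-corrected) sum of squares, nu_D = N - 1
ss_r, nu_r = 40.0, 9     # residual sum of squares and its degrees of freedom

r2 = 1 - ss_r / ss_d                        # Eq. (51), since SS_M = SS_D - SS_R
adj_r2 = 1 - (ss_r / nu_r) / (ss_d / nu_d)  # Eq. (52): penalize model d.o.f.
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

Note that the adjusted value is lower than the raw $R^2$, reflecting the degrees of freedom consumed by the model terms.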
Several mnemonics within the factorial design of experiments methodology facilitate the rapid or manual estimation of main effects, as well as interaction effects (8). Elements of the methodology include (a) assignment of high (+) and low (-) values for the variables, (b) coding of the experimental combinations in terms of these high and low levels, (c) randomization of the experimental runs (as always!), and (d) estimation of effects by attention to the experimental design table combinations.
For example, one might perform a $2^3$ full factorial design [where the superscript indicates the number of factors (three in this case), and the base indicates the number of levels for each factor (two in this case)], where the factors are labeled A, B, and C. The unique eight combinations of these factors can be summarized as in Table 6, with the resulting measured results added during the course of the experiment. The main effect of a factor can be estimated by taking the difference between the average of the + levels for that factor and the average of the - levels for that factor --- for example, $\text{Effect}_A = \bar{y}_{A+} - \bar{y}_{A-}$, and similarly for the other main effects. Two-level interactions can be found in an analogous fashion,

$$\text{Interaction}_{AB} = \tfrac{1}{2}\left(\text{Effect}_{A(B+)} - \text{Effect}_{A(B-)}\right)$$

where one takes the difference between one-factor averages at the high and low values of the second factor. Simple methods are also available in the full factorial case for estimation of factor effect sampling variances, when replicate runs have been performed. In the simple case above where only a single run is performed at each experimental point, there
are no simple estimates of the underlying process ormeasurement variance, and so assessment of significance isnot possible. If, however, one performs replicates at the
ith experimental condition, one can pool the individual
estimates of variance at each of the experimental
conditions to gain an overall variance estimate (8):
(53)
where are the degrees of freedom at condition i,
and g is the total number of experimental conditionsexamined.
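The effect and interaction estimates above can be sketched in a few lines of Python (the design matrix follows the Table 6 coding; the measured results y are hypothetical values invented for illustration):

```python
import numpy as np
from itertools import product

# Coded 2^3 design matrix; rows follow the Table 6 ordering.
X = np.array(list(product([-1, +1], repeat=3)))  # columns A, B, C
A, B = X[:, 0], X[:, 1]

# Hypothetical measured results for the eight runs (one per condition).
y = np.array([60.0, 72.0, 54.0, 68.0, 52.0, 83.0, 45.0, 80.0])

# Main effect of A: mean at the + level minus mean at the - level.
effect_A = y[A == +1].mean() - y[A == -1].mean()
# Equivalently, via the contrast: effect_A = (A @ y) / 4.

# AB interaction: half the difference between the A effect at B+
# and the A effect at B-.
eff_A_Bhi = y[(A == +1) & (B == +1)].mean() - y[(A == -1) & (B == +1)].mean()
eff_A_Blo = y[(A == +1) & (B == -1)].mean() - y[(A == -1) & (B == -1)].mean()
interaction_AB = 0.5 * (eff_A_Bhi - eff_A_Blo)
```

With these invented data the A effect is 1.5 and the AB interaction happens to be zero; with real data, analysis of variance is still needed to judge whether such estimates are significant.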
The sampling variance for an effect estimate can then be calculated; in our previous example we might perform two runs at each of the eight experimental points, so that each of the averages ȳ_A+ and ȳ_A− is taken over eight observations, and

Var{Effect_A} = Var{ȳ_A+} + Var{ȳ_A−} = s²/8 + s²/8 = s²/4   (54)
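A sketch of the pooled variance estimate of Eq. (53) and the effect sampling variance of Eq. (54), using hypothetical replicate data (two runs at each of the eight conditions, so each ν_i = 1):

```python
import numpy as np

# Hypothetical replicate measurements: two runs at each of g = 8
# conditions of the 2^3 design.
reps = np.array([
    [60.1, 59.5], [71.8, 72.4], [54.0, 53.2], [67.9, 68.5],
    [52.2, 51.6], [82.7, 83.5], [45.1, 44.7], [79.6, 80.2],
])

g = reps.shape[0]                   # number of experimental conditions
m_i = reps.shape[1]                 # replicates per condition
nu_i = m_i - 1                      # degrees of freedom per condition
s_i2 = reps.var(axis=1, ddof=1)     # per-condition variance estimates

# Pooled variance, Eq. (53): each s_i^2 weighted by its d.o.f.
# (here all nu_i are equal, so this is just the mean of the s_i^2).
s2 = np.sum(nu_i * s_i2) / (g * nu_i)

# Sampling variance of a main effect, Eq. (54): each of ybar_A+ and
# ybar_A- averages g*m_i/2 = 8 observations.
n_avg = g * m_i // 2
var_effect = s2 / n_avg + s2 / n_avg   # = s^2/4 here
```

The square root of var_effect is the standard error against which an estimated effect would be compared in a t-test.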
Table 5: Structure of the Analysis of Variance Table, for Two-factor Case with Interaction Between Factors

Source of Variation        | Sum of Squares                       | Degrees of Freedom   | Mean Square | F ratio   | Pr(F)
Between levels of factor 1 | SS_T = bm Σ_{t=1}^{k} (ȳ_t − ȳ)²     | ν_T = k − 1          | s_T²        | s_T²/s_E² | p_T
Between levels of factor 2 | SS_B = km Σ_{i=1}^{b} (ȳ_i − ȳ)²     | ν_B = b − 1          | s_B²        | s_B²/s_E² | p_B
Interaction                | SS_I                                 | ν_I = (k − 1)(b − 1) | s_I²        | s_I²/s_E² | p_I
Within groups (error)      | SS_E                                 | ν_E = bk(m − 1)      | s_E²        |           |
Total (mean corrected)     | SS_D                                 | ν_D = bkm − 1        | s_D²        |           |

(Here k and b are the numbers of levels of factors 1 and 2, respectively, and m is the number of replicates at each factor combination.)
These methods can be helpful for rapid estimation of experimental results and for building intuition about contrasts in experimental designs; however, statistical packages provide the added benefit of assisting not only in quantifying factor effects and interactions, but also in examination of the significance of these effects and creation of confidence intervals on estimates of factor effects and interactions. Qualitative tools for examining effects include Pareto charts to compare and identify the most important effects, and normal or half-normal probability plots to determine which effects are significant.
In the discussion so far, the experimental factor levels are treated as either nominal or continuous parameters. An important issue is the estimation of the effect of a particular factor, and determination of the significance of any observed effect. Such results are often pictured graphically in a succinct fashion, as illustrated in Fig. 6 for a two-factor experiment, where two levels for each of factor A and factor B are examined.

In the full factorial case, interactions can also be explored, and the effects plots modified to show the effect of each factor on the output parameter of concern (yield in this case), but at different levels of the other factor. Various cases may result; as shown in Fig. 7, no interaction may be observed, or a synergistic (or anti-synergistic) interaction may be present. These analyses are applicable, for example, when the factor levels are discrete or nominal decisions to be made; perhaps level (+) for factor A is to perform a clean step while (−) is to omit the step, and level (+) for factor B is to use one chemical in the step while level (−) is to use a different chemical.
Nested Variance Structures
While blocking factors may seem somewhat esoteric or indicative of an imprecise experiment, it is important to realize that blocking factors do in fact arise extremely frequently in semiconductor manufacturing. Indeed, most experiments or sets of data will actually be taken under situations where great care must be taken in the analysis of variance. In effect, such blocking factors arise due to nested variance structures in typical semiconductor manufacturing which restrict the full randomization of an experimental design (9). For example, if one samples die from multiple wafers, it must be recognized that those die reside within different wafers; thus the wafer is itself a blocking factor and must be accounted for.
For example, consider multiple measurements of oxide film thickness across a wafer following oxide deposition. One might expect that measurements from the same wafer would be more similar to each other than those across multiple wafers; this would correspond to the case where the within-wafer uniformity is better than the wafer-to-wafer uniformity. On the other hand, one might also find that the measurements from the corresponding sites on each wafer (e.g., near the lower left edge of the wafer) are more similar to each other than are the different sites across the same wafer; this would correspond to the case where wafer-to-
Table 6: Full Factorial 2³ Experimental Design (with Coded Factor Levels)

Experiment Condition Number | Factor A | Factor B | Factor C | Measured Result
1 | − | − | − |
2 | − | − | + |
3 | − | + | − |
4 | − | + | + |
5 | + | − | − |
6 | + | − | + |
7 | + | + | − |
8 | + | + | + |
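The coded combinations of Table 6 can be generated mechanically; a minimal sketch using the standard library:

```python
from itertools import product

# Coded factor levels for a 2^3 full factorial design (as in Table 6):
# each of factors A, B, C takes the low (-1) and high (+1) level.
design = list(product([-1, +1], repeat=3))  # 8 unique (A, B, C) runs

for run, (a, b, c) in enumerate(design, start=1):
    print(run, a, b, c)
```

Note that this enumerates the unique conditions only; in practice the runs themselves should be executed in randomized order, as emphasized above.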
Figure 6. Main effects plot for two-factor, two-level experiment. The influence of Factor A on yield is larger than that of Factor B. Analysis of variance is required in order to determine if the observed results are significant (and not the result of chance variation). [Two panels: yield versus the (−) and (+) levels of Factor A and of Factor B.]
Figure 7. Interaction plot. In case 1, no clear interaction is observed: Factor B does not appear to change the effect of factor A on the output. Rather, the effects from factor A and factor B appear to be additive. In case 2, the level of factor B does influence strongly the response of yield to the high level of factor A. [Two panels: yield versus the (−) and (+) levels of Factor A, with separate curves for the B+ and B− levels.]
wafer repeatability is very good, but within-wafer uniformity may be poor. In order to model the important aspects of the process and take the correct improvement actions, it will be important to be able to distinguish between such cases and clearly identify where the components of variation are coming from.
In this section, we consider the situation where we believe that multiple site measurements within the wafer can be treated as independent and identical samples. This is almost never the case in reality, and the values of within-wafer variance that result are not true measures of wafer variance, but rather only of the variation across those (typically fixed or preprogrammed) sites measured. In our analysis, we are most concerned that the wafer is acting as a blocking factor, as shown in Fig. 8. That is, we first consider the case where we find that the five measurements we take on the wafer are relatively similar, but the wafer-to-wafer average of these values varies dramatically. This two-level variance structure can be described as (9):
y_ij = μ + W_i + M_j(i)   (55)

where n_M is the number of measurements taken on each of n_W wafers, and W_i ~ N(0, σ_W²) for i = 1…n_W and M_j(i) ~ N(0, σ_M²) for j = 1…n_M are independent normal random variables drawn from the distributions of wafer-to-wafer variations and of measurements taken within the ith wafer, respectively. In this case, the total variation in oxide thickness is composed of both within-wafer variance σ_M² and wafer-to-wafer variance σ_W²:

σ_T² = σ_W² + σ_M²   (56)
Note, however, that in many cases (e.g., for control charting), one is not interested in the total range of all individual measurements; rather, one may desire to understand how the set of wafer means itself varies. That is, one is seeking to estimate

σ_W̄² = σ_W² + σ_M²/n_M   (57)

where the overbar indicates averages over the measurements within any one wafer.
Substantial care must be taken in estimating these variances; in particular, one can directly estimate the measurement variance σ_M² and the wafer-average variance σ_W̄², but must infer the wafer-level variance using

σ_W² = σ_W̄² − σ_M²/n_M   (58)
The within-wafer variance is most clearly understood as the average over the n_W available wafers of the variance s_i² within each of those i wafers:

s_M² = (1/n_W) Σ_{i=1}^{n_W} s_i² = (1/n_W) Σ_{i=1}^{n_W} [ Σ_{j=1}^{n_M} (Y_ij − Ȳ_i)² / (n_M − 1) ]   (59)

where Ȳ_i indicates an average over the j index (that is, a within-wafer average):

Ȳ_i = (1/n_M) Σ_{j=1}^{n_M} Y_ij   (60)
The overall variance in wafer averages can be estimated simply as

s_W̄² = (1/(n_W − 1)) Σ_{i=1}^{n_W} (Ȳ_i − Ȳ)²   (61)

where Ȳ is the grand mean over all measurements:

Ȳ = (1/(n_W n_M)) Σ_{i=1}^{n_W} Σ_{j=1}^{n_M} Y_ij   (62)
Thus, the wafer-level variance can finally be estimated as

s_W² = s_W̄² − s_M²/n_M   (63)
The same approach can be used for more deeply nested variance structures (9,10). For example, a common structure occurring in semiconductor manufacturing is measurements within wafers, and wafers within lots. Confidence limits can also be established for these estimates of variance (11). The computation of such estimates becomes substantially complicated, however (especially if the data are unbalanced, with different numbers of measurements per sample at each nested level), and statistical software packages are the best option.
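A sketch of the two-level estimation procedure of Eqs. (59)-(63) on simulated data (the true variance components and sample sizes are assumptions chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated oxide thickness following Eq. (55): n_W wafers, n_M sites
# per wafer, with true sigma_W = 10 A (wafer level) and sigma_M = 3 A
# (within wafer) around a 1030 A mean.
n_W, n_M = 200, 5
W = rng.normal(0.0, 10.0, size=(n_W, 1))               # wafer effects W_i
Y = 1030.0 + W + rng.normal(0.0, 3.0, size=(n_W, n_M))  # Y_ij

s_M2 = Y.var(axis=1, ddof=1).mean()     # within-wafer variance, Eq. (59)
s_Wbar2 = Y.mean(axis=1).var(ddof=1)    # variance of wafer means, Eq. (61)
s_W2 = s_Wbar2 - s_M2 / n_M             # inferred wafer-level var, Eq. (63)
```

With these settings, s_M2 should land near the true 9 Å² and s_W2 near the true 100 Å², illustrating that the raw variance of wafer means (s_Wbar2) overstates the wafer-level component until the s_M2/n_M correction is applied.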
Figure 8. Nested variance structure. Oxide thickness variation consists of both within-wafer and wafer-to-wafer components. [Plot: oxide thickness (Å), roughly 1010 Å to 1060 Å, versus wafer number, 1 through 6, with several site measurements per wafer.]
Several assumptions are made in the analysis of variance components for the nested structures above. Perhaps the most important is an assumption of random sampling within each level of nesting. For example, we assume that each measurement (within each wafer) is IID and a random sample from within the wafer. If the same measurement points are taken on each wafer, however, one is not in fact truly estimating the within-wafer variation, but rather the fixed-effect variance between these measurement points. Systematic spatial patterns, e.g., bowl, donut, or slanted-plane patterns across the wafer, often occur due to gas flow, pressure, or other equipment nonuniformities. For this reason, it is common practice to use a spatially consistent five-point (or 21-point or 49-point) sampling scheme when making measurements within a wafer. An option which adds complexity but also adds precision is to model each of these sites separately (e.g., maintain left, right, top, bottom, and center points) and consider how these compare with other points within the wafer, as well as from wafer to wafer. Great care is required in such site modeling approaches, however, because one must account for the respective variances at multiple levels appropriately in order to avoid biased estimates (12-14).
Experimental designs that include nested variance sampling plans are also sometimes referred to as split-plot designs, in which a factorial design in fact has restrictions on randomization (7,15). Among the most common restrictions are those due to spatial factors, and spatial modeling likewise requires great care (16). Other constraints of the real world, such as hardware factors, may make complete randomization infeasible due to the time and cost of installing/removing hardware (e.g., in studying alternative gas distribution plates in a plasma reactor). Methods exist for handling such constraints (split-plot analysis), but the analysis cannot be done if the experimental sampling plan does not follow an appropriate split-plot design. Using split-plot analyses, however, we can resolve the components of variation (in the case of sites within wafers) into residual, site, wafer, and wafer-site interaction components, as well as the effects of the treatment under consideration. For these reasons, it can be expected that nested variance structures and split-plot designs will receive even greater future attention and application in semiconductor manufacturing.
Progression of Experimental Designs
It is worth considering when and how various experimental design approaches might best be used. When confronted with a new problem which lacks thorough preexisting knowledge, the first step should be screening experiments which seek to identify what the important variables are. At this stage, only crude predictions of experimental effects as discussed above are needed, but often a large number of candidate factors (often six or more) are of potential interest. By sacrificing accuracy and certainty in interpretation of the results (primarily by allowing interactions to confound with other interactions or even with main effects), one can often gain a great deal of initial knowledge at reasonable cost. In these cases, fractional factorial, Plackett-Burman, and other designs may be used.

Once a smaller number of effects have been identified, full factorial or fractional factorial designs are often utilized, together with simplified linear model construction and analysis of variance. In such cases, the sampling plan must again be carefully considered in order to ensure that sufficient data is taken to draw valid conclusions. It is often possible to test for model lack of fit, which may indicate that more thorough experiments are needed or that additional experimental design points should be added to the existing experimental data (e.g., to complete half fractions). The third phase is then undertaken, which involves experimental design with small numbers of factors (e.g., two to six) to support linear effects, interactions, and second-order (quadratic) model terms. These regression models will be considered in the next section.
A variety of sophisticated experimental design methods are available and applicable to particular problems. In addition to factorial and optimal design methods (8,10), robust design approaches (as popularized by Taguchi) are helpful, particularly when the goal is to aid in the optimization of the process (17,18).
RESPONSE SURFACE METHODS
In the previous section, we considered the analysis of variance, first in the case of single treatments, and then in the case when blocking factors must also be considered. These were generalized to consideration of two-factor experiments, where the interaction between these factors can also be considered. If the factors can take on continuous values, the above analysis is still applicable. However, in these cases, it is often more convenient and useful to consider or seek to model (within the range of factor levels considered) an entire response surface for the parameter of interest. Specifically, we wish to move from our factorial experiments with an assumed model of the form

y_ij = μ + A_i + B_j + ε_ij   (64)

where we can only predict results at discrete prescribed i, j levels of factors A and B, toward a new model of the process of the form

y = β₀ + β₁x⁽¹⁾ + β₂x⁽²⁾ + ⋯ + ε   (65)

where each x⁽ʲ⁾ ∈ [x⁽ʲ⁾_min, x⁽ʲ⁾_max] is a particular factor of interest.

In this section, we briefly summarize the methods for estimation of the factor response coefficients β_j, as well as
for analysis of the significance of such effects based on experimental design data. We begin with a simple one-parameter model, and build complexity and capability from there.
Single Variable Least-squares Regression
The standard approach used here is least-squares regression to estimate the coefficients in regression models. In the simple one-parameter case considered here, our actual response is modeled as

y_i = βx_i + ε_i   (66)

where y_i indicates the ith measurement, taken at a value x_i for the explanatory variable x. The estimate for the output is thus simply ŷ_i = bx_i, where we expect some residual error e_i = y_i − ŷ_i. Least-squares regression finds the best fit of our model to the data, where "best" is that b which minimizes the sum of squared errors between the prediction and the observed n data values:

min_b Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − bx_i)²   (67)

It can easily be shown that for linear problems, a direct solution for b which gives the minimum is possible, and it will occur when the vector of residuals is normal to the vector of x values:

b = (Σ_{i=1}^{n} x_i y_i) / (Σ_{i=1}^{n} x_i²)   (68)
An important issue is the estimate of the experimental error. If we assume that the model structure is adequate, we can form an estimate of σ² simply as

s² = (Σ_{i=1}^{n} e_i²) / (n − 1)   (69)

One may also be interested in the precision of the estimate b; that is, the variance in b:

Var{b} = σ² / (Σ_{i=1}^{n} x_i²)   (70)

assuming that the residuals are independent and identically normally distributed with mean zero. More commonly, one refers to the standard error SE(b) = s / (Σ x_i²)^{1/2}, and writes b ± SE(b). In a similar fashion, a confidence interval for our estimate of β can be defined by noting that the standardized value for b should be t-distributed:

(b − β) / SE(b) ~ t_{n−1}   (71)

where β is the true value for the coefficient, so that

b − t_{α/2, n−1} SE(b) ≤ β ≤ b + t_{α/2, n−1} SE(b)   (72)
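The single-variable estimates of Eqs. (66)-(72) can be sketched directly (the x, y data are hypothetical, and the t critical value is read from a t table rather than computed):

```python
import numpy as np

# Hypothetical data for the one-parameter (no-intercept) model
# y = beta * x + eps, Eq. (66).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])
n = len(y)

b = np.sum(x * y) / np.sum(x * x)     # least-squares estimate, Eq. (68)
resid = y - b * x                     # residuals e_i
s2 = np.sum(resid ** 2) / (n - 1)     # error variance estimate, Eq. (69)
se_b = np.sqrt(s2 / np.sum(x * x))    # standard error of b, Eq. (70)

# 95% confidence interval for beta, Eq. (72); the critical value
# t_{0.025, 5} = 2.571 is taken from a t table for n - 1 = 5 d.o.f.
t_crit = 2.571
ci = (b - t_crit * se_b, b + t_crit * se_b)
```

For these data b is close to 2, and the interval ci brackets the true slope; a statistical package would additionally report the t statistic b/SE(b) and its p-value.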
Regression results should also be framed in an analysis of variance framework. In the simple one-factor case, a simple ANOVA table might be as shown in Table 7. In this case, the model sum of squares is the sum of squared values of the estimates ŷ_i, and the corresponding mean square is an estimate of the variance explained by the model, where our model is purely linear (no intercept term) as given in Eq. (66). In order to test significance, we must compare the ratio of this value to the residual variance using the appropriate F test. In the case of a single variable, we note that the F test degenerates into the t test (F = t²), and a t-test can be used to evaluate the significance of the model coefficient.
In the analysis above, we have assumed that the values for x_i have been selected at random and are thus unlikely to be replicated. In many cases, it may be possible to repeat the experiment at particular values, and doing so gives us the opportunity to decompose the residual error into two contributions. The residual sum of squares can be broken into a component due to lack of fit and a component due to pure