LC•GC Europe 18(4) 215–218 (2005)
PRACTICAL DATA HANDLING
Robust Statistics
The box plot is based on robust statistics. Robust statistics are so called because they are more resistant (robust) to the presence of outliers than the classical statistics based on the normal distribution. Consider, for instance, the determination of the mean, one of the simplest statistical operations possible, and suppose the following series of results was obtained: 2, 3, 2, 4, 3, 4, 3. The mean of these data is 3. It describes one of the aspects of this data set, the so-called central tendency, in an adequate way. Now suppose an outlier is present, perhaps because an error was made or because one of the objects measured belongs to a different population from the other six. One of the 4s is replaced by an 11: 2, 3, 2, 4, 3, 11, 3. The mean is now 4, which is no longer a representative measure of the central tendency of the data set, since five of the seven values are smaller than 4 and only one is larger: the mean is not robust towards outlying observations. It can be shown that this is also true for the usual measure of dispersion or spread within the data, the standard deviation.
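The effect of the single outlier on the mean, and on the standard deviation, can be checked directly. The following is a minimal sketch using Python's standard statistics module; the variable names are illustrative only.

```python
from statistics import mean, stdev

clean = [2, 3, 2, 4, 3, 4, 3]
with_outlier = [2, 3, 2, 4, 3, 11, 3]  # one of the 4s replaced by an 11

mean_clean = mean(clean)
mean_outlier = mean(with_outlier)
sd_clean = stdev(clean)            # sample standard deviation (n - 1 denominator)
sd_outlier = stdev(with_outlier)

print(mean_clean, mean_outlier)                   # 3 4
print(round(sd_clean, 2), round(sd_outlier, 2))   # 0.82 3.16
```

One outlying value shifts the mean by a full unit and makes the standard deviation almost four times larger.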
The Median
Robust statistics replaces the mean by other measures of central tendency, and the most common of those is the median. When the number of data is odd, the median is the middle observation in a ranked series of data. When it is even, it is the mean of the two middle observations. In the first example of the preceding section the ranked data are: 2, 2, 3, 3, 3, 4, 4. There are seven data and, therefore, the fourth in the series is the median: it is equal to 3. The second example yields the ranked series 2, 2, 3, 3, 3, 4, 11. The median is still 3 and is, therefore, not affected by the outlier: it is robust towards the outlier. Even if the outlier were 100, the median would still be 3.
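The odd/even rule for the median can be written out in a few lines. This is only an illustrative sketch (the function name median is our own choice; Python's statistics.median applies the same rule).

```python
def median(values):
    """Median of a series: middle value if n is odd, mean of the two middle values if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


m_clean = median([2, 3, 2, 4, 3, 4, 3])
m_outlier = median([2, 3, 2, 4, 3, 11, 3])
m_far_outlier = median([2, 3, 2, 4, 3, 100, 3])

print(m_clean, m_outlier, m_far_outlier)  # 3 3 3
```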
Interquartile Range
The interquartile range, or IQR, is a robust way of describing the dispersion of the data. It is the range within which the middle 50% of the ranked data are found. This is also the range between what is called the lower quartile value and the upper quartile value. Let us consider a somewhat larger data set to show how the IQR is determined (see Table 1).
First the median is obtained. Since the number of data is even, the median is equal to the mean of the two middle values, 13.1 and 13.3 (i.e., the median is 13.2). The data are then split into an upper and a lower half. When the number of data is odd, which is not the situation in the example, the median value is included in both halves. The median of the first half is the lower quartile value (i.e., the value below which the 25% lowest values are found). Here it is the sixth value (i.e., 12.0). Likewise, the upper quartile value is the value above which the 25% highest values are found. It is equal to 14.5, and the IQR is the difference between the two quartile values (i.e., IQR = 14.5 − 12.0 = 2.5).
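The procedure just described (median first, then the medians of the lower and upper halves) can be sketched as follows for data set 1. The function names are our own, and statistical packages often use slightly different quartile conventions, so their results may differ a little from this median-of-halves rule.

```python
def median(ranked):
    """Median of an already ranked (sorted) list."""
    n, mid = len(ranked), len(ranked) // 2
    return ranked[mid] if n % 2 else (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    """Lower quartile, median and upper quartile by the median-of-halves rule."""
    ranked = sorted(values)
    n, half = len(ranked), len(ranked) // 2
    lower_half = ranked[:half + n % 2]   # for odd n the median belongs to both halves
    upper_half = ranked[half:]
    return median(lower_half), median(ranked), median(upper_half)


data_set_1 = [11.0, 11.2, 11.5, 11.6, 11.9, 12.0, 12.2, 12.8, 12.9, 12.9, 13.1,
              13.3, 13.4, 13.8, 13.9, 14.2, 14.5, 14.5, 14.6, 15.3, 15.5, 16.0]

q1, med, q3 = quartiles(data_set_1)
iqr = q3 - q1
print(q1, round(med, 1), q3, iqr)  # 12.0 13.2 14.5 2.5
```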
The Box and The Whiskers
The median and the IQR are used to construct the box. It has a height equal to the IQR and is drawn so that it starts at the lower quartile value and stops at the upper quartile value. A horizontal bar is drawn at the height of the median. In our example (Figure 1(a)) the box encompasses the values from 12.0 to 14.5, and the horizontal bar for the median is drawn at 13.2. The whiskers indicate the range of the data and they are usually represented as vertical lines ending in a small horizontal line. In our situation, they are drawn at 11.0 and 16.0.
A provision is also made for the representation of extreme values. An upper
Visual Presentation of Data by Means of Box Plots
D.L. Massart (a), J. Smeyers-Verbeke (a), X. Capron (a) and Karin Schlesier (b)
(a) Vrije Universiteit Brussel, Belgium
(b) Bundesinstitut für Risikobewertung (Federal Institute for Risk Assessment), Berlin, Germany
Data analysis should always start by (literally) looking at the data. An efficient way to do this is to use box and whisker plots, which, for short, are called box plots. All figures in this column are box plots, and Figures 2 to 4 are box plots for real data sets. In this column we will explain how to construct them and how they can help you to learn more about your data.
11.0 11.2 11.5 11.6 11.9 12.0 12.2 12.8 12.9 12.9 13.1
13.3 13.4 13.8 13.9 14.2 14.5 14.5 14.6 15.3 15.5 16.0
Table 1: Data set 1.
extreme value limit is computed as the upper quartile value + 1.5 × IQR and the lower extreme value limit as the lower quartile value − 1.5 × IQR. These limits are therefore 12.0 − 1.5 × 2.5 = 8.25 and 14.5 + 1.5 × 2.5 = 18.25. None of the data exceeds these values and, therefore, the box plot is left unchanged.
When the highest value, 16.0, is replaced by 19.0, which is quite high compared with the rest of the series (see Table 2), the highest value now exceeds 18.25 and is, therefore, considered extreme. The box plot emphasizes the presence of the extreme value by giving it a specific symbol, as shown in Figure 1(b). The box does not change, but the highest value within the extreme value limit is now 15.5 instead of 16.0, and the upper whisker is now drawn at this value.
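The extreme value limits, the whisker ends and the flagged values for data sets 1 and 2 can be reproduced with a short sketch. The helper names (quartiles, box_plot_stats) and the dictionary keys are our own choices, and the quartiles follow the median-of-halves rule described in the text.

```python
def median(ranked):
    """Median of an already ranked (sorted) list."""
    n, mid = len(ranked), len(ranked) // 2
    return ranked[mid] if n % 2 else (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(ranked):
    """Lower quartile, median, upper quartile (median-of-halves rule)."""
    n, half = len(ranked), len(ranked) // 2
    return median(ranked[:half + n % 2]), median(ranked), median(ranked[half:])


def box_plot_stats(values):
    """Limits at 1.5 x IQR beyond the quartiles; whiskers end at the last value inside them."""
    ranked = sorted(values)
    q1, med, q3 = quartiles(ranked)
    iqr = q3 - q1
    low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ranked if low_limit <= x <= high_limit]
    extremes = [x for x in ranked if x < low_limit or x > high_limit]
    return {"limits": (low_limit, high_limit),
            "whiskers": (inside[0], inside[-1]),
            "extremes": extremes}


data_set_1 = [11.0, 11.2, 11.5, 11.6, 11.9, 12.0, 12.2, 12.8, 12.9, 12.9, 13.1,
              13.3, 13.4, 13.8, 13.9, 14.2, 14.5, 14.5, 14.6, 15.3, 15.5, 16.0]
data_set_2 = data_set_1[:-1] + [19.0]   # highest value 16.0 replaced by 19.0

stats_1 = box_plot_stats(data_set_1)
stats_2 = box_plot_stats(data_set_2)
print(stats_1)  # limits (8.25, 18.25), whiskers (11.0, 16.0), no extremes
print(stats_2)  # limits (8.25, 18.25), whiskers (11.0, 15.5), extremes [19.0]
```

Because the quartiles depend only on the middle of the ranked series, replacing the highest value leaves the box and the limits unchanged; only the upper whisker and the list of extreme values differ.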
Using Box Plots
Box plots can be used in many ways and, to illustrate this, we will make use of data from a European Union funded project called Wine D(ata)B(ase). The aim of the project is to be able to authenticate the origin of wines from certain countries outside the European Union. For five such countries, 100 samples are taken each year according to a sampling plan that guarantees representativity as well as possible. Some of these samples are commercial wines; some are wines obtained from grapes collected by official oenological institutes at exactly known places and vinified according to known procedures (authentic wines). Initially close to 100 variables were measured for each sample, such as classical oenological parameters, macroelements, isotopic ratios, and inorganic and organic minor and trace compounds such as trace elements, biogenic amines, alcohols and esters. This yields very large databases, and advanced data analysis is needed. Many of the techniques used are, of course, multivariate, such as principal component analysis (PCA) [1], but in an initial phase it is also necessary to analyse each variable individually (i.e., in a univariate way). In what follows we will give examples of box plots obtained for these data. Since the results have not been published yet, we will identify neither the countries nor the variables.
Extreme Values (Outliers)
The most evident application is to identify samples with extreme characteristics. Figure 2 gives examples of box plots obtained for four countries and three variables. Figure 2(a) tells us that this variable is very well behaved: no single sample is considered extreme. In Figure 2(b) there are a few extreme values. Country B has one low value (sample 093), and country D has one low and one high value (samples 455 and 465, respectively). In Figure 2(c) we see a very different picture. There are now extreme values for all countries, a rather large one (sample 112) for country C, and clusters of extreme samples for countries B and D.
A question might be how conservative the box plot is: does it tend to find many or only few extreme values? For a normal distribution, the IQR is equal to about 1.35 standard deviations, so the extreme value limits lie about 2.7 standard deviations from the mean, which means that about 0.7% of all data coming from such a distribution would be considered extreme. This is of the same order as for the usual outlier test, where the recommended level would be 1%. Depending on the software, some box plots are more conservative in the sense that, in addition to the extreme level as computed above, they use a second extreme level, more distant from the box than the first one. This is the situation in Figure 3, where a distinction is made between close extreme values and far away extremes (the * symbol).
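As a rough numerical check of where the extreme value limits fall for a normal distribution, the following sketch uses Python's statistics.NormalDist on a standard normal distribution (mean 0, standard deviation 1). It describes the theoretical distribution only; in small samples the quartiles are themselves estimates, so the observed proportion of flagged values will vary.

```python
from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, standard deviation 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1                          # IQR in units of the standard deviation
upper_limit = q3 + 1.5 * iqr           # upper extreme value limit
fraction_extreme = 2 * (1 - z.cdf(upper_limit))  # both tails, by symmetry

print(round(iqr, 2))                     # 1.35
print(round(upper_limit, 2))             # 2.7
print(round(100 * fraction_extreme, 1))  # 0.7 (per cent)
```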
Classical outlier tests would not be effective for the data set described here.
Figure 1: Box plots for 22 data points (see text). (a) Without and (b) with one extreme value.
Figure 2: Box plots for wine samples collected in four countries (A–D) for three variables (a–c).
11.0 11.2 11.5 11.6 11.9 12.0 12.2 12.8 12.9 12.9 13.1
13.3 13.4 13.8 13.9 14.2 14.5 14.5 14.6 15.3 15.5 19.0
Table 2: Data set 2.
This should not be understood as a statement that outlier tests are never useful. For instance, when a laboratory repeatedly measures the same sample and one of the results appears to be far away from the rest, an outlier test can be useful to make a decision. The outlier test which is most accepted in the analytical community is Grubbs' test. It is the test recommended by ISO to investigate replicate determinations. There is a single Grubbs' test, used to decide on a statistical basis if one single result is an outlier, and there is a double Grubbs' test, which is used when there are two suspect values. While this test would probably perform correctly for the data of Figures 2(a) and (b), it is, however, clear that it cannot be used for the variable of Figure 2(c). For many sets of data, such as the one here, the visual analysis by box plots is to be preferred.
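For orientation, the statistic of the single Grubbs' test is the distance of the most suspect result from the mean, expressed in sample standard deviations. The sketch below computes only this statistic, for the seven-value series with the 11 used earlier in this column; the function name is our own, and the decision itself requires comparison with a tabulated critical value for the chosen significance level and number of results, which is not reproduced here.

```python
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return the most suspect value and G = |suspect - mean| / s (single Grubbs' test)."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 denominator)
    suspect = max(values, key=lambda x: abs(x - centre))
    return suspect, abs(suspect - centre) / spread


results = [2, 3, 2, 4, 3, 11, 3]
suspect, g = grubbs_statistic(results)
print(suspect, round(g, 2))  # 11 2.21
```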
Finding extreme values, and sometimes many extreme values, is the rule rather than the exception in data like this. Therefore, samples are not eliminated because one or two variables show outlying values, except when it is found that this is because of technical reasons, for instance malfunction of an instrument. A sample with uncharacteristic values may truly represent part of the population of samples and should not be deleted. This is also the reason why we prefer in this context the term "extreme value" to "outlier". Such samples should be marked as "suspect" or "extreme", but still be processed.
Skewed Distributions
Contrary to the implicit belief of many analytical chemists, data are often not normally distributed. Data with low values may, for instance, be more frequent than data with a high value, leading to skewed distributions with a long tail towards the high end of the distribution. Such distributions are often found for data sets consisting of environmental, food and other natural samples and can easily be detected with a box plot. Because there are more data with low values, the median is shifted towards the low end of the data range and is found towards the bottom of the box. Usually there are also several samples with extreme high results. They might (wrongly) be considered to be outliers and deleted from the data set. If they are kept in, they have too large an influence in the data analysis. Moreover, many statistical tests, such as the t-test, assume normal distribution of the data. When a skewed distribution is encountered it is, therefore, always better to try transforming it into a more symmetric and preferably a normal distribution.
An example of a skewed distribution is shown in Figure 3(a). The box plots concern the same trace compound in two sets of about 50 wine samples. In Figure 3(b) the same data as in Figure 3(a) are shown, but instead of the raw data, their logarithms are plotted. This is called a log transform. There are now no longer extreme values and the median has moved towards the centre of the box. This is indicative of what is called a log-normal distribution. Statistical tests like the Kolmogorov–Smirnov test can be applied to confirm that the data are indeed log-normally distributed. For the wine data set it was shown that nearly all of the close to 100 variables were log-normally distributed, and the log values were systematically substituted for the original data in the subsequent data analysis.
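The effect of a log transform on the box plot can be illustrated with synthetic data. The sketch below does not use the wine data: it builds 50 idealized log-normal values (the exponentials of evenly spaced normal quantiles, an assumption made purely for illustration) and counts the extreme values and the position of the median within the box before and after taking logarithms. All function and variable names are our own.

```python
import math
from statistics import NormalDist


def median(ranked):
    n, mid = len(ranked), len(ranked) // 2
    return ranked[mid] if n % 2 else (ranked[mid - 1] + ranked[mid]) / 2


def quartiles(values):
    ranked = sorted(values)
    n, half = len(ranked), len(ranked) // 2
    return median(ranked[:half + n % 2]), median(ranked), median(ranked[half:])


def summarize(values):
    """Number of extreme values and relative position of the median in the box."""
    q1, med, q3 = quartiles(values)
    iqr = q3 - q1
    n_extreme = sum(1 for x in values if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr)
    median_position = (med - q1) / iqr   # 0 = bottom of box, 0.5 = centre, 1 = top
    return n_extreme, median_position


n = 50
logs = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]  # symmetric values
raw = [math.exp(v) for v in logs]                               # skewed, long high tail

raw_extremes, raw_position = summarize(raw)
log_extremes, log_position = summarize(logs)
print(raw_extremes, round(raw_position, 2))
print(log_extremes, round(log_position, 2))
```

For these synthetic values the raw series shows extreme values at the high end and a median in the lower part of the box, whereas the log-transformed series shows no extreme values and a median at the centre of the box.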
Comparing Series of Results
Comparing two or more series of results is one of the most often performed data analysis tasks. Classical statistical methods are the t-test for comparing means and the F-test for comparing variances of two series of data, and analysis of variance (ANOVA) for more than two series. These methods are vulnerable to the presence of outliers and are based on assumptions such as normal distributions and (depending on the test) equal variance.
The juxtaposition of box plots is an excellent way to investigate if there are differences between the data sets and can be applied without any statistical assumptions. In our wine example, the analyst would like to know whether certain variables allow a distinction to be made between wines of different countries. In Figure 4 the discrimination potential of three variables is investigated. It is immediately clear that variable (a) is useless: the box plots show that the four countries yield similar results. Variable (b) shows some potential since it discriminates most of the B samples from those from country D; also, C is to some extent separated from both B and D. Alone, this variable is not able to perform a complete discrimination between any pair of countries, but together with other variables (i.e., using a multivariate method), it might still prove useful. Variable (c) achieves excellent separation of country D from the others.
Conclusion
Always look at your data!
Acknowledgement
This publication contains results from the EU project "Establishing of a wine data bank for analytical parameters for wines from third countries (wine DB)", a shared-cost RTD project funded under the Fifth Framework Programme of the European Community, within the Competitive and Sustainable Growth Programme (G6RD-CT-2001-00646). The authors are solely responsible for the content of the publication. It does not represent the opinion of the Community nor is the Community responsible for any use that might be made of the data herein.
References
1. D.L. Massart and Y. Vander Heyden, LC•GC Eur., 17(11), 586–591 (2004).
Column editor Desire Luc Massart is a part-time professor at the Vrije Universiteit Brussel, Belgium. He performs research on chemometrics in process analysis and its use in the detection of counterfeiting products or illegal manufacturing
Figure 3: Box plots of two series of wine samples. (a) Original data, (b) after log transformation.
processes.
Johanna (An) Smeyers-Verbeke is a professor at the Vrije Universiteit Brussel and is head of the department of analytical chemistry.
Xavier Capron is a PhD student at the same department.
Karin Schlesier, PhD in natural sciences, employed at the Federal Institute for Risk Assessment, Berlin, Germany, is responsible for the coordination of the multinational WINE DB project.
Figure 4: Box plots comparing the results obtained for wine samples from four countries (A–D) for three variables (a–c).