SUGI 24: Two Macros to Compute the Median for a List of ... · PDF fileTwo Macros to Compute...

4
Two Macros to Compute the Median for A List of Variables Fan Xu, SpectRx Inc., Norcross, GA Abstract SAS® does not have a built-in function to compute the median of a list of variables as it does for the mean. The median has some unique statistical properties such as insensitivity to outliers. It is also more preferable as a measure of the center location for a skewed distribution. Two macros are presented in this paper to compute the median of a list of variables. Examples are given to illustrate the applications. The readers are assumed to have a good understanding of SAS BASE procedures as well as advanced knowledge of Macro processing. The first macro makes use of the TRANSPOSE procedure and the median statistics in the UNIVARIATE procedure [1]. The second macro is designed to compute the median at the level of each observation within a data step. Array is used for sorting. Key Words Median Statistics, TRANSPOSE Procedure, UNIVARIATE Procedure, Array, Macro. Introduction SAS has built-in statistical functions to calculate the mean, standard deviation, and coefficient of variation. There is no standard function in SAS for the median. When the distribution is highly skewed or there are outstanding outliers, the median is always preferred to the mean as a measure of the average or the center of a distribution. The UNIVARIATE procedure does calculate the median statistics, but it is for one variable instead of a list of variables. Two macros are presented to compute the median of a list of variables. The first one utilizes the TRANSPOSE procedure. One observations in a list of variables is transposed to form one variable. The UNIVARIATE procedure is then applied to compute the median. The second macro is meant for use inside a data step for each observation. It copies the list of variables to a temporary array. The temporary array is then sorted to find the median. The Programs The first macro starts with the following macro call, where ‘datain’ is the name of the data set, ‘varlist’ is a list of variable names separated by a space and ‘medname’ is the name given to the resulting median. The default name for the resulting median is ‘median’. When a data set is transposed, the rows become the columns, and vice versa. So one observation in a list of variables becomes one variable. The number of observations in the original data set is the number of variables in the transposed data. The following is a classic macro to find the number of observations in a data set [2]. Note since the IF condition is always false, the SET statement does not execute. But the NOBS option is executed when the program is complied. The SYMPUT function is used to create a macro variable of the same name containing the number of observations in the data set. Coders' Corner

Transcript of SUGI 24: Two Macros to Compute the Median for a List of ... · PDF fileTwo Macros to Compute...

Page 1: SUGI 24: Two Macros to Compute the Median for a List of ... · PDF fileTwo Macros to Compute the Median for A List of Variables Fan Xu, SpectRx Inc., Norcross, GA ... functions to

Two Macros to Compute the Median for A List of Variables

Fan Xu, SpectRx Inc., Norcross, GA

AbstractSAS® does not have a built-in

function to compute the median of a list ofvariables as it does for the mean. Themedian has some unique statisticalproperties such as insensitivity to outliers.It is also more preferable as a measure ofthe center location for a skeweddistribution. Two macros are presented inthis paper to compute the median of a listof variables. Examples are given toillustrate the applications. The readers areassumed to have a good understanding ofSAS BASE procedures as well asadvanced knowledge of Macro processing.

The first macro makes use of theTRANSPOSE procedure and the medianstatistics in the UNIVARIATE procedure[1]. The second macro is designed tocompute the median at the level of eachobservation within a data step. Array isused for sorting.

Key WordsMedian Statistics, TRANSPOSE

Procedure, UNIVARIATE Procedure,Array, Macro.

IntroductionSAS has built-in statistical

functions to calculate the mean, standarddeviation, and coefficient of variation.There is no standard function in SAS forthe median. When the distribution ishighly skewed or there are outstandingoutliers, the median is always preferred tothe mean as a measure of the average orthe center of a distribution. TheUNIVARIATE procedure does calculatethe median statistics, but it is for onevariable instead of a list of variables.

Two macros are presented tocompute the median of a list of variables.

The first one utilizes the TRANSPOSEprocedure. One observations in a list ofvariables is transposed to form onevariable. The UNIVARIATE procedure isthen applied to compute the median. Thesecond macro is meant for use inside adata step for each observation. It copiesthe list of variables to a temporary array.The temporary array is then sorted to findthe median.

The ProgramsThe first macro starts with the

following macro call, where ‘datain’ is thename of the data set, ‘varlist’ is a list ofvariable names separated by a space and‘medname’ is the name given to theresulting median. The default name for theresulting median is ‘median’.

�PDFUR�PHGLDQ�GDWDLQ� ��YDUOLVW� �

��������PHGQDPH� �PHGLDQ��

When a data set is transposed, therows become the columns, and vice versa.So one observation in a list of variablesbecomes one variable. The number ofobservations in the original data set is thenumber of variables in the transposeddata. The following is a classic macro tofind the number of observations in a dataset [2]. Note since the IF condition isalways false, the SET statement does notexecute. But the NOBS option is executedwhen the program is complied. TheSYMPUT function is used to create amacro variable of the same namecontaining the number of observations inthe data set.

GDWD�BQXOOB�

����LI���WKHQ�VHW�GDWDLQ�QREV� �QREV�

����FDOO�V\PSXW�QREV��QREV��

UXQ�

Coders' Corner

Page 2: SUGI 24: Two Macros to Compute the Median for a List of ... · PDF fileTwo Macros to Compute the Median for A List of Variables Fan Xu, SpectRx Inc., Norcross, GA ... functions to

The data set is transposed by theTRANSPOSE procedure. By default thefirst observation in the list of variables isnamed ‘col1’, the second ‘col2’ and so on.The UNIVARIATE procedure is thenapplied to find the medians of ‘col1’ to‘col&nobs’. The medians are output to adata set called ‘mout’ with the samenames.

SURF�WUDQVSRVH�GDWD� �GDWDLQ�RXW� �WRXW��

����YDU�YDUOLVW�

UXQ�

SURF�XQLYDULDWH�GDWD� �WRXW��QRSULQW�

����YDU�FRO��FROQREV��RXWSXW�RXW� �PRXW

��������PHGLDQ� �FRO��FROQREV�

UXQ�

Remember each mediancorresponds to one observation. The nextstep is to transpose the output file back.The medians are merged with the originaldata and aligned correctly with theoriginal observations. Again the singlerow of medians in ‘mout’ is transposed toa single column called ‘col1’ by default.The transposed data ‘tout2’ is mergedwith the original data observation byobservation. Variable ‘col1’ is renamed tothe user-selected name (or ‘median’ bydefault).

SURF�WUDQVSRVH�GDWD� �PRXW�RXW� �WRXW��

GDWD�GDWDLQ�

����PHUJH�GDWDLQ�WRXW��NHHS� �FRO�

��������UHQDPH� ��FRO�� �PHGQDPH���

����ODEHO�PHGQDPH� ��0HGLDQ�RI

��������YDUOLVW��

UXQ�

�PHQG�

The second macro entails moreprogramming. This macro is invokedinside a data step. It starts with thefollowing macro call with the list ofvariable names (‘varlist’) and the name

given to the resulting median(‘medname’). The default name is‘median’.

�PDFUR�PHGLDQ�YDUOLVW� ���PHGQDPH�

��������PHGLDQ��

The first step is to find out howmany variables there are in the list. It isaccomplished by the following macrosegment. ‘n’ is a macro variable initializedto zero. It is incremented by one when the%SCAN function scans another non-nullstring. The %SCAN function scans for thenext (&n + 1) string in the list of variablesdelimited by space. If the string is null(%str()), the UNTIL condition becomesfalse. At the end of the loop &n is thenumber of variables in the list.

�OHW�Q� ���

�GR��XQWLO���VFDQ�YDUOLVW���HYDO�Q��

�������������VWU����� ��VWU�����

�����OHW�Q� ��HYDO�Q������

�HQG�

When sorting the list of variables,we do not want to change the actualvalues of the variables. So a copy of thevariables is made by a temporary arraycalled ‘temp1’ to ‘temp&n’.

DUUD\�\� ��YDUOLVW�

DUUD\�W� ��WHPS��WHPSQ�

GR�L� ���WR�Q��W�L�� �\�L���HQG�

Sorting is done on the temporaryarray. Note that the array is sorted in thedescending order. The variable ‘temp’serves as a swapping place when twovalues are switched.

GR�L� ���WR�Q���

����GR�M� �L�����WR�Q�

��������LI�W�L����W�M��WKHQ�GR�

������������WHPS� �W�L��

������������W�L�� �W�M��

������������W�M�� �WHPS�

��������HQG�

����HQG�

Coders' Corner

Page 3: SUGI 24: Two Macros to Compute the Median for a List of ... · PDF fileTwo Macros to Compute the Median for A List of Variables Fan Xu, SpectRx Inc., Norcross, GA ... functions to

HQG�

The last step is to compute themedian. It is important to remember thatthere could be missing values. Missingvalues are smaller than any numericalvalues in SAS by definition. When thearray is sorted in the descending order, themissing values (if any) are put in the rightend. Only the first ‘m’ variables with non-missing values are used to compute themedian. The median is defined as themiddle observation of a sorted sequence ifthe number of observations is odd. If thenumber of observations is even, themedian is defined as the mean of the twomiddle observations. All temporaryvariables are dropped in the end of thesession.

P� �Q���QPLVV�RI�WHPS��WHPSQ��

LI�PRG�P����� ���WKHQ�PHGQDPH�

���������W�P������W�P�����������

HOVH�PHGQDPH� �W��P�������

GURS�L�M�P�WHPS��WHPSQ�WHPS�

ODEHO�PHGQDPH� ��0HGLDQ�RI�YDUOLVW��

�PHQG�

The examplesTo illustrate how the first macro

works, suppose we have the followinghypothetical data set ‘example1’:

2%6���$�����%�����&�����'

��������������������������

��������������������������

��������������������������

The first macro is used to computethe median of variables A, B, C and D byinvoking %median(datain = example1,varlist = a b c d, medname = median); Thetransposed data of the original datafollows. Note each observation becomes avariable.

2%6���B1$0(B����&2/�����&2/�����&2/�

��������$��������������������������

��������%��������������������������

��������&��������������������������

��������'��������������������������

The output file from theUNIVARIATE procedure is a row ofmedians of each variable in the transposeddata set:

2%6���&2/�����&2/�����&2/�

�������������������������

This data set is transposed andmerged back with the original data tocreate the final data set:

2%6����$�����%�����&�����'����0(',$1

�����������������������������������

�����������������������������������

�����������������������������������

The second example is a clinicaltrial where five repeated measurementswere made on each subject. Some kind ofaverage of the 5 measurements is thedesired result. It is possible that there arebad measurements (outliers). Since themean is highly sensitive to the outliers, themedian is the right statistics for average inthis case. Outliers need to be identified forthe purpose of further investigation. Forthis application, outliers are defined as theobservations whose distances from thecenter location of the five measurementsare greater than a preset threshold.Because of its insensitivity to outliers, themedian is used to estimate the centerlocation. Using the second macro, thisalgorithm can easily be carried out. Seethe following program. ‘m1’ to ‘m5’ arethe five measurements. Note after themacro call, the resulting median of ‘m1’to ‘m5’ is stored in a variable named‘median’ by default. If the criterion foroutliers is met, a message is printed in thelog window.

Coders' Corner

Page 4: SUGI 24: Two Macros to Compute the Median for a List of ... · PDF fileTwo Macros to Compute the Median for A List of Variables Fan Xu, SpectRx Inc., Norcross, GA ... functions to

GDWD�P\GDWD�

����VHW�P\GDWD�

�����PHGLDQ�YDUOLVW� �P��P��P��P��P���

����DUUD\�PHDVXUH�P��P��

����GR�L� ���WR���

��������LI�DEV�PHDVXUH�L����PHGLDQ��!

����������������WKUVKOG�WKHQ�GR�

������������SXW��2XWOLHU�'HWHFWHG����BQB

������������������2EVHUYDWLRQ����L

������������������0HDVXUHPHQW��

��������HQG�

����HQG�

UXQ�

The following plot illustrates theresult with a simulated data set. The solidline is the median, the dashed-lined bandsindicate the preset threshold, and the starsare the detected outliers.

Illustration of Median Application

5 M

ea

sure

men

ts

-5

0

5

10

15

20

25

Median-5 0 5 10 15 20 25

Conclusion

The median statistics are veryuseful in many situations. While SAS doesnot have a built-in function to compute themedian for a list of variables, the twomacros presented can be used easily forthis purpose. The users can choose tocompute the median either within oroutside a data step.

References1. SAS Institute Inc. (1995), SAS

Procedures Guide, Version 6, ThirdEdition, Cary, N.C:SAS Institute Inc.

2. SAS Institute Inc. (1995), SAS Guideto Macro Processing, Version 6,Second Edition, pp. 263, Cary, N.C.:SAS Institute Inc.

TrademarksSAS® is a registered trademark of theSAS Institute, Inc., in the USA and inother counties. ® indicates USAregistration.

ContactFan XuSpectRx Inc.6025 A Unity DriveNorcross, GA 30071

(770)[email protected]

Coders' Corner