3. chapter iii(aggregate data)

26
DATA

Transcript of 3. chapter iii(aggregate data)

Page 1: 3. chapter iii(aggregate data)

DATA

Page 2: 3. chapter iii(aggregate data)

Aggregate Data• Aggregate Data aggregates groups of cases in the active dataset into single

cases and creates a new, aggregated file or creates new variables in the active dataset that contain aggregated data. Cases are aggregated based on the value of zero or more break (grouping) variables. If no break variables are specified, then the entire dataset is a single break group.– If you create a new, aggregated data file, the new data file contains one

case for each group defined by the break variables. For example, if there is one break variable with two values, the new data file will contain only two cases. If no break variable is specified, the new data file will contain one case.

– If you add aggregate variables to the active dataset, the data file itself is not aggregated. Each case with the same value(s) of the break variable(s) receives the same values for the new aggregate variables. For example, if gender is the only break variable, all males would receive the same value for a new aggregate variable that represents average age. If no break variable is specified, all cases would receive the same value for a new aggregate variable that represents average age.

Page 3: 3. chapter iii(aggregate data)

Aggregate Data• Break Variable(s). Cases are grouped together based on the values of the

break variables. Each unique combination of break variable values defines a group. When creating a new, aggregated data file, all break variables are saved in the new file with their existing names and dictionary information. The break variable, if specified, can be either numeric or string.

• Aggregated Variables. Source variables are used with aggregate functions to create new aggregate variables. The aggregate variable name is followed by an optional variable label, the name of the aggregate function, and the source variable name in parentheses.

Page 4: 3. chapter iii(aggregate data)

Aggregate Data• Aggregate Data: Aggregate Function• This dialog box specifies the function to use to calculate aggregated data

values for selected variables on the Aggregate Variables list in the Aggregate Data dialog box. Aggregate functions include:

• Summary functions for numeric variables, including mean, median, standard deviation, and sum

• Number of cases, including unweighted, weighted, nonmissing, and missing• Percentage or fraction of values above or below a specified value• Percentage or fraction of values inside or outside of a specified range• Aggregate Data: Variable Name and Label• Aggregate Data assigns default variable names for the aggregated variables in

the new data file. This dialog box enables you to change the variable name for the selected variable on the Aggregate Variables list and provide a descriptive variable label. See the topicVariable names for more information.

Page 5: 3. chapter iii(aggregate data)

Aggregate Data• Data=>Aggregate

Page 6: 3. chapter iii(aggregate data)
Page 7: 3. chapter iii(aggregate data)
Page 8: 3. chapter iii(aggregate data)

SELECT Case• Select Cases provides several methods for selecting a subgroup of cases

based on criteria that include variables and complex expressions. You can also select a random sample of cases. The criteria used to define a subgroup can include:

• • Variable values and ranges• • Date and time ranges• • Case (row) numbers• • Arithmetic expressions• • Logical expressions• • Functions

Page 9: 3. chapter iii(aggregate data)

SELECT Case• Output• This section controls the treatment of unselected cases. You can choose one

of the following alternatives for the treatment of unselected cases:• • Filter out unselected cases. Unselected cases are not included in the

analysis but remain in the dataset. You can use the unselected cases later in the session if you turn filtering off. If you select a random sample or if you select cases based on a conditional expression, this generates a variable named filter_$ with a value of 1 for selected cases and a value of 0 for unselected cases.

• Copy selected cases to a new dataset. Selected cases are copied to a new dataset, leaving the original dataset unaffected. Unselected cases are not included in the new dataset and are left in their original state in the original dataset.

• Delete unselected cases. Unselected cases are deleted from the dataset. Deleted cases can be recovered only by exiting from the file without saving any changes and then reopening the file. The deletion of cases is permanent if you save the changes to the data file.

Page 10: 3. chapter iii(aggregate data)

SELECT Case:if• This dialog box allows you to select subsets of cases using conditional

expressions. A conditional expression returns a value of true,false, or missing for each case.

• • If the result of a conditional expression is true, the case is included in the selected subset.

• • If the result of a conditional expression is false or missing, the case is not included in the selected subset.

• • Most conditional expressions use one or more of the six relational operators (<, >, <=, >=, =, and ~=) on the calculator pad.

• • Conditional expressions can include variable names, constants, arithmetic operators, numeric (and other) functions, logical variables, and relational operators.

Page 11: 3. chapter iii(aggregate data)

Data=>Select Case…

SELECT Case IF

Page 12: 3. chapter iii(aggregate data)

If=> …=>Continue

Page 13: 3. chapter iii(aggregate data)
Page 14: 3. chapter iii(aggregate data)

Select Cases: Random Sample• This dialog box allows you to select a random sample based on an

approximate percentage or an exact number of cases. Sampling is performed without replacement; so, the same case cannot be selected more than once.

• Approximately. Generates a random sample of approximately the specified percentage of cases. Since this routine makes an independent pseudo-random decision for each case, the percentage of cases selected can only approximate the specified percentage. The more cases there are in the data file, the closer the percentage of cases selected is to the specified percentage.

• Exactly. A user-specified number of cases. You must also specify the number of cases from which to generate the sample. This second number should be less than or equal to the total number of cases in the data file. If the number exceeds the total number of cases in the data file, the sample will contain proportionally fewer cases than the requested number.

Page 15: 3. chapter iii(aggregate data)

SELECT Case Random

Page 16: 3. chapter iii(aggregate data)
Page 17: 3. chapter iii(aggregate data)

Select Cases: Range• This dialog box selects cases based on a range of case numbers or a range of

dates or times.• • Case ranges are based on row number as displayed in the Data Editor.• • Date and time ranges are available only for time series data with defined

date variables (Data menu, Define Dates).

Page 18: 3. chapter iii(aggregate data)
Page 19: 3. chapter iii(aggregate data)
Page 20: 3. chapter iii(aggregate data)

Filter Variable• Select only Missing value and zero value

Page 21: 3. chapter iii(aggregate data)
Page 22: 3. chapter iii(aggregate data)

Weight Cases• Weight Cases gives cases different weights (by simulated replication) for statistical

analysis.• • The values of the weighting variable should indicate the number of observations

represented by single cases in your data file.• • Cases with zero, negative, or missing values for the weighting variable are excluded

from analysis.• • Fractional values are valid and some procedures, such as Frequencies, Crosstabs, and

Custom Tables, will use fractional weight values. However, most procedures treat the weighting variable as a replication weight and will simply round fractional weights to the nearest integer. Some procedures ignore the weighting variable completely, and this limitation is noted in the procedure-specific documentation.

• Once you apply a weight variable, it remains in effect until you select another weight variable or turn off weighting. If you save a weighted data file, weighting information is saved with the data file. You can turn off weighting at any time, even after the file has been saved in weighted form.

• Weights in Crosstabs. The Crosstabs procedure has several options for handling case weights. See the topic Crosstabs cell display for more information.

• Weights in scatterplots and histograms. Scatterplots and histograms have an option for turning case weights on and off, but this does not affect cases with a zero, negative, or missing value for the weight variable. These cases remain excluded from the chart even if you turn weighting off from within the chart.

Page 23: 3. chapter iii(aggregate data)

Sort Case• Data=>Sort Case

Page 24: 3. chapter iii(aggregate data)

Transpose• Transpose creates a new data file in which the rows and columns in the

original data file are transposed so that cases (rows) become variables and variables (columns) become cases. Transpose automatically creates new variable names and displays a list of the new variable names.– A new string variable that contains the original variable name, case_lbl,

is automatically created.– If the active dataset contains an ID or name variable with unique values,

you can use it as the name variable, and its values will be used as variable names in the transposed data file. If it is a numeric variable, the variable names start with the letter V, followed by the numeric value.

– User-missing values are converted to the system-missing value in the transposed data file. To retain any of these values, change the definition of missing values in the Variable View in the Data Editor.

Page 25: 3. chapter iii(aggregate data)
Page 26: 3. chapter iii(aggregate data)

Example