Tips and Tricks Group... · 2016-03-11 · Tips & Tricks With lots of help from other SUG and SUGI...
Transcript of Tips and Tricks Group... · 2016-03-11 · Tips & Tricks With lots of help from other SUG and SUGI...
Tips & Tricks
With lots of help from other SUG and
SUGI presenters
SAS CALGARY USER GROUP
OCTOBER 12, 2012
1
2
Sorting
3
THREADS
Multi-threading available if your computer has more than one
processor (CPU)
Divide and conquer: calculations are split up and run in
parallel on the processors
4
SORTING
•In memory or using a utility file
– details option
•Three copies of the data
– Original data set
– Temporary sorted files
– Sorted version of the data set
– Original file not overwritten until sort is complete …
• …. unless you specify the ‘overwrite’ option
5
SORTING - TAGSORT
•Useful when there is not enough disk space to sort a large data set.
•Only sort keys (listed on the by statement) and observation number (= tags) are sorted
•Tags are used to retrieve record from input dataset in sorted order.
•If sort keys small relative to total record length, temporary disk use is reduced.
•Requires reading each observation twice
•Single-threaded.
• Considerably slower
6
THE ‘NOEQUALS’
OPTION AND SQL
Default sorting behavior is to retain the order of records
within each ‘by’ group
‘noequals’ does not retain the order
Faster …..
….unless you are relying on retaining the order
SQL does the equivalent of ‘noequals’
7
COMPRESSION AND
SORTING
•SAS compress removes repeating blanks, characters, and
numbers from each observation
•Adds a tag, containing the information needed to
uncompress the observation.
•Sorting may be faster on an compressed data set (less I/O)
•!!Compress can result in a data set that is larger than the
original (even in v9)
– – size of tag may be larger than saved space.
8
SAVING DISK SPACE
9
COMPRESS (AGAIN)
options compress = yes;
• Requires CPU to uncompress each observation prior to use
• Data sets may grow rather than shrink
• Probably not a good idea to apply compression automatically
data mydata (compress = yes);
• Check log file to see if file shrunk
10
DROP VARIABLES AND
OBSERVATIONS
•Drop unnecessary variables ASAP (use the keep and drop
options on the data step)
•Get rid of unnecessary observations ASAP (use the “where”
statement on input to avoid wasting CPU time processing
observations you don’t need to process)
data females; set all (where = (sex = ‘F’));
•Murphy’s law of SAS datasets
– Any variable that is dropped from a dataset will be required
two procs later.
11
LENGTH STATEMENT
•Each character (including spaces!) requires 1 byte.
data bigword;
length word $10;
length flag 3;
To change the length of a variable after the fact
data smallword;
length word $4;
set mydata;
12
NUMBERS
Length in
bytes
Largest integer
represented exactly
Exponential notation
3 8,192 213
4 2,097,152 221
5 536,870,912 229
6 137,438,953,472 237
7 35,184,372,088,832 245
8 9,007,119,254,740,992 253
13
14
THE DATA STEP
15
DATA SET LISTS
Useful for reading in like-named datasets.
Can specify libraries.
WHERE, DROP, KEEP etc. can be applied in one data step.
16
MULTIPLE INPUT
DATASETS
Useful when you want to read in multiple like-named datasets
that have an incremental numbering sequence as part of the
name.
e.g.
indata1, indata2, indata3…indata50
17
OPTIONS
•Write Macro
– Can complicate data step
• Can type the datasets names
– Time consuming
– Prone to error
– Do you really want to do this?
data result_out;
set indata1 indata2 indata3 indata4 indata5 indata6 indata7 indata8
indata9 indata10 indata11 indata12 indata13 indata14 indata15 indata16
indata17 indata18 indata19 indata20 indata21 indata22 indata23 indata24
indata25 indata26 indata27 indata28 indata29 indata30 indata31 indata32
indata33 indata34 indata35 indata36 indata37 indata38 indata39 indata40
indata41 indata42 indata43 indata44 indata45 indata46 indata47 indata48
indata49 indata50;
run;
18
USE DASH LISTS
•Dash lists make it simple
data out;
set indata1 - indata50;
run;
•Can also specify library
data out;
set cmlib.indata1 – cmlib.indata50;
run;
19
MULTIPLE INPUT
DATASETS
PART 2
Useful when you want to read in multiple like named datasets
that start with the same prefix or when all the datasets to be
read are not known
e.g. Prod_region1, Prod_region4, Prod_region7,
Prod_region21…Prod_region36
20
OPTIONS
• Write Macro
– Can complicate datastep
• Can type the datasets names
– Time consuming
– Prone to error
– Do you really want to do this?
data total_production;
set Prod_region1 Prod_region1 Prod_region4 Prod_region7
Prod_region20 Prod_region21 Prod_region24,Prod_region27
Prod_region31 Prod_region32 Prod_region33 Prod_region36;
run;
21
USE COLON LISTS
Colon lists make it simple
data out;
set prod_region: ; run;
Can also specify library
data out;
set CMLIB.prod_region: ; run;
22
23
WHAT ELSE CAN THE
DATA STEP DO?
• Lots of stuff, but…here’s one example
• You already know how to read in multiple datasets…
• What if you want to write out multiple datasets?
Data east_production (where=(region=1))
west_production (where=(region=2))
north_production (where=(region=3))
south_production (where=(region=4));
Set prod_reg:;
Run;
24
25
WHAT ABOUT EG?
ALL THIS APPLIES TO
EG ALSO!
26
OTHER TIPS FOR EG
USERS
• Autoexec in Enterprise Guide?
• Query option to limit output during development
• Use Notes node to document Project
• Organize project using multiple process flows
• Understand your data before performing analysis on it
• Frequency counts
• Charts
• Characterize data
27
28
29
30
31
QUESTIONS?
32
33