Notes for SAS programming
Econ424
Fall 2007
Why SAS?
• Able to process large data set(s)
• Easy to cope with multiple variables
• Able to track all the operations on the data set(s)
• Generate systematic output
• Summary statistics
• Graphs
• Regression results
• Most government agencies and private sectors use SAS
Roadmap
• Basic rules
• Read in data
• Data cleaning commands
• Summary statistics
• Combine two or more datasets
• Hypothesis testing
• Regression
Basic rules (1) – organize files
• .sas – program file• .log – notes, errors, warnings • .lst – output• .sas7bdat – data file • library – a cabinet to put data in
– Default: Work library• temporary, erased after you close the session
– Permanent library• libname mylib “m:\”; • mylib.mydata
= a sas data file named “mydata” in library “mylib”
• run and recall .sas
Basic rules (2) -- program
• every command ends with ;• format does not matter
if x=1 then y=1; else y=2; is the same asif x=1 then y=1;
else y=2;
• case insensitive• comment
* this is comment;/* this is comment */;
Basic rule (3) – variable
• Type– numeric (default, 8 digit, . stands for missing value) – character ($, default 8 digit, blank stands for missing)
• Variable names – <=32 characters if SAS 9.0 or above– <=8 characters if SAS 8 or below– case insensitive
• Must start with letter or “_” _name, my_name, zip5, u_and_me -name, my-name, 5zip, per%, u&me, my@w, my$sign
Basic rules (4) – data step
• Data stepDATA distancenew;
set distance;
miles=26.22;
kilometers=1.61*miles;
input data “distance”
obs1
obs2
…
obs n
define “miles”
define “kilometers”
output data “distancenew”
obs1
…
obs n
Basic rules (5) – proc step
• PROC stepPROC PRINT data=distancenew;
var miles kilometers;
title “print out distancenew”;
run;
Read in data (1) – table editor
• Tools – table editor
• choose an existing library or define a new library
• rename variable
• save as
Read in data (2) -- datalines
data testdata1;infile datalines; input id height weight gender $ age;datalines; 1 68 144 M 232 78 . M 343 62 99 F 37; /* you only need one semicolon at the end of all data lines, but the
semicolon must stand alone in one line */
proc contents data=testdata1; run;proc print data=testdata1; run;
No ; until you finish all the data lines
Read in data (3) – more datalines
/* read in data in fixed columns */
data testdata1;
infile datalines;
input id 1 height 2-3 weight 4-6 gender $7 age 8-9;
datalines;
168144M23
278 M34
36299 F37
;
Read in data (4) – more datalines
data testdata1;
infile datalines;
input id : 1. height : 2. weight : 3. gender : $1. age : 2.;
datalines;
1 68 144 M 23
2 78 . M 34
3 62 99 F 37
;
Read in data (4) – more datalines
*alternatively;
data testdata1;
infile datalines;
informat id 1. height 2. weight 3. gender $1. age 2.;
input id height weight gender age;
datalines;
1 68 144 M 23
2 78 . M 34
3 62 99 F 37
;
Read in data (5) – .csv
data testdata1;infile datalines dlm=‘,’ dsd missover;
/* what if you do not have dsd and/or missover */input id height weight gender $ age ;datalines; 1, 68, 144, M, 23 2, 78, , M, 343, 62, 99, F, 37; /* what if you forget to type 23 */run;
Read in data (6) – .csv file
/* save the midterm data in a comma delimited file (.csv) before running the following codes */
libname mylib "M:\";
filename midterm "M:\midterm-fall2005-1997rank.csv";
data mylib.midterm;
infile midterm firstobs=3 dlm=',' dsd missover lrecl=900;
input year: 4. name: $50. score reputation accept_rate graduate_rate;
run;
proc contents data=mylib.midterm;
run;
proc print data=mylib.midterm;
run;
Note the options in the infile command!
16
Read in data (7) – from excel
filename myexcel “M:\midterm-fall2005-1997rank.xls”;proc import datafile=myexcel out=readin1
DBMS=excel replace; run;data readin1;
set readin1;acceptper=accept_rate*100;gradper=graduate_rate*100;
run;proc contents data=readin1; run;proc print data=readin1; run;
No ; until the end of the whole sentence
17
Read in data (7) – from excel
Be careful !SAS will read the first line as variable names, our original data has two non-data lines – the first as labels, the second as variable names. You need to delete the first line to make sure correct data input into SAS. What happens if you don’t delete the first line? What happens if you delete the second line?SAS assigns numeric and character type automatically. Sometime it does make mistake.
Data cleaning (1) – if then
Format: IF condition THEN action;ELSE IF condition THEN action;ELSE action;
Note: (1) the if-then-else can be nested as many as you
want(2) if you need multiple actions instead of one
action, use “DO; action1; action2; END; ”
Data cleaning (1) – if then
• = or EQ means equals• ~= or NE means not equal• > or GT means greater than• < or LT means less than• >= or GE means greater than or equal • <= or LE means less than or equal • in means subset
– if gender in (‘M’, ‘F’) then ..;
• Multiple conditions: AND (&), OR(|)
Data cleaning (1) – if then
*reading in program of testdata1 is on page 14;data testdata1;set mylib.testdata1;IF gender='M' AND age<=25 THEN testgroup=0; ELSE testgroup=1; run;proc contents data=testdata1; run;proc print data=testdata1; run;Note: (1) the code is less efficient if you replace ELSE ..; with IF gender~= 'M' OR age>25 THEN ..;(2) missing value is always counted as the smallest negative, so age=. will
satisfy the condition age<=25. If you want to ignore the obs with age=. set the condition as 0<age<=25.
Data cleaning (1) – if then
* Multiple actions in each branch;
data testdata1;set testdata1;IF gender='M' AND age<=25 THEN DO;
group=0; activity='ballgame'; END;
ELSE DO; group=1; activity='movie'; END;
run;proc print data=testdata1; run;
Data cleaning (1) – if then
*Use if commands to choose a subsample;
data testdata2; /* note here we generate a new data set */set testdata1;IF age=. Then delete;If age>18 & weight~=.;run;proc print data=testdata2; run;
Data cleaning (1) – exercise
still use testdata1.define
weight in pounds bmi=--------------------------------- x 703
(height in inches)^2
define group = 1 if bmi<18.5 (underweight) 2 if 18.5 <= bmi < 25 (normal) 3 if 25<= bmi <30 (overweight) 4 if bmi>=30 (obese).
Data cleaning (1) – exercise answer
data testdata1;set testdata1;bmi=(weight/(height*height))*703;if bmi<18.5 then group=1;else if bmi<25 then group=2;
else if bmi<30 then group=3; else group=4;
run;proc contents data=testdata1; run;proc print data=testdata1; run;Question: What if one observation has weight=.?
Pages 24-27 are optional material for data cleaning.
They are not required, but you may find them useful in the future.
We skip them in the regular class.
Data cleaning (2) – convert variable type
Numeric to character:age1=put(age, $2.);age and age1 have the same contents but different formats
Character to numeric:age2=input(age1, 1.);now age and age2 are both numeric, but age2 is chopped at the first digit
Take a sub string of a characterage3=substr(age1,2,1); now age3 is a sub string of age1, starting from the second digit of age1 (the meaning of “2”) and having one digit in total (the meaning of “1”).
Data cleaning (2) - example
* we want to convert studid 012345678 to 012-345-678;data testdata2;infile datalines;input studid : 9. studname : $1.;datalines;012345678 A135792468 B009876543 C;proc print; run;data testdata2;set testdata2;if studid LT 1E+7 then studid1= '00’||compress(put(studid, $9.));else if 1E+7 LE studid LT 1E+8 then studid1='0'||compress(put(studid,
$9.)); else studid1= put(studid, $9.);
studid2=substr(studid1,1,3)||'-'||substr(studid1,4,3)||'-'||substr(studid1,7,3);
proc print; run;
Data cleaning (2) - exercise
You have the following data, variables in sequence are
SSN, score1, score2, score3, score4, score5:
123-45-6789 100 98 96 95 92
344-56-7234 69 79 82 65 88
898-23-1234 80 80 82 86 92
Calculate the average and standard deviation of the five scores for each individual. Use if-then command to find out who has the highest average score, and report his SSN without dashes.
Data summary roadmap
• Proc contents – variable definitions• Proc print – raw data• Proc format – make your print look nicer• Proc sort – sort the data• Proc means – basic summary statistics• Proc univariate – detailed summary stat• Proc freq – frequency count• Proc chart – histogram• proc plot – scatter plot
proc format (1)
*Continue the data cleaning exercise on page 23;
data testdata1;
set testdata1;
bmi=(weight/(height*height))*703;
if bmi<18.5 then group=1;
else if bmi<25 then group=2;
else if bmi<30 then group=3;
else group=4;
run;
Defining “group” as a numeric variable will
save space
proc format (2)
proc format; value bmigroup 1=‘under weight’
2=‘normal weight’3=‘overweight’4=‘obese’;
run;proc print data=testdata1; format group bmigroup. bmi 2.1; title ‘report bmi group in words’; label bmi=‘body mass index’ group=‘bmi group 1-4’; var bmi group; run;
no ; here until the end of the
label command
no ; here until the end of the
value command
proc sort
proc sort data=testdata1;
by gender age;
run;
proc sort data=testdata1 out=testdata2;
by gender descending age;
run;
* note that missing value is always counted as the smallest;
proc means and proc univariate
proc means data=testdata1;class gender;var height weight bmi;run;
proc sort data=testdata1; by gender; run;
proc univariate data=testdata1;by gender;var bmi;run;
By default, proc means report mean, stdev, min, max
Could choose what to report:proc means data=testdata1
n mean median;
By default, proc univariate report median, and many
other statistics
Notes on proc means and proc univariate
*if you do not use class or by command, the statistics are based on the full sample. If you use class or by var x, the statistics are based on the subsample defined by each value of var x.
*You can use class or by in proc means, but only by in proc univariate;
*whenever you use “by var x”, the data set should be sorted by var x beforehand;
More example onproc means and proc univariate
data readin1; set readin1;if accept_rate<0.20 then acceptgrp=“low”; else acceptgrp=“high”; run;
proc means data=readin1;class acceptgrp;var reputation accept_rate graduate_rate;run;
proc sort data=readin1; by acceptgrp; run;proc univariate data=readin1;
by acceptgrp;var reputation;run;
By default, proc means report mean, stdev, min, max
Could choose what to report:proc means data=readin1
n mean median;
By default, proc univariate report median, and many
other statistics
proc freq
data readin1;set readin1;if graduate_rate<0.5 then gradgrp=“low”;else if 0.5<=graduate_rate<0.8 then gradgrp=‘middle’;
else gradgrp=‘high’;run;
proc freq data=readin1;tables acceptgrp
gradgrp acceptgrp*gradgrp;
run;
proc chart -- histogram
proc chart data=readin1;title ‘histogram for acceptgrp’;vbar acceptgrp;run;
proc chart data=readin1;title ‘frequency by two variables’;vbar acceptgrp / group=gradgrp;run;
proc chart – continuous variable
proc chart data=readin1;
title “histogram for continuous variable’;
vbar accept_rate;
run;
proc chart data=readin1;
title ‘histogram with specific midpoints’;
vbar accept_rate / midpoints=0 to 1 by 0.1;
run;
proc plot – scatter plot
proc plot data=readin1;
title ‘scatter plot of accept rate and grad rate’;
plot accept_rate*graduate_rate;
run;
fancy proc means
proc means data=readin1;class acceptgrp gradgrp;var score reputation;output out = summary1
mean = avgscore avgreputn;run;
proc print data=summary1;run;
The following page may be useful in practice, but I am not
going to cover it in class.
some summary stat. in proc printfilename myexcel “M:\midterm-fall2005-1997rank.xls”;;proc import datafile=myexcel out=readin1 DBMS=excel replace; run;data readin1;
set readin1;if accept_rate<0.2 then acceptgrp=‘low’;
else if 0.2<=accept_rate<0.5 then acceptgrp=‘middle’; else acceptgrp=‘high’; run;proc sort data=readin1; by acceptgrp; run;proc print data=readin1 n;
where graduate_rate>=0.8; by acceptgrp; sum score; var name score accept_rate graduate_rate; run;
merge and append
readin1: name score reputation acceptgrp gradgrp
Harvard 100 3.9 low high
summary1: acceptgrp gradgrp avgscore avgreputn
low high 80 3.3
merged: name score reputation acceptgrp gradgrp avgscore avgreputn
Harvard 100 3.9 low high 80 3.3
appended:
name score reputation acceptgrp gradgrp avgscore avgreputn
Harvard 100 3.9 low high . .
. . low high 80 3.3
merge two datasets
proc sort data=readin1;by acceptgrp gradgrp;run;
proc sort data=summary1;by acceptgrp gradgrp;run;
data merged;merge readin1 (in=one) summary1 (in=two);by acceptgrp gradgrp;if one=1 & two=1; run; What if this line is
“if one=1 OR two=1;”?
Keep track of matched and unmatched records
data allrecords;merge readin1 (in=one) summary1 (in=two);by acceptgrp gradgrp;myone=one;mytwo=two;if one=1 or two=1; run;
proc freq data=allrecords;tables myone*mytwo;run;
SAS will drop variables “one” and “two”
automatically at the end of the DATA step. If you want to keep them, you can copy
them into new variables “myone” and “mytwo”
be careful about merge!
• always put the merged data into a new data set• must sort by the key variables before merge• ok for one-to-one, multi-to-one, one-to-multi, but
no good for multi-to-multi• be careful of what records you want to keep, and
what records you want to delete• what if variable x appears in both datasets, but x
is not in the “by” statement?– after the merge x takes the value defined in the last
dataset of the “merge” statement
append
data appended;
set readin1 summary1;
run;
proc print data=appended;
run;
proc print data=merged;
run;
Application : Project 5
rest_rating:
Restname Ratings Price Loc1 Phone1 Loc2 Phone2
Addie’s 1 30 Rockville 301-881-0081
Andalucia 2 30 Rockville 301-770-1880
Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501
city_demo:
Loc State Pop Income
Bethesda MD 62936 38976
Falls Church VA 9578 26709
Rockville MD 44835 21484
How to merge the two data sets?
Steps to merge the datasets
1. reshape data set 1 – focus on loc1rest_rating:
Restname Ratings Price Loc1 Phone1 Loc2 Phone2
Addie’s 1 30 Rockville 301-881-0081
Andalucia 2 30 Rockville 301-770-1880
Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501
restloc1
restname .. loc phone
Addie’s Rockville 301-881-0081
Andalucia Rockville 301-770-1880
Haandi Bethesda 301-718-0121
data restloc1;
set rest_rating;
loc=loc1;
phone=phone1;
drop loc1 loc2 loc3 loc4 phone1 phone2 phone3 phone4;
run;
Steps to merge the datasets
1. reshape data set 1 – focus on loc2rest_rating:
Restname Ratings Price Loc1 Phone1 Loc2 Phone2
Addie’s 1 30 Rockville 301-881-0081
Andalucia 2 30 Rockville 301-770-1880
Haandi 2 10 Bethesda 301-718-0121 Falls Church 703-533-3501
data restloc2;
set rest_rating;
loc=loc2;
phone=phone2;
drop loc1 loc2 loc3 loc4 phone1 phone2 phone3 phone4;
If loc=‘’ then delete;
run;
restloc2
restname .. loc phone
Haandi Falls Church 703-533-3501
Steps to merge the datasets
Similarly create restloc3 and restloc4
Steps to merge the datasets
2. append restloc1, restloc2, restloc3, restloc4 into restloc
restloc2
restname .. loc phone
Haandi Falls Church 703-533-3501
restloc1
restname .. loc phone
Addie’s Rockville 301-881-0081
Andalucia Rockville 301-770-1880
Haandi Bethesda 301-718-0121
restloc
restname .. loc phone
Addie’s Rockville 301-881-0081
Andalucia Rockville 301-770-1880
Haandi Bethesda 301-718-0121
Haandi Falls Church 703-533-3501
data restloc;
set restloc1 restloc2 restloc3 restloc4;
run;
Steps to merge the datasets
3. merge restloc with citydemo by loc
restloc
restname .. loc phone
Addie’s Rockville 301-881-0081
Andalucia Rockville 301-770-1880
Haandi Bethesda 301-718-0121
Haandi Falls Church 703-533-3501
city_demo:
Loc State Pop Income
Bethesda MD 62936 38976
Falls Church VA 9578 26709
Rockville MD 44835 21484
proc sort data=restloc;
by loc; run;
proc sort data=city_demo;
by loc; run;
data merged;
merge restloc (in=one)
city_demo (in=two);
by loc;
if one=1 | two=1;
inone=one; intwo=two;
run;
Questions
• How many restaurants have two locations? how many have three?
• How many restaurants locate in Rockville?• How many restaurants do not have any match
with city demographics?• How many restaurants in cities where per capita
income is at least 30,000?• How many are asian and locate in areas with per
capita income at least 30,000?
mean comparison: two groups* compare the mean of two groups;* specific question: do Asian and non-Asian restaurants charge systematically different prices?;* define a dummy variable asianstyle=1 if the food style is Asian, Chinese, Japanese, Thai, Vietnamese,
Lebanese, or Korean, 0 otherwise;
proc glm data=merged;class asianstyle;model price=asianstyle; /* alternatively, model price=asianstyle/solution; */means asianstyle;run;The F-value tests
H0: regression has no power explaining variation in incomeH1: regression has some power.
In our context, this is equivalent to H0: income is no different across asianstyle or non-asianstyleH1: income is different.
This amounts to a two-tail test of equal mean.
If you use “/solution” at the end of the model command, the output would report regression results. The t-stat of the coefficient for asianstyle or the p-value next to the t-stat could apply to a two-tail test or a one-tail test of equal mean.
mean comparison: two groups
Alternatively, you can use
proc reg data=merged;model price=asianstyle;Run;
The t-stat of the coefficient for asianstyle or the p-value next to the t-stat could apply to a two-tail test or a one-tail test of equal mean.
mean comparison: three or more groups
* compare the mean of four groups – ratings =1,2,3,4;* specific question: do restaurants of different ratings charge
systematically different prices?;proc glm data=merged;class ratings;model price=ratings;means ratings/waller;means ratings/lsd cldiff;run;
notes on mean comparison
1. The logic of mean comparison is the same as in Excel
2. Be careful about one-tail and two-tail tests. The standard SAS output of coefficient t-stat and p-value are based on a two-tail test of H0: coeff=0, but could be used for a one-tail test with appropriate adjustment on p-value. The F-value reported as the overall regression statistics only applies to the right tail of F distribution, because F is bounded to be positive. The F-value can test H0: regression explains no variation in your dependent variable, against H1: regression explains some variation. This is equivalent to a two-tail test of equal mean.
3. Comparison across more than two groupsH0: all groups have the same mean F-test of whole regressionORH0: group x and group y has the same mean waller or lsd
statistics
regression in SAS• specific question: how do restaurant price relate
to ratings, city population and per capita income?
libname n “N:\share\notes”;
proc reg data=n.merged;
model price=ratings population per_capita_income;
run;Equivalent to:
Price=alpha+beta1*ratings+beta2*pop+beta3*per_cap_inc
regression in SAS
* specific question: how do restaurant prices relate to ratings, city population, per capita income, and state specifics?
* note that state itself is a character variable, and we would like to separate MD, VA and DC. This amounts to define three dummy variables denoting each state. This is easy to achieve in SAS;
proc glm data=merged;class state;model price=ratings population per_capita_income
state / solution;run;
Regression: exercise 1
• The coefficient of ratings in the first regression tells us how much price will change by a unit change in ratings.
• However, one may think the change from 1 star to 2 star is different from the change from 2 star to 3 star, or from 3 star to 4 star.
• To capture these differences, we need more than one coefficents for ratings.
proc glm data=merged;class ratings;model price=ratings population per_capita_income
/ solution;run;
Regression: exercise 2• What if we also want to control for states?• SAS does not allow more than one variables in the “class ” command, so we need to define one set
of dummy variables by ourselves;• Below we define 4 dummy variables for each value of ratings. Since the sum of the four variables
is always 1, we need to drop one of them in the regression to avoid their colinearity with the intercept.
data merged;set merged;if ratings=1 then rating_1star=1; else rating_1star=0;if ratings=2 then rating_2star=1; else rating_2star=0;if ratings=3 then rating_3star=1; else rating_3star=0;if ratings=4 then rating_4star=1; else rating_4star=0;
proc glm data=merged;class state;model price=rating_2star rating_3star rating_4star population per_capita_income state / solution;run;
Regression: exercise 3• What if we focus on the average price change by one unit increase of ratings (like in regression 1 on page 59), but we
would like to see whether this price change vary by states?• We need to define ratings*(dummy for MD), ratings*(dummy for VA),ratings*(dummy for DC)
data merged;set merged;if state=“MD”then state_MD=1; else state_MD=0;if state=“VA” then state_VA=1; else state_VA=0;if state=“DC” then state_DC=1; else state_DC=0;ratings_MD=ratings*state_MD;ratings_VA=ratings*state_VA;ratings_DC=ratings*state_DC;run;
proc reg data=merged;model price=ratings_MD ratings_VA ratings_DC population per_capita_income;run;
* Control for states;proc glm data=merged;class state;model price=ratings_MD ratings_VA ratings_DC population per_capita_income state/solution;run;
A comprehensive example
A review of
1. readin data
2. summary statistics
3. mean comparison
4. regression
reg-cityreg.sas in N:\share\notes\
A Case Study of Los Angeles Restaurants
Nov. 16-18, 1997 CBS 2 News “Behind the Kitchen Door”
January 16, 1998, LA county inspectors start issuing hygiene
grade cards
A grade if score of 90 to 100
B grade if score of 80 to 89
C grade if score of 70 to 79
score below 70 actual score shown
Grade cards are prominently displayed
in restaurant windows
Score not shown on grade cards
Take the ideas to data
better Information
better quality
better revenue
regulation
by county
by city
hygiene scores
restaurant revenue calculated by sales tax
Data complications
Unit of analysis:
individual restaurant? city? zipcode? census tract?
Unit of time:
each inspection? per month? per quarter? per year?
Define information:
county regulation? city regulation? the date of passing the regulation? days since passing the regulation? % of days under regulation?
Define quality:
average hygiene score? the number of A restaurants? % of A restaurants?
Define revenue: (only collected by quarter!)
total revenue? average revenue per restaurant?
How to test the idea?
• Regression 1:quality = α+β*information+error
+something else?
• Regression 2:revenue=α+β*quality+γ*information+error
+something else?
What is the something else? year trend, seasonality, quality*information, ....
real test
reg-cityreg.sas in N:\share\notes\
Questions
• How many observations in the sample? – log of the first data step, or output from proc contents
• How many variables in the sample? How many are numerical, how many are characters?– Output from proc contents
• On average, how much tax revenue can we collect from one city in one quarter?– Output from proc means, on sumtax
• How many percentage of restaurants have A quality in a typical city-month?– Output from proc means, on per_A
• What is the maximum tax revenue in a city-quarter? What is the minimum?– Output from proc means, on sumtax
Questions
• What is the difference between cityreg and ctyreg? We know county regulation came earlier than city regulation, is that reflected in our data?– Yes, cityreg<=ctyreg in every observation– We can check this in proc means for cityreg and ctyreg, or add a proc
print to eyeball each obs
• What is the difference between cityreg and citymper? What is the mean of cityreg? What is the mean of citymper? Are they consistent with their definitions?– The unit of cityreg is # of days, so it should be a non-negative integer– The unit of citymper is % of days, so it should be a real number between
0 and 1– To check this, we can add a proc means for cityreg and citymper
Questions
• Economic theories suggest quality be higher after the regulation if regulation gives consumers better information. Is that true in summary statistics? – by summary statistics, I mean a simple mean comparison of
quality by different regulation environments. Which procedure tells us that?
– Yes, the two proc means, class citym_g or ctym_g• How about revenue? Do summary statistics suggest that
better information better revenue?– Yes, the two proc means, class citym_g or ctym_g– The two proc means only show us group-average. For formal
tests, we can use proc glm or proc reg as in our lecture notes on mean comparison (do you know how to do it?)
Questions
Facts reported in summary statistics often reflect many economic factors, not only the one in our mind. Summary statistics are also crude for easy presentation. That is why we explore regressions.
In the quality regression, does more city regulation associate with higher quality?
– is the coefficient of city regulation positive and significantly different from zero? (proc reg)
– is the coefficient of county regulation positive and significantly different from zero? (proc reg)
– Do we omit other sensible explanations for quality changes? What are they? (yes, they are ….)
Questions
• What does “proc means” do on page 3? Why does it allow us to collapse the data of city-months into city-quarters?
• Why do we use lnsumtax as the dependent variable instead of sumtax? – Reg of lnsumtax tells us the effect of regulation on the
percentage change in sumtax
• In the first regression, what does the “cityqper” coefficient mean? what does the “per_A” coefficient mean? What does the “cityq_A” (which equals to cityqper*per_A) coefficient mean?
Questions
• We report four sets of regression results for the revenue regression. How do they differ from each other?
• How does the coefficient of “cityqper” change across these regressions? What does the change suggest?
• Which set of results makes more sense?
Count duplicates (not required)
http://support.sas.com/ctx/samples/index.jsp?sid=67&tab=code
data dups nodups ; set clasdata ; by name class ; /* If the combination of NAME and CLASS is in the data
set once, output NODUPS, else output DUPS. */ if first.class and last.class then output nodups ; else output dups ; run;
Some project 6 ideas from previous students
• Does super-bowl advertising boost stock price?• Does financial aid promote GDP growth?• What are the most important determinants of
house price?• Advice on smart investment: which stock(s)
should John Doe invest in given his preference for return and risk?
• Does a chain store charge different prices in different locations?
Idea I: Does super-bowl advertising boost stock price?
• Theory– advertising in super-bowl is extremely expensive and therefore
increase the cost of the advertiser negative effect on stock price
– advertising at the superbowl signals good finanical status of the advertiser
positive effect on stock price
• Data: – list of superbowl advertisers from major newspapers– list of non-advertisers who are major competitors with the
advertisers in the same industry– New York Stock Exchange
• Timing: right before and after super-bowl (event study)
Idea II: Does financial aid promote GDP growth?
• Theory: – positive effect because aid means help– negative effect because aids tend to go to the most
under-developed countries
• Data– focus on Sub-Sahara Africa– World Development Indicators (WDI 2002) database
from the World Bank– 1990-2000: GDP per capita, aid, FDI, adjusted net
savings, health expenditures, adjult illiteracy rate
Idea III: determinants of house price?
• Theory: – house attributes– location– time of sale
• Data– zillows.org
Idea IV: Smart Investment
• Theory: CAPM
• John Doe’s preference: risk and return
• Sample:– treasury bills– S&P 500 index– six individual stocks– monthly data in the past 60 months
Idea V: Does a chain store charge different prices in different locations?
• Theory: – Yes: A chain store will charge different prices because
different locations have different rent costs and different consumer willingness to pay
– No: same pricing facilitates advertising from the chain headquarter
• Data: – Hand collected price for 10 CVS stores– Choose a basket of goods that represent CVS sales– Land price and demographic data from Census
Top Related