N ational T ransfer A ccounts Data and Estimation Issues May 15, 2009 Sang-Hyop Lee University of...

NNational ational TTransfer ransfer AAccountsccounts

Data Data and Estimation and Estimation IssuesIssues

May 15, 2009May 15, 2009

Sang-Hyop LeeSang-Hyop Lee

University of Hawaii at ManoaUniversity of Hawaii at Manoa

National Transfer Accounts

DataData Sets for Statistical Sets for Statistical AnalysisAnalysis► Cross sectionCross section► Time seriesTime series► Cross section time series; useful for analysisCross section time series; useful for analysis of of

aggregate variables (over time or by cohort).aggregate variables (over time or by cohort).► Panel (longitudinal)Panel (longitudinal)

Repeated cross-section design: most commonRepeated cross-section design: most common Rotating panel design (Cote d’Ivore 1985 data)Rotating panel design (Cote d’Ivore 1985 data) Supplemental cross-section design (Kenya & Supplemental cross-section design (Kenya &

Tanzania 1982/83 data, Malaysia and Indonesia LS)Tanzania 1982/83 data, Malaysia and Indonesia LS)► Cross section with retrospective informationCross section with retrospective information► Micro vs. MacroMicro vs. Macro


Quality of Survey DataQuality of Survey Data

►Constructing NTA requires individual or Constructing NTA requires individual or household micro survey data sets.household micro survey data sets.

►A good survey data set has the A good survey data set has the properties ofproperties of Extent (richnessExtent (richness): it has the ): it has the variablesvariables of of

interest at a certain level of detail.interest at a certain level of detail. ReliabilityReliability: the variables are : the variables are measured measured

without errorwithout error.. ValidityValidity: the data set is : the data set is representativerepresentative..


Data Problem (An example)Data Problem (An example)►FIES (64,433 household with FIES (64,433 household with

233,225 individuals)233,225 individuals) Measured for only urban areaMeasured for only urban areas s

(Valid?)(Valid?) No single person householdNo single person households (Valid?)s (Valid?) No information oNo information onn income for family income for family

owned businessowned business (Rich?) (Rich?) Measured for Measured for up to 8 household up to 8 household

membersmembers: (Reliable? Valid?): (Reliable? Valid?)


Extent (Richness):Extent (Richness):Missing/Change of VariablesMissing/Change of Variables

►Missing variablesMissing variables Not measured or Not measured or measured for a measured for a

certain groupcertain group Labor portion of self-employLabor portion of self-employmentment

incomeincome►Change of variables over timeChange of variables over time

Institutional/policy changeInstitutional/policy change New consumption items, new jobs, etcNew consumption items, new jobs, etc

►Change of survey Change of survey instrument/collapsinginstrument/collapsing


Reliability: Reliability: Measurement Measurement ErrorError

► ResponseResponse/reporting/reporting error error RRespondents do not know what is requiredespondents do not know what is required IIncentive to understatencentive to understate/overstate/overstate RRecall biasecall bias: related with period of survey: related with period of survey Using wrong/different reporting unitsUsing wrong/different reporting units HHeapingeaping OOutliersutliers

► Coding errorCoding error, top coding, missing observation, top coding, missing observation► Overestimate/Overestimate/uunderestimatenderestimate

Parents do not reportParents do not report/register/register their children until the their children until the children have namechildren have name

Detect by checking survival rate of single ageDetect by checking survival rate of single agess► Discrepancy between aggregate and Discrepancy between aggregate and sum of sum of

individual valueindividual valuess


Validity: Validity: CensorCensoringing

►Selection based on characteristicsSelection based on characteristics►Censoring Censoring due to the time of surveydue to the time of survey

Duration of unemploymentDuration of unemployment (left and right (left and right censoring)censoring)

Completed years of schoolingCompleted years of schooling

►Attrition (Attrition (ppanel data)anel data)


Categorical/Qualitative Categorical/Qualitative VariablesVariables

►Converting categorical to single Converting categorical to single continuous variablescontinuous variables Grouped by age (population, public Grouped by age (population, public

education consumption)education consumption) Income category (FPL)Income category (FPL)

► Inconsistency over timeInconsistency over time►Categorical Categorical continuous, and vice continuous, and vice

versaversa


Units, Real vs. NominalUnits, Real vs. Nominal► Be cBe careful about the reporting unitareful about the reporting unit

MeasurMeasurementement units units Reporting period units (reference period, Reporting period units (reference period,

seasonal fluctuation, recall bias) seasonal fluctuation, recall bias)

► Nominal vs. RealNominal vs. Real Aggregation across itemsAggregation across items Quality change (e.g. computer)Quality change (e.g. computer) Where inflation is a substantial problem Where inflation is a substantial problem


Solution for Solution for Missing VariablesMissing Variables► Missing is not usually zero!!Missing is not usually zero!!► Ignore itIgnore it;; random non-response random non-response► Give upGive up;; find other source of data find other source of data sets sets► ImputeImpute; “missing does not mean zero”.; “missing does not mean zero”.

Based on their characteristics or mean valueBased on their characteristics or mean value Based on the value of other peer groupBased on the value of other peer group Modified zero order regressions (y on x)Modified zero order regressions (y on x)

- Create dummy variable for missing Create dummy variable for missing variables of x (z)variables of x (z)

- Replace missing variable with 0 (x’)Replace missing variable with 0 (x’)- Regress y on x’ and z, rather than y on xRegress y on x’ and z, rather than y on x


Households vs. IndividualsHouseholds vs. Individuals

► In NTA, estimates such as consumption and In NTA, estimates such as consumption and labor income should belabor income should be individual level, while individual level, while inter-household transfer is household level.inter-household transfer is household level.

► But a lot of data are gathered from householdBut a lot of data are gathered from household Allocating household consumption Allocating household consumption

(income) to individual household members (income) to individual household members is a critical part of estimation is a critical part of estimation

Adjusting using aggregate (Adjusting using aggregate (mmacro) controlacro) control


Headship (Thailand, 1996)Headship (Thailand, 1996)

0

10

20

30

40

50

60

70

80

0 10 20 30 40 50 60 70 80 90+

Age

Hea

dsh

ip R

ate

(per

cen

t)

Economic Head

Self-reported Head

`


MeasuringMeasuring Consumption Consumption

► Underestimation: e.g. British FESUnderestimation: e.g. British FES Using aggregate control mitigates the Using aggregate control mitigates the

problem.problem.

► Home produced items: both income and Home produced items: both income and consumption.consumption.

► Allocation across individuals is difficult Allocation across individuals is difficult ► Estimating Estimating some some profilesprofiles, such as health , such as health

expenditure is also difficult,expenditure is also difficult, part partlyly due due to various sourceto various sourcess of financing. of financing.


Measuring IncomeMeasuring Income

► ““All of the difficulties of measuring All of the difficulties of measuring consumption apply with greater force to the consumption apply with greater force to the measurement of income” (Deaton, p. 29).measurement of income” (Deaton, p. 29). Need detailed information on “transactions” Need detailed information on “transactions”

(inflow and outflow): an enormous task(inflow and outflow): an enormous task Incentive to understate: using aggregate control Incentive to understate: using aggregate control

mitigate the problem.mitigate the problem. Some surveys did not attempt to collect Some surveys did not attempt to collect

information on asset incomeinformation on asset income

► Allocating self-employment income across Allocating self-employment income across individuals is difficult.individuals is difficult.


Data Data CleaningCleaning

► Case by caseCase by case► Find out what data sets are available and Find out what data sets are available and

choose the best onechoose the best one► Detect outliers and examine them Detect outliers and examine them

carefullycarefully► A serious examination is required when A serious examination is required when

inflation matters to check whether actual inflation matters to check whether actual estimation process generateestimation process generatess a variable a variable

► Make variables consistentMake variables consistent► Convert categorical variableConvert categorical variabless to to

continuous variablecontinuous variabless, etc., etc.


Weighting Weighting and Clusteringand Clustering

► Weight should be used in the summary of Weight should be used in the summary of variables/direct variables/direct tabulation/regression/smoothing.tabulation/regression/smoothing.

► Frequency Weights; Frequency Weights; ““fwfw”” indicate replicated indicate replicated data. The weight data. The weight tells us tells us how many how many observations each observation really observations each observation really represents.represents.. tab edu [w=wgt] . tab edu [w=wgt] tab edu [fw=wgt] tab edu [fw=wgt]

► Analytic Weights; Analytic Weights; ““awaw”” are inversely are inversely proportional to the variance of an proportional to the variance of an observation. It is appropriate when you are observation. It is appropriate when you are dealing with data containing averages.dealing with data containing averages.. su edu [w=wgt] . su edu [w=wgt] su edu [aw=wgt] su edu [aw=wgt]. reg wage edu [w=wgt] . reg wage edu [w=wgt] reg wage edu reg wage edu [aw=wgt][aw=wgt]


Weighting and ClusteringWeighting and Clustering (cont’d)(cont’d)►Probability Weights; Probability Weights; ““pwpw”” are the are the

sample weight which is the inverse of sample weight which is the inverse of the probability that this observation was the probability that this observation was sampled.sampled.

. reg wage edu [pw=wgt] . reg wage edu [pw=wgt] reg wage reg wage edu [(a)w=wgt], robustedu [(a)w=wgt], robust

. reg wage edu [pw=wgt], cluster(hhid) . reg wage edu [pw=wgt], cluster(hhid)

reg wage edu [(a)w=wgt], reg wage edu [(a)w=wgt], cluster(hhid)cluster(hhid)


SmoothingSmoothing►Shows the pattern more clearly by Shows the pattern more clearly by

reducing sampling variancereducing sampling variance►Type of smoothingType of smoothing

““lowess” smoothing (Stata) does notlowess” smoothing (Stata) does not allow the incorporation of allow the incorporation of weightweight. . Expanding the data is Expanding the data is computationally burdensome.computationally burdensome.

Friedman’s super smoothing (R) does.Friedman’s super smoothing (R) does.


Rule of smoothingRule of smoothing► Basic components should be smoothed, but not aggregations. For Basic components should be smoothed, but not aggregations. For

example, private health consumption and public health consumption example, private health consumption and public health consumption profiles should be smoothed, but the sum of the two should not be profiles should be smoothed, but the sum of the two should not be smoothed.smoothed.

► The objective is to reduce sampling variance but not eliminate what The objective is to reduce sampling variance but not eliminate what may be “real” features of the data. may be “real” features of the data. Avoid too much smoothing (e.g.Avoid too much smoothing (e.g.,, health health consumption for old ages consumption for old ages)). Use . Use

right bandwidth (span)right bandwidth (span) Public health spending may increase dramatically when individuals reach Public health spending may increase dramatically when individuals reach

an age threshold, e.g., 65. This kind of feature of the data should not be an age threshold, e.g., 65. This kind of feature of the data should not be smoothed away.smoothed away.

Due to unusual high health consumption by newborns, we tend not to Due to unusual high health consumption by newborns, we tend not to smooth health consumption by age 0. This could be done by including smooth health consumption by age 0. This could be done by including estimated unsmoothed health consumption by newborns to the age profile estimated unsmoothed health consumption by newborns to the age profile of smoothed private health consumption by other age groups.of smoothed private health consumption by other age groups.

Only adults (usually ages 15 and older) receive income, pay income taxes Only adults (usually ages 15 and older) receive income, pay income taxes and make familial transfer outflows. Thus, when we smooth these age and make familial transfer outflows. Thus, when we smooth these age profiles, we begin smoothing from the adults, excluding those younger age profiles, we begin smoothing from the adults, excluding those younger age group who do not earn income.group who do not earn income.

However, problem arises when some beginning age group may appear to However, problem arises when some beginning age group may appear to have negative values for these variables. This could be solved by replacing have negative values for these variables. This could be solved by replacing the negative by the unsmoothed values for the beginning age group.the negative by the unsmoothed values for the beginning age group.


Specific commentsSpecific comments

► nta<- read.csv(“sample.csv",header=T) nta<- read.csv(“sample.csv",header=T) *Read in data. Work name is nta, data name *Read in data. Work name is nta, data name is sample.csvis sample.csv

► t1<- supsmu(nta$age,nta$yl,nta$sample, span=0.1)t1<- supsmu(nta$age,nta$yl,nta$sample, span=0.1)► t2<- supsmu(nta$age,nta$yl,nta$sample, span=0.2)t2<- supsmu(nta$age,nta$yl,nta$sample, span=0.2)► t3<- supsmu(nta$age,nta$yl,nta$sample, t3<- supsmu(nta$age,nta$yl,nta$sample,

span=“cv”)span=“cv”)

*Smooth data. Work name is t1 t2 t3.*Smooth data. Work name is t1 t2 t3.► write.csv(c(t1, t2, t3), “result.csv") write.csv(c(t1, t2, t3), “result.csv")

*Write out data using name “result" *Write out data using name “result"


DiscussionDiscussion► Data type/quality varies across countries.Data type/quality varies across countries.► Estimation method could vary across Estimation method could vary across

countries depending on data.countries depending on data.► However,However, some standard some standard procedureprocedure could could

be applied.be applied. Definition Definition EEstimation using weight stimation using weight

SmoothingSmoothing using weight using weight Macro Macro control control Present your work!Present your work!

If some component varIf some component variesies substantially substantially by age, then it is estimateby age, then it is estimatedd separately separately

(education, health, etc)(education, health, etc)

NNational ational TTransfer ransfer AAccountsccounts

The EndThe End

N ational T ransfer A ccounts Data and Estimation Issues May 15, 2009 Sang-Hyop Lee University of...

Documents

Transcript of N ational T ransfer A ccounts Data and Estimation Issues May 15, 2009 Sang-Hyop Lee University of...