Brusilovskiy_A Brief Introduction to Spatial Regression


  • 7/28/2019 Brusilovskiy_A Brief Introduction to Spatial Regression


    A Brief Introduction to Spatial Regression

    Eugene Brusilovskiy


    Outline

    Review of Correlation

    OLS Regression

    Regression with a non-normal dependent variable

    Spatial Regression


    Correlation

    - Defined as a measure of how much two variables X and Y change together
    - Dimensionless measure: a correlation between two variables is a single number that can range from -1 to 1, with positive values close to 1 indicating a strong direct relationship and negative values close to -1 indicating a strong inverse relationship
      - E.g., a positive correlation between income and years of schooling indicates that more years of schooling would correspond to greater income (i.e., an increase in the years of schooling is associated with an increase in income)
    - A correlation of 0 indicates a lack of linear relationship between the variables
    - Generally denoted by the Greek letter ρ (rho)
    - Pearson correlation: used when the variables are normally distributed
    - Spearman correlation: used when the variables aren't normally distributed
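    As a minimal sketch of the two coefficients, the code below computes Pearson's correlation from its definition and Spearman's correlation as the Pearson correlation of the ranks. The schooling and income figures are made up for illustration, and the function names are our own.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: years of schooling and income (in $1000s).
schooling = [8, 10, 12, 12, 14, 16, 18, 20]
income = [18, 22, 30, 28, 41, 55, 80, 150]

r_pearson = pearson(schooling, income)
r_spearman = spearman(schooling, income)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

    Because income in this toy data grows faster than linearly with schooling, the rank-based Spearman coefficient comes out higher than the Pearson coefficient.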


    Some Remarks

    - In practice, we rarely see perfect positive or negative correlations (i.e., correlations of exactly 1 or -1)
    - Correlations higher than 0.6 (or lower than -0.6) are considered to be strong
    - There might be confounding factors that explain a strong positive or negative correlation between variables
      - E.g., volume of ice cream consumption might be correlated with crime rates. Why? Both tend to be high when the temperatures are warmer!
    - The correlation between two seemingly unrelated variables does not always equal exactly zero (although it will often be close to it)


    Correlation does not imply causation!

    Source: http://imgs.xkcd.com/comics/correlation.png

    Regression

    - A statistical method used to examine the relationship between a variable of interest (dependent variable) and one or more explanatory variables (predictors)
      - Strength of the relationship
      - Direction of the relationship (positive, negative, zero)
      - Goodness of model fit
    - Allows you to calculate the amount by which your dependent variable changes when a predictor variable changes by one unit (holding all other predictors constant)
    - Often referred to as Ordinary Least Squares (OLS) regression
      - Regression with one predictor is called simple regression
      - Regression with two or more predictors is called multiple regression
    - Available in all statistical packages
    - Just like correlation, if an explanatory variable is a significant predictor of the dependent variable, it doesn't imply that the explanatory variable is a cause of the dependent variable


    Example

    - Assume we have data on median income and median house value in 381 Philadelphia census tracts (i.e., our unit of measurement is a tract)
    - Each of the 381 tracts has information on income (call it Y) and on house value (call it X). So, we can create a scatter plot of Y against X.
    - Through this scatter plot, we can calculate the equation of the line that best fits the pattern (recall: Y = mX + b, where m is the slope and b is the y-intercept)
    - This is done by finding a line such that the sum of the squared (vertical) distances between the points and the line is minimized
      - Hence the term "ordinary least squares"
    - Now, we can examine the relationship between these two variables
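    The least-squares idea can be sketched in a few lines. This is an illustration only: the six tracts below are invented (not the 381 Philadelphia tracts), and the helper names are our own. It uses the closed-form slope and intercept for one predictor and checks that nudging the slope makes the sum of squared residuals larger.

```python
def fit_line(x, y):
    """Return (slope, intercept) minimizing the sum of squared vertical distances."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(x, y, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical tract data (both in $1000s).
house_value = [60, 85, 110, 150, 200, 260]
income = [24, 30, 37, 45, 58, 71]

m, b = fit_line(house_value, income)
best = sum_squared_residuals(house_value, income, m, b)
# A slightly different line fits worse in the least-squares sense.
worse = sum_squared_residuals(house_value, income, m + 0.01, b)

print(f"Income = {b:.2f} + {m:.3f} * HouseValue")
print(f"SSR at the fitted line: {best:.3f}; with a nudged slope: {worse:.3f}")
```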


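    $$\text{Income} = \beta_0 + \beta_1 \cdot \text{House Value} + \varepsilon$$

    where
    - $\beta_0$ = Y-intercept
    - $\beta_1$ = slope of the best-fit line
    - $\varepsilon$ = regression residual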


    We can easily extend this to cases with 2+ predictors

    - When we have n>1 predictors, rather than getting a line in 2 dimensions, we get a flat surface (a plane or hyperplane) in n+1 dimensions (the +1 accounts for the dependent variable)
    - Each independent variable will have its own slope coefficient, which will indicate the relationship of that particular predictor with the dependent variable, controlling for all other independent variables in the regression.
    - The coefficient of each predictor may be interpreted as the amount by which the dependent variable changes as the independent variable increases by one unit (holding all other variables constant)
    - The equation of the best fit becomes

    $$\text{Income} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

    where
    - $\beta_0$ = Y-intercept
    - $\beta_1, \dots, \beta_n$ = coefficients of variables 1..n
    - $\varepsilon$ = residuals (should be random noise)


    An Example with 2 Predictors: Income as a function of House Value and Crime

    $$\text{Income} = \beta_0 + \beta_1 \cdot \text{House Value} + \beta_2 \cdot \text{Thefts} + \varepsilon$$
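    A rough sketch of how the three coefficients could be estimated, assuming the usual least-squares normal equations solved by Gaussian elimination. The tract values are invented, and income is generated from known coefficients (10, 0.2, -0.1) with no noise so the fit can be checked against them. Statistical packages do this with more robust numerical methods.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ols(predictors, y):
    """Least-squares coefficients [b0, b1, ..., bn] from the normal equations."""
    rows = [[1.0] + list(p) for p in predictors]  # the leading 1 gives the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical tracts: (house value in $1000s, thefts per year).
tracts = [(60, 40), (85, 35), (110, 30), (150, 22),
          (200, 25), (260, 10), (120, 50), (180, 12)]
# Income built exactly as 10 + 0.2 * value - 0.1 * thefts (no noise).
income = [10 + 0.2 * v - 0.1 * t for v, t in tracts]

b0, b1, b2 = ols(tracts, income)
print(f"Income = {b0:.2f} + {b1:.3f} * HouseValue + {b2:.3f} * Thefts")
```

    Here the coefficient on thefts is negative: holding house value constant, one more theft is associated with lower income in this made-up data.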


    The so-called p-value associated with the variable

    - For any statistical method, including regression, we are testing some hypothesis. In regression, we are testing the null hypothesis that the coefficient (i.e., slope) is equal to zero (i.e., that the explanatory variable is not a significant predictor of the dependent variable).
    - Formally, the p-value is the probability of observing a value of the coefficient as extreme (i.e., as different from 0 as its estimated value is) when in reality it equals zero (i.e., when the null hypothesis holds). If this probability is small enough (generally, p
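    To make the definition concrete, here is a sketch that estimates a p-value for a slope with a permutation test. This is a stand-in chosen because it needs only the standard library: statistical packages normally report a p-value from a t-test instead. Shuffling the dependent variable simulates the null hypothesis of no relationship, and the p-value is the share of shuffles whose slope is at least as far from zero as the observed one. All data and names are hypothetical.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided p-value: how often a shuffled y gives a slope as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(slope(x, shuffled)) >= observed:
            hits += 1
    # Count the observed arrangement itself so the estimate is never exactly 0.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical tracts: house value and income (both in $1000s).
house_value = [60, 85, 110, 150, 200, 260, 120, 180, 95, 230]
income = [24, 30, 37, 45, 58, 71, 35, 52, 33, 66]
# A made-up variable with no clear relationship to house value.
unrelated = [5, 1, 4, 2, 3, 3, 2, 4, 1, 5]

p_related = permutation_p_value(house_value, income)
p_unrelated = permutation_p_value(house_value, unrelated)
print(f"p-value, income vs. house value:    {p_related:.4f}")
print(f"p-value, unrelated vs. house value: {p_unrelated:.4f}")
```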


    Some Basic Regression Diagnostics (Cont'd)

    The sign of the coefficient of the independent variable (i.e., the slope of the regression line)

    - One coefficient per independent variable
    - Indicates whether the relationship between the dependent and independent variables is positive or negative
    - We should look at the sign when the coefficient is statistically significant (i.e., significantly different from zero)


    Some Basic Regression Diagnostics (Cont'd)

    - R-squared (AKA coefficient of determination): the percent of variance in the dependent variable that is explained by the predictors
      - In the single predictor case, R-squared is simply the square of the correlation between the predictor and dependent variable
      - The more independent variables included, the higher the R-squared (it never decreases when a predictor is added)
    - Adjusted R-squared: percent of variance in the dependent variable explained, adjusted by the number of predictors
    - One R-squared for the regression model
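    A small sketch of both quantities, assuming the standard formulas R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The data are invented. With a single predictor, the computed R² should match the squared correlation.

```python
import math

def r_squared(y, fitted):
    """Share of the variance in y explained by the fitted values."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical tract data (both in $1000s).
house_value = [60, 85, 110, 150, 200, 260]
income = [24, 30, 37, 45, 58, 71]

n = len(income)
mean_x = sum(house_value) / n
mean_y = sum(income) / n
sxx = sum((x - mean_x) ** 2 for x in house_value)
syy = sum((y - mean_y) ** 2 for y in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(house_value, income))

# Simple regression of income on house value.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
fitted = [intercept + slope * x for x in house_value]

r2 = r_squared(income, fitted)
adj_r2 = adjusted_r_squared(r2, n_obs=n, n_predictors=1)
correlation = sxy / math.sqrt(sxx * syy)

print(f"R-squared:           {r2:.4f}")
print(f"Correlation squared: {correlation ** 2:.4f}")
print(f"Adjusted R-squared:  {adj_r2:.4f}")
```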


    Some (but not all) regression assumptions

    1. The dependent variable should be normally distributed (i.e., the histogram of the variable should look like a bell curve)
       - Ideally, this will also be true of independent variables, but this is not essential. Independent variables can also be binary (i.e., have two values, such as 1 (yes) and 0 (no))
    2. The predictors should not be strongly correlated with each other (i.e., no multicollinearity)
    3. Very importantly, the observations should be independent of each other. (The same holds for regression residuals.) If this assumption is violated, our coefficient estimates could be wrong!

    General rule of thumb: 10 observations per independent variable


    [Figure: An Example of a Normal Distribution (N=140)]


    Data Transformations

    - Sometimes, it is possible to transform a variable's distribution by subjecting it to some simple algebraic operation.
    - The logarithmic transformation is the most widely used to achieve normality when the variable is positively skewed (as in the image on the left below)
    - Analysis is then performed on the transformed variable.
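    The effect of the log transformation can be illustrated with simulated data. The sketch below draws a positively skewed sample (log-normal, chosen purely for illustration) and compares the sample skewness before and after taking logs. A skewness near zero is consistent with a symmetric, bell-shaped histogram.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical, positively skewed "house values" drawn from a log-normal distribution.
rng = random.Random(42)
house_values = [rng.lognormvariate(11.5, 0.6) for _ in range(1000)]

# The transformed variable is what would go into the regression.
log_values = [math.log(v) for v in house_values]

skew_raw = skewness(house_values)
skew_log = skewness(log_values)
print(f"Skewness before the log transformation: {skew_raw:.2f}")
print(f"Skewness after the log transformation:  {skew_log:.2f}")
```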


    Additional Regression Methods

    - Logistic regression/Probit regression
      - When your dependent variable is binary (i.e., has two possible outcomes).
      - E.g., Employment Indicator (Are you employed? Yes/No)
    - Multinomial logistic regression
      - When your dependent variable is categorical and has more than two categories
      - E.g., Race: Black, Asian, White, Other
    - Ordinal logistic regression
      - When your dependent variable is ordinal and has more than two categories
      - E.g., Education: (1=Less than High School, 2=High School, 3=More than High School)
    - Poisson regression
      - When your dependent variable is a count
      - E.g., Number of traffic violations (0, 1, 2, 3, 4, 5, etc)


    Spatial Autocorrelation

    Recall:

    - There is spatial autocorrelation in a variable if observations that are closer to each other in space have related values (Tobler's Law)
    - One of the regression assumptions is independence of observations. If this doesn't hold, we obtain inaccurate estimates of the coefficients, and the error term contains spatial dependencies (i.e., meaningful information), whereas we want the error to not be distinguishable from random noise.


    Imagine a problem with a spatial component

    This example is obviously a dramatization, but nonetheless, in many spatial problems points which are close together have similar values


    But how do we know if spatial dependencies exist?

    - Moran's I (1950): a rather old and perhaps the most widely used method of testing for spatial autocorrelation, or spatial dependencies
      - We can determine a p-value for Moran's I (i.e., an indicator of whether spatial autocorrelation is statistically significant).
      - For more on Moran's I, see http://en.wikipedia.org/wiki/Moran%27s_I
      - Just like the non-spatial correlation coefficient, it typically ranges from -1 to 1
      - Can be calculated in ArcGIS
    - Other indices of spatial autocorrelation commonly used include:
      - Geary's c (1954)
      - Getis and Ord's G-statistic (1992), for non-negative values only
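    The slides do not spell out the statistic itself. Assuming the usual textbook definition, with $x_i$ the value at location $i$, $\bar{x}$ the mean over all $n$ locations, and $w_{ij}$ a weight describing how closely locations $i$ and $j$ are related (for example, 1 if they are neighbors and 0 otherwise), Moran's I can be written as:

```latex
I = \frac{n}{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij}}
    \cdot
    \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \, (x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

    The numerator is large and positive when neighboring locations tend to lie on the same side of the mean, which is why positive values indicate clustering of similar values.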


    So, when a problem has a spatial component, we should:

    - Run the non-spatial regression
    - Test the regression residuals for spatial autocorrelation, using Moran's I or some other index
    - If no significant spatial autocorrelation exists, STOP. Otherwise, if the spatial dependencies are significant, use a special model which takes spatial dependencies into account.
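    The three steps above can be sketched end to end on simulated data. Everything here is an assumption made for illustration: a 6x6 grid of hypothetical tracts, neighbors defined as cells sharing an edge, Moran's I with binary weights, a permutation test for its p-value, and a 0.05 significance cutoff. Income is generated with a north-south gradient that the regression does not include, so the OLS residuals should come out spatially autocorrelated.

```python
import random

def fit_line(x, y):
    """Simple OLS: return (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rook_neighbors(n_rows, n_cols):
    """Neighbor lists for a grid: cells that share an edge are neighbors."""
    neighbors = []
    for r in range(n_rows):
        for c in range(n_cols):
            cell = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    cell.append(rr * n_cols + cc)
            neighbors.append(cell)
    return neighbors

def morans_i(values, neighbors):
    """Moran's I with binary weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    total_weight = sum(len(nb) for nb in neighbors)
    cross = sum(z[i] * z[j] for i, nb in enumerate(neighbors) for j in nb)
    return (n / total_weight) * cross / sum(d * d for d in z)

def permutation_p_value(values, neighbors, n_permutations=999, seed=0):
    """Share of random relabelings whose Moran's I is at least the observed one."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Hypothetical data: income depends on house value and on an unobserved
# north-south gradient (the row index), which the regression leaves out.
rng = random.Random(1)
n_rows = n_cols = 6
house_value = [rng.uniform(50, 250) for _ in range(n_rows * n_cols)]
income = [20 + 0.2 * hv + 5 * (i // n_cols) + rng.gauss(0, 2)
          for i, hv in enumerate(house_value)]

# Step 1: run the non-spatial regression.
slope, intercept = fit_line(house_value, income)
residuals = [y - (intercept + slope * x) for x, y in zip(house_value, income)]

# Step 2: test the residuals for spatial autocorrelation.
neighbors = rook_neighbors(n_rows, n_cols)
i_stat = morans_i(residuals, neighbors)
p_value = permutation_p_value(residuals, neighbors)

# Step 3: stop, or move on to a spatial model (0.05 is an assumed cutoff).
needs_spatial_model = p_value < 0.05
print(f"Moran's I of OLS residuals: {i_stat:.3f} (p = {p_value:.3f})")
print("Use a spatial model" if needs_spatial_model else "Stop: OLS is adequate")
```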


    Spatial Regression Models

    - A spatial lag (SL) model
      - Assumes that dependencies exist directly among the levels of the dependent variable
        - That is, the income at one location is affected by the income at the nearby locations
      - A lag term, which is a specification of income at nearby locations, is included in the regression, and its coefficient and p-value are interpreted as for the independent variables.
      - As in OLS regression, we can include independent variables in the model.
      - Whereas we will see spatial autocorrelation in OLS residuals, the SL model should account for spatial dependencies and the SL residuals would not be autocorrelated.
        - Hence the SL residuals should not be distinguishable from random noise (i.e., have no consistent patterns or dependencies in them)
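    One common way to write the spatial lag model described here is given below. The notation is an assumption, not taken from the slides: $w_{ij}$ are spatial weights linking locations $i$ and $j$, the sum $\sum_j w_{ij}\,\text{Income}_j$ is the lag term (income at nearby locations), and $\rho$ is its coefficient. Note that $\rho$ here is a regression coefficient, not the correlation coefficient.

```latex
\text{Income}_i = \rho \sum_{j=1}^{N} w_{ij} \, \text{Income}_j
                + \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \dots + \beta_n X_{ni}
                + \varepsilon_i
```

    Setting $\rho = 0$ recovers the ordinary regression equation, so testing whether $\rho$ differs from zero is a test of whether the lag term is needed.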


    OLS Residuals vs. SL Residuals

    [Figure: OLS residuals show non-random patterns and clustering; SL residuals look like random noise]


    But how is spatial proximity defined?

    - For each point (or areal unit), we need to identify its spatial relationship with all the other points (or areal units). This can be done by looking at the (inverse of the) distance between each pair of points, or in a number of other ways:
      - A binary indicator stating whether two points (or census tract centroids) are within a certain distance of each other (1=yes, 0=no)
      - A binary indicator stating whether point A is one of the ___ (1, 5, 10, 15, etc) nearest neighbors of B (1=yes, 0=no)
      - For areal datasets, the proportion of the boundary that zone 1 shares with zone 2, or simply a binary indicator of whether zone 1 and 2 share a border (1=yes, 0=no)
      - Etc, etc, etc
    - When we have n observations, we form an n x n table (called a weight matrix or a link matrix) which summarizes all the pairwise spatial relationships in the dataset
    - These weight matrices are used in the estimation of spatial regression (and the calculation of Moran's I).
    - Unless we have compelling reasons not to do so, it's generally a good idea to see whether our results hold with different types of weight matrices
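    Two of the definitions above, the distance threshold and the nearest-neighbor indicator, can be sketched for a handful of points. The coordinates, the threshold, and the function names are hypothetical. Here `w[i][j] = 1` means that point j counts as a neighbor of point i.

```python
import math

def distance_band_weights(points, threshold):
    """w[i][j] = 1 if points i and j are within `threshold` of each other."""
    n = len(points)
    return [[1 if i != j and math.dist(points[i], points[j]) <= threshold else 0
             for j in range(n)]
            for i in range(n)]

def k_nearest_weights(points, k):
    """w[i][j] = 1 if point j is one of the k nearest neighbors of point i."""
    n = len(points)
    w = [[0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            w[i][j] = 1
    return w

# Hypothetical tract centroids: three close together, two farther away.
centroids = [(0, 0), (1, 0), (0, 1), (5, 4), (6, 5)]

band = distance_band_weights(centroids, threshold=1.5)
knn = k_nearest_weights(centroids, k=2)

for name, matrix in (("Distance band (1.5)", band), ("2 nearest neighbors", knn)):
    print(name)
    for row in matrix:
        print("  ", row)
```

    The distance-band matrix is symmetric, while the nearest-neighbor matrix need not be: point 1 is among the two nearest neighbors of point 3, but not the other way around. This is one reason results can change with the choice of weight matrix.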


    Assume we have a map with 10 Census tracts

    The hypothetical weight matrix below indicates whether any given Census tract shares a boundary with another tract. 1 means yes and 0 means no. For instance, tracts 3 and 6 do share a boundary, as indicated by the 1s in row 3, column 6 and in row 6, column 3.

    Point #   1  2  3  4  5  6  7  8  9 10
    1         0  1  0  0  1  0  0  0  0  1
    2         1  0  0  1  0  1  0  0  0  0
    3         0  0  0  1  1  1  1  0  0  0
    4         0  1  1  0  1  0  1  0  1  0
    5         1  0  1  1  0  1  1  1  0  0
    6         0  1  1  0  1  0  1  0  0  0
    7         0  0  1  1  1  1  0  0  0  1
    8         0  0  0  0  1  0  0  0  1  1
    9         0  0  0  1  0  0  0  1  0  1
    10        1  0  0  0  0  0  1  1  1  0
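    As a sketch of how such a matrix might be used, the code below stores the 10 x 10 matrix from this slide, checks that it is symmetric, and computes a lag term for each tract as the average income of its neighbors. Dividing each row by its sum (row standardization) is a common convention assumed here, not something the slides specify, and the income values are invented.

```python
# Boundary-sharing weight matrix for the 10 hypothetical tracts on this slide.
W = [
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 0, 1, 1, 1, 0],
]

def is_symmetric(w):
    """If tract i borders tract j, then j must border i."""
    n = len(w)
    return all(w[i][j] == w[j][i] for i in range(n) for j in range(n))

def row_standardize(w):
    """Divide each row by its sum so that the weights in every row add up to 1."""
    return [[value / sum(row) for value in row] for row in w]

def spatial_lag(w, values):
    """Weighted sum of the other tracts' values, one entry per tract."""
    return [sum(weight * v for weight, v in zip(row, values)) for row in w]

# Hypothetical median income per tract (in $1000s), tracts 1 through 10.
income = [30, 34, 52, 47, 45, 50, 55, 28, 26, 31]

neighbor_counts = [sum(row) for row in W]
lagged_income = spatial_lag(row_standardize(W), income)

print("Symmetric:", is_symmetric(W))
print("Neighbors per tract:", neighbor_counts)
for tract, (own, lag) in enumerate(zip(income, lagged_income), start=1):
    print(f"Tract {tract:2d}: income {own}, average neighbor income {lag:.1f}")
```

    For tract 1, whose neighbors are tracts 2, 5 and 10, the lag term is the mean of those three incomes.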


    Now, we need a software package that can

    - Run the good old OLS regression model
    - Create a weight matrix
    - Test for spatial autocorrelation in OLS residuals
    - Run a spatial lag model (or some other spatial model)

    Such packages do exist!


    GeoDa

    - A software package developed by Luc Anselin
    - Can be downloaded free of charge (for members of educational and research institutions) at https://www.geoda.uiuc.edu/
    - Has a user-friendly interface
    - Accepts ESRI shapefiles as inputs
    - Is able to perform a number of basic GIS operations in addition to running the sophisticated spatial statistics models


    Other Spatial Regression Models

    - Spatial Error (can be implemented in GeoDa)
    - Geographically-Weighted Regression (can be run in ArcGIS 9.3)
    - These methods also aim to account for spatial dependencies in the data


    Some References: Spatial Regression

    - Bailey, T.C. and Gatrell, A.C. (1995). Interactive Spatial Data Analysis. Addison Wesley Longman, Harlow, Essex.
    - Cressie, N.A.C. (1993). Statistics for Spatial Data (Revised Edition). John Wiley & Sons, Inc.
    - LeSage, J. and Pace, R.K. (2009). Introduction to Spatial Econometrics. CRC Press/Taylor & Francis Group.