library(data.table)
library(foreign)
library(rgdal)
library(spdep)
library(splm)
library(tmap)
load("../tmp/resilience.RData")
# Display options
library(pander)
panderOptions("big.mark", ",")
panderOptions("round", 2)
panderOptions("missing", ".")
panderOptions("table.split.table", 100)
# Helper
"%||%" <- function(a, b) if (!is.null(a)) a else b
# Helper - Combine results from list of `splm` objects
splm.combine <- function(x) data.table(
model=names(x),
do.call(rbind, lapply(x, coef)),
do.call(rbind, lapply(x, `[[`, "arcoef")),
phi=as.numeric(do.call(rbind, lapply(x, function(x) try(x$errcomp[["phi"]])))),
psi=as.numeric(do.call(rbind, lapply(x, function(x) try(x$errcomp[["psi"]])))),
rho=as.numeric(do.call(rbind, lapply(x, function(x) try(x$errcomp[["rho"]])))))
# Helper - AIC function for `spml`, show goodness of fit measure
# at https://stat.ethz.ch/pipermail/r-sig-geo/2016-February/024077.html
godf.spml<-function(object, k=2, criterion=c("AIC", "BIC"), ...) {
s <- summary(object)
l <- s$logLik[1,1]
np <- length(coef(s))
N <- nrow(s$model)
if (criterion=="AIC") {
aic <- -2*l+k*np
names(aic) <-"AIC"
return(aic)
}
if (criterion=="BIC") {
bic <- -2*l+log(N)*np
names(bic) <-"BIC"
if (k!=2) {
warning("parameter <k> not used for BIC")
}
return(bic)
}
}
All biophysical variables were extracted by Mel/Tim (e.g. code at https://github.com/IFPRI/sda-rdata/blob/master/R/data_2016.R#L870) and Beliyou constructed long-term and seasonal summary variables. For the spatial panel analysis we use R package splm
developed by Millo and Piras, 20091. splm
is documented at https://www.r-project.org/conferences/useR-2009/slides/Millo+Piras.pdf and http://facweb.knowlton.ohio-state.edu/pviton/courses/crp87105/millo-piras.pdf. Beliyou also suggested to try a “pooled” spatial regression combining weights at district and household levels.
We need to test for spatial auto-correlation in the sample and test which of the SPEI, drought, temperature, travel time, elevation, and/or rainfall shock variable(s) to include in the model.
splm
methods expect spatial panels in a particular format. Also note that splm
commands for spatial panels expect a balanced dataset (same number of observations over time). Spatial locations should also be stable over time, though might be feasible to define time-dependent weight matrices \(W_t\).
A good discussion of spatial weights and using Rook/Queen contiguity vs. distance (distance thresholds, k-nearest neighbor, or other) approach is in Chapter 9 Bivand 20082.
“The first step is to define which relationships between observations are to be given a non-zero weight; that is to choose the neighbor criterion to be used; the second is to assign weights to the identified neighbor links.[…] analysing areal data is crucially dependent on the choices made in constructing the spatial weights.”
Contiguity weights are usually used with administrative units (e.g. districts), but other definitions are possible. Alternate options can be tested econometrically. From World Bank 20083 “Rook and Queen Contiguity spatial weighting often leads to a very unbalanced structure. Larger units can have more neighbors and small units a smaller number of neighbors. The solution is to set a unique number of neighbors for all areas by creating a k-nearest neighbor weighting matrix. When geo-referenced coordinates are available, the spatial weights can be derived from the distance between different points. Euclidean distance weighting fixes a specified distance and then counts the number of neighbors that fall within that distance.”
On choosing a weight matrix, see also Elhorst, 20144:
“As an alternative to row-normalization, W might be normalized such that the elements of each column sum to one. This type of normalization is sometimes used in the social economics literature (Leenders 2002). Note that the row elements of a spatial weights matrix display the impact on a particular unit by all other units, while the column elements of a spatial weights matrix display the impact of a particular unit on all other units. Consequently, row normalization has the effect that the impact on each unit by all other units is equalized, while column normalization has the effect that the impact of each unit on all other units is equalized.”
“If we know little about the assumed spatial process, we try to avoid moving far from the binary representation (Bavaud, 1998).”
Fom Elhorst, 20145:
“When specifying interaction between spatial units, the model may contain a spatially lagged dependent variable or a spatial autoregressive process in the error term, known as the spatial lag and the spatial error model, respectively.”
“The spatial lag model posits that the dependent variable depends on the dependent variable observed in neighboring units and on a set of observed local characteristics. According to Anselin et al. (2006, p. 6), the spatial lag model is typically considered as the formal specification for the equilibrium outcome of a spatial or social interaction process, in which the value of the dependent variable for one agent is jointly determined with that of the neighboring agents. In the empirical literature on strategic interaction among local governments, for example, the spatial lag model is theoretically consistent with the situation where taxation and expenditures on public services interact with taxation and expenditures on public services in nearby jurisdictions (Brueckner 2003).”
“The spatial error model, on the other hand, posits that the dependent variable depends on a set of observed local characteristics and that the error terms are correlated across space According to Anselin et al. (2006, p. 7), a spatial error specification does not require a theoretical model for a spatial or social interaction process, but, instead, is a special case of a nonspherical error covariance matrix. In the empirical literature on strategic interaction among local governments, the spatial error model is consistent with a situation where determinants of taxation or expenditures on public services omitted from the model are spatially autocorrelated, and with a situation where unobserved shocks follow a spatial pattern. A spatially autocorrelated error term may also be interpreted to reflect a mechanism to correct rent-seeking politicians for unanticipated fiscal policy changes (Allers and Elhorst 2005).”
“The spatial specific effects may be treated as fixed effects or as random effects. In the fixed effects model, a dummy variable is introduced for each spatial unit, while in the random effects model, \(mu\) is treated as a random variable that is independently and identically distributed.”
[…] A related problem of controlling for spatial fixed effects is that any variable that does not change over time or only varies a little cannot be estimated, because it is wiped out by the demeaning transformation. This is the main reason for many studies not controlling for spatial fixed effects.
[…] A compromise solution to the all or nothing way of utilizing the crosssectional component of the data is the random effects model. This model avoids the loss of degrees of freedom incurred in the fixed effects model associated with a relatively large N and the problem that the coefficients of time-invariant variables cannot be estimated. However, whether the random effects model is an appropriate specification in spatial research remains controversial.
The random effects model can be tested against the fixed effects model using Hausman’s specification test (Baltagi 2005, pp. 66-68) available in command sphtest
.
To start with let’s try to replicate and understand Millo’s example.
# Look at the sample data provided in `splm`. In that example a proximity matrix is
# constructed considering all the farms of the same village as neighbours. One can
# expect both village-level heterogeneity and spatial correlation between farms
# belonging to the same village. Spatial dependence is easier to justify for the error
# terms, due to spillovers across neighbouring farms in idiosyncratic factors and
# climate conditions; more difficult to find reasons for the inclusion of a spatial lag
# of the dependent variable, as it seems unrealistic for the outcome in one farm to
# influence those of neighbours.
data(RiceFarms, riceww)
RiceFarms <- data.table(RiceFarms)
RiceFarms[, .N, keyby=time]
# 171 obs in each period
dim(riceww)
# [1] 171 171
RiceFarms[, .N, keyby=.(region, time)]
# => stable # of obs, balanced dataset is used here
# The full model formula
fm <- log(goutput) ~
log(seed) + log(urea) + phosphate +
log(totlabor) + log(size) + I(pesticide > 0) + I(varieties=="high") +
I(varieties=="mixed") + as.factor(region) + I(as.numeric(time) %in% c(1,3,5))
# Make sure we can reproduce table 4. in Millo 2013
errors <- c("semsrre", "sem2srre", "semre", "sem2re", "semsr", "srre", "sem", "re", "sr", "ols")
mod <- lapply(errors, function(x) spreml(fm, data=RiceFarms, w=riceww, errors=x, lag=T))
names(mod) <- errors
# Print results
tmp <- splm.combine(mod)
Error in x$errcomp[["phi"]] : subscript out of bounds
Error in x$errcomp[["phi"]] : subscript out of bounds
Error in x$errcomp[["phi"]] : subscript out of bounds
Error in x$errcomp[["psi"]] : subscript out of bounds
Error in x$errcomp[["psi"]] : subscript out of bounds
Error in x$errcomp[["psi"]] : subscript out of bounds
Error in x$errcomp[["psi"]] : subscript out of bounds
Error in x$errcomp[["rho"]] : subscript out of bounds
Error in x$errcomp[["rho"]] : subscript out of bounds
Error in x$errcomp[["rho"]] : subscript out of bounds
setnames(tmp, c(2:10), c("Y", "seed", "urea", "tsp", "lab", "size", "pest", "high", "mixed"))
pander(tmp[, .SD, .SDcols=c(1:10, 17:20)])
-----------------------------------------------------------------------------------------------
model Y seed urea tsp lab size pest high mixed lambda phi psi rho
-------- ---- ------ ------ ----- ----- ------ ------ ------ ------- -------- ----- ----- -----
semsrre 4.03 0.12 0.13 0 0.23 0.51 -0.01 0.12 0.1 0.17 0.17 0.09 0.67
sem2srre 4 0.12 0.13 0 0.23 0.51 -0.01 0.12 0.1 0.18 0.16 0.09 0.66
semre 4.03 0.12 0.13 0 0.23 0.51 -0.01 0.12 0.1 0.17 0.2 . 0.67
sem2re 3.98 0.12 0.13 0 0.23 0.51 -0.01 0.12 0.1 0.18 0.19 . 0.67
semsr 4.04 0.12 0.13 0 0.23 0.51 -0.01 0.13 0.1 0.17 . 0.2 0.65
srre 2.59 0.12 0.12 0 0.21 0.5 0.02 0.12 0.1 0.4 0.11 0.11 .
sem 4.01 0.12 0.14 0 0.22 0.51 -0.01 0.13 0.09 0.17 . . 0.64
re 2.55 0.12 0.13 0 0.2 0.5 0.03 0.11 0.09 0.4 0.15 . .
sr 2.66 0.12 0.13 0 0.21 0.5 0.01 0.13 0.1 0.38 . 0.18 .
ols 2.65 0.12 0.14 0 0.2 0.51 0.02 0.12 0.08 0.39 0.17 0.09 0.67
-----------------------------------------------------------------------------------------------
Interpretation:
# Load Beliyou's Hhld variables (800MB, 2670 vars!)
hh <- read.dta("./tmp/Combined_4_Mel.12.dta")
# Keep STATA var labels
hh.lbl <- data.table(varCode=names(hh), varLabel=attr(hh, "var.labels"))
hh <- data.table(hh)
# Load spatial features from biophysical workspace
load("../../hc-data/out/2016.09/svyL2Maps_r16.09.RData")
# Keep only the datasets and spatial features we need
rm(list=ls()[!ls() %in% c("hh", "hh.lbl", "gps", "g2", "g2.lbl", "gps.pts", "iso3", "svy")])
# For the panel regressions, we limit obs to
panels <- list(TZA=c("NPS09", "NPS11", "NPS13"), UGA=c("NPS10", "NPS11", "NPS12"))
# Also list vars we want to include in the regression
# STATA: xi:xtreg ${outcome`j'} ${cont`t'`i'`c'`v't`t'} $headcont $hhcont $wealth $bio $round $region
# if $filter /*[pweight=weight_]*/, `p' vce(r)
models <- list(
Y=c("pcexp_pppimp", "pcfoodexp_pppimp"), # take the ln()
X=list(
headcont=c("femhead", "agehead"),
hhcont=c("hhsize", "hhsizesq", "educave"),
wealth=c("landown", "tlu_total", "agwealth_paran", "nonagwealth_paran"), # "electricity"
bio=c("TT20k_hours", "elevation", "lgp"), # "far", "mean_popden2000""
health=c("malariain")
)
)
panels.hh <- hh[(ISO3=="TZA" & survey %in% panels$TZA) | (ISO3=="UGA" & survey %in% panels$UGA)]
rm(hh)
# Tally obs across panels
tmp <- panels.hh[, .N, keyby=.(ISO3, region=regionname, panel=survey)]
tmp <- tmp[, lapply(.SD, paste, collapse=", "), .SDcols=c("N", "panel"), keyby=.(ISO3, region)]
pander(tmp, caption="Obs. across panels")
----------------------------------------------------------------
ISO3 region N panel
------ ----------------------- ------------- -------------------
TZA Arusha 104, 122, 149 NPS09, NPS11, NPS13
TZA Dar es salaam 555, 624, 742 NPS09, NPS11, NPS13
TZA Dodoma 88, 109, 138 NPS09, NPS11, NPS13
TZA Iringa 128, 137, 156 NPS09, NPS11, NPS13
TZA KASKAZINI PEMBA 80, 90, 103 NPS09, NPS11, NPS13
TZA KASKAZINI UNGUJA 72, 77, 89 NPS09, NPS11, NPS13
TZA KUSINI PEMBA 88, 97, 106 NPS09, NPS11, NPS13
TZA KUSINI UNGUJA 32, 47, 50 NPS09, NPS11, NPS13
TZA Kagera 120, 157, 216 NPS09, NPS11, NPS13
TZA Kigoma 104, 130, 172 NPS09, NPS11, NPS13
TZA Kilimanjaro 104, 114, 130 NPS09, NPS11, NPS13
TZA Lindi 150, 180, 218 NPS09, NPS11, NPS13
TZA MJINI/MAGHARIBI UNGUJA 207, 222, 241 NPS09, NPS11, NPS13
TZA Manyara 80, 83, 94 NPS09, NPS11, NPS13
TZA Mara 56, 64, 85 NPS09, NPS11, NPS13
TZA Mbeya 152, 173, 211 NPS09, NPS11, NPS13
TZA Morogoro 112, 135, 188 NPS09, NPS11, NPS13
TZA Mtwara 200, 237, 319 NPS09, NPS11, NPS13
TZA Mwanza 128, 175, 304 NPS09, NPS11, NPS13
TZA Pwani 64, 85, 117 NPS09, NPS11, NPS13
TZA Rukwa 88, 100, 110 NPS09, NPS11, NPS13
TZA Ruvuma 136, 158, 191 NPS09, NPS11, NPS13
TZA Shinyanga 136, 194, 277 NPS09, NPS11, NPS13
TZA Singida 56, 66, 87 NPS09, NPS11, NPS13
TZA Tabora 112, 144, 227 NPS09, NPS11, NPS13
TZA Tanga 112, 124, 163 NPS09, NPS11, NPS13
UGA Kampala 232, 179, 186 NPS10, NPS11, NPS12
UGA Central without Kampala 685, 636, 653 NPS10, NPS11, NPS12
UGA Eastern 677, 633, 645 NPS10, NPS11, NPS12
UGA Northern 703, 678, 732 NPS10, NPS11, NPS12
UGA Western 626, 507, 600 NPS10, NPS11, NPS12
----------------------------------------------------------------
Table: Obs. across panels
# TODO Append X,Y coordinates
# Verify unique records first
setkey(panels.hh, ISO3, survey, round, cluster, hhid)
panels.hh[duplicated(panels.hh), unique(hhid)]
# Balance panels (we need stable count of obs. across survey rounds)
pander(panels.hh[balanced==1, .N, keyby=.(ISO3, survey, svyCode)], caption="Obs. across panels, balanced")
# Summarize and graph/map regressor vars, check for missing data
# Try to merge features and attributes across panels
# Using SPEI, UDEL, and CHIRPS, map skewness of distribution
# Add drought median duration
# Map bio shock variables
# Moran I test on all regressors and outcome variables
# Collect results
The LSMS-ISA panels provide GPS coords for all households. Intuitively we could try a few different approaches:
\(W_1\) seems more common in the socio-economic literature. \(W_3\) is more straightforward but Euclidian distances could be misleading. \(W_2\) requires more work (e.g. can we use Google Maps Distance Matrix API to compute realistic travel times between all household locations?
# Generate NxN travel time matrix using Google Distance API
# API quotas
# entries per day: 2,500
# requests per 100 seconds: 10,000
# requests per 100 seconds per user: Unlimited
# results are in seconds
# Hit the API survey by survey, start with most recent panel
i <- svy[3]
tmp <- data.table(gps.pts@data[gps.pts$svyCode==i,])
# Init matrix
panels.tt <- matrix(as.numeric(NA), nrow=nrow(tmp), ncol=nrow(tmp))
dimnames(panels.tt) <- c(tmp$hhid, tmp$hhid)
# Helper - Google maps distance matrix API
gmapsapi <- function(o, d) {
require(httr)
url <- "https://maps.googleapis.com/maps/api/distancematrix/json"
out <- GET(url, query=list(origins=o, destinations=d, mode="driving", key=api_key))
out <- jsonlite::fromJSON(content(out, as="text"))
return(out$rows$elements[[1]]$duration$value)
}
# Hit the API and collect responses
for (i in 1:nrow(tmp)[1]) {
N <- seq(i, nrow(tmp), 200)
for (j in 1:length(N)) {
f <- N[j]
t <- N[j+1]-1
panels.tt[i, f:t] <- gmapsapi(
tmp[i, paste(Y_mod, X_mod, sep=",")],
tmp[f:t, paste(Y_mod, X_mod, sep=",", collapse="|")])
}
}
#=> works but returns quite a few NAs
# Map results
l <- data.table(tmp[1, X_mod], tmp[1, Y_mod], tmp[-1, X_mod], tmp[-1,Y_mod])
l <- lapply(1:nrow(l), function(i) list(L=Line(matrix(unlist(l[i,]), ncol=2, byrow=T)), i=i))
l <- SpatialLines(lapply(l, function(E) Lines(list(E$L),as.character(E$i))),
proj4string=CRS("+init=epsg:4326"))
l <- SpatialLinesDataFrame(l, data.frame(time=panels.tt[1,-1]/(60*60)))
tm_shape(World) + tm_polygons() +
tm_shape(l[1:2297,], is.master=T) +
tm_lines(col="time", colorNA="white", lwd=.4, alpha=.4, title.col="travel (hours)") +
tm_style_grey()
Google API works but returns a lot of missing values between pairs of locations, seems difficult to avoid. Another approach is to use an arbitrary distance to define neighbors (e.g. 50km).
# Generate spatial neighbour list
nb1 <- knn2nb(knearneigh(coords, k=50), row.names=paste(g2.nb$ISO3, g2.nb$rn, sep="."))
summary(nb1)
# Inverse distance matrix
w1 <- lapply(nb1, function(x) 1/(x/1000))
w1 <- nb2listw(nb1, glist=w1, style="B")
summary(w1$weights)
summary(sapply(w1$weights, sum))
# Generate spatial neighbour list
nb2 <- poly2nb(g2.nb, row.names=paste(g2.nb$ISO3, g2.nb$rn, sep="."))
summary(nb2)
# Distance matrix
w2 <- nb2mat(nb2, style="W", zero.policy=T)
In short:
spfml
command estimates both fixed effects spatial lag and error models (with different methods to calculate the determinants, default eigen
) and different effects:
pooled
constant term onlyspfe
cross-sectional specific effectstpfe
time-period specific effectssptpfe
include both spatial and temporal fixed effectsspfml
returns residuals, a table of estimated coefficients (rho
is the coeff of the spatially lagged dependent variable). Fixed effects can be extracted using effects(res)
that returns the type of effects with significance levels and the constant term.
spreml
command is used to generate spatial random effects with several error options:
semsrre
full (most general) modelsemsr
serially and spatially correlated disturbances, no random effectssrre
serial correlation and random effectssemre
exclude serial correlationre
traditional random effects modelsr
panel regression with serially correlated errorssem
pooled model with spatially autocorrelated residualsCommand bsktest
is used to test for random effects and spatial error correlation (e.g. Baltagi, Song and Koh SLM1 marginal test).
Command bsjktest
is used to test for joint, marginal and conditional tests for random effects, serial and spatial error correlation (takes a model formula as input).
# Subsample to "rural != 0 & agri_hh != 0"
# Start iterating across model specfications
# Collect results
# Save snapshot
rm(tmp, RiceFarms, riceww, x, i, j)
save.image("./tmp/resilience.RData")
Millo, G and Piras, G. (2009) “splm: Spatial Panel data models in R”, Journal of Statistical Software, Vol. VV, Issue II. Online at http://facweb.knowlton.ohio-state.edu/pviton/courses/crp87105/millo-piras.pdf↩
Bivand, R.S., Pebesma, E.j., and Gomez-Rubio, V. (2008) “Applied Spatial Analysis with R”. Springer. Online at http://gis.humboldt.edu/OLM/r/Spatial%20Analysis%20With%20R.pdf↩
Elbers, C., Lanjouw, P. and Leite, P.G. (2008) “Brazil within Brazil: Testing the Poverty Map Methodology in Minas Gerais”, World Bank Policy Research Working Paper #4513. Online at http://documents.worldbank.org/curated/en/941401468231893568/pdf/wps4513.pdf↩
Elhorst, J.P. (2014) Spatial Panel Data Models. Chapter C.2 in “Spatial Econometrics from Cross-Sectional Data to Spatial Panels” Spinger, 2014. University of Groningen. On-line at http://regroningen.nl/elhorst/doc/Spatial%20Panel%20Data%20Models.pdf↩
Elhorst, J.P. (2014) Spatial Panel Data Models. Chapter C.2 in “Spatial Econometrics from Cross-Sectional Data to Spatial Panels” Spinger, 2014. University of Groningen. On-line at http://regroningen.nl/elhorst/doc/Spatial%20Panel%20Data%20Models.pdf↩