This is my first post here, but I've benefited a lot from this forum's answers popping up in Google search results.
I've been teaching myself semi-parametric regression using penalized splines. Basically, $$ \hat{y} = X(X'X + \lambda D)^{-1}X'y $$ where $X$ contains spline basis terms built from knots alongside my original variables, $\lambda$ is a penalty applied to the spline coefficients, and $D$ is the penalty matrix. (See e.g. Wood 2006 or Ruppert, Wand & Carroll 2003 if you want to learn this stuff.)
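To pin down the algebra, here is a minimal numpy sketch of that estimator (language-neutral illustration, not mgcv's machinery: the truncated-line basis, hand-placed knots, and fixed $\lambda$ are all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake smooth data
n = 200
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

# Design matrix: intercept, x, plus truncated-line terms at hand-placed
# knots (illustrative basis; not the one mgcv builds)
knots = np.linspace(0.1, 0.9, 9)
X = np.column_stack([np.ones(n), x, np.maximum(x[:, None] - knots, 0)])

# Penalty matrix D: zero on the unpenalized intercept/slope, identity on
# the knot terms; lambda is fixed here rather than estimated
D = np.diag([0.0, 0.0] + [1.0] * len(knots))
lam = 1.0

# beta_hat = (X'X + lambda*D)^{-1} X'y, hence y_hat = X beta_hat
beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
y_hat = X @ beta
```

The penalty shrinks the knot coefficients toward zero, which is what keeps the fit from interpolating the noise.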
My data has cluster-wise autocorrelation, and probably heteroskedasticity as well. My training is in econometrics, so naturally I want to represent the variance-covariance matrix of the fitted values as $$ X(X'X + \lambda D)^{-1}\left(\displaystyle\sum_j X'_j \epsilon_j \epsilon_j'X_j\right)(X'X + \lambda D)^{-1}X' $$ where $j$ indexes clusters. The above assumes independence between clusters but not within them, and doesn't assume homoskedasticity. This should give me correct SEs.
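The cluster-robust "meat" in the middle is just a sum of outer products of within-cluster score contributions $X_j'\epsilon_j$. A minimal numpy sketch with made-up data (the identity penalty matrix and fixed $\lambda$ here are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: n observations, k columns, assigned to clusters
n, k, n_clusters = 150, 4, 10
X = rng.normal(size=(n, k))
resid = rng.normal(size=n)
cluster = rng.integers(0, n_clusters, size=n)

# "Bread": (X'X + lambda*D)^{-1}; the identity penalty D is an assumption
lam, D = 0.5, np.eye(k)
bread = np.linalg.inv(X.T @ X + lam * D)

# "Meat": sum over clusters j of X_j' e_j e_j' X_j -- independence across
# clusters, arbitrary correlation and heteroskedasticity within them
meat = np.zeros((k, k))
for j in np.unique(cluster):
    Xj = X[cluster == j]
    ej = resid[cluster == j]
    s = Xj.T @ ej            # X_j' e_j, a k-vector
    meat += np.outer(s, s)

V_beta = bread @ meat @ bread          # sandwich covariance of beta_hat
se_beta = np.sqrt(np.diag(V_beta))     # cluster-robust SEs
```

Sandwiching the meat between two copies of the penalized bread is what carries the penalization into the robust covariance.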
My problem is that I don't just care about SEs. With semi-parametrics, I need confidence intervals on the plots of the smooth terms.
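For what it's worth, once you have a coefficient covariance matrix $V_\beta$ (from whatever sandwich you trust), pointwise bands follow from the quadratic form $x_i' V_\beta x_i$ at each plotted point. A sketch using a stand-in PSD matrix in place of a real estimate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed inputs: a design matrix X for the points being plotted, fitted
# coefficients beta, and some coefficient covariance V_beta (here a
# stand-in PSD matrix, not a real estimate)
n, k = 100, 5
X = rng.normal(size=(n, k))
beta = rng.normal(size=k)
A = rng.normal(size=(k, k))
V_beta = A @ A.T                 # symmetric PSD stand-in

fit = X @ beta
# Pointwise SE at each plotted point: sqrt(x_i' V_beta x_i)
se = np.sqrt(np.einsum('ij,jk,ik->i', X, V_beta, X))
lower, upper = fit - 2 * se, fit + 2 * se   # ~95% pointwise band
```

So the whole problem reduces to getting the right $V_\beta$ into whatever draws the plots.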
EDIT some time later:
I've developed a quasi-brute-force way of doing this by hacking mgcv. The general approach is to calculate my own parameter covariance matrix and stuff it into mgcv's, where it is then used by other functions to calculate CIs and draw them on plots. This is very much a work in progress and I'm not confident in it yet, but I'm putting it up in the hope that it might be a first step towards something that works, invite feedback, and perhaps eventually be useful to others. Here it is as it stands:
Another EDIT: The function now tries to recover the full penalty matrix by inverting the parameter covariance matrix and then subtracting off $X'X$. Weirdly, the parameter covariance matrix supplied by mgcv is often not invertible, even after removing the auto-dropped redundant dummies, and I have no idea why. Anybody have any ideas here? I'm using generalized inverses as a hack for the moment, and I'm not sure exactly what sort of distortion that adds to my results.
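On the distortion question: the Moore-Penrose pseudoinverse (which is what MASS::ginv computes) inverts the matrix on its row space and maps the null-space directions to zero, so any variance concentrated in those directions silently disappears. A small numpy illustration with an exactly rank-deficient matrix standing in for a non-invertible Vp:

```python
import numpy as np

# An exactly rank-deficient symmetric matrix (rank 2 in 3x3), standing in
# for a non-invertible parameter covariance matrix
v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([0.0, 1.0, 1.0])
V = np.outer(v1, v1) + np.outer(v2, v2)
assert np.linalg.matrix_rank(V) == 2     # solve()/inv() would fail here

# Moore-Penrose pseudoinverse: what MASS::ginv computes. It inverts V on
# its row space and zeroes out the null-space directions.
Vg = np.linalg.pinv(V)
```

The pseudoinverse satisfies $V V^+ V = V$ and $V^+ V V^+ = V^+$, but $V^+ V$ is only a projection, not the identity, which is exactly where the distortion can creep in.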
clusterGAM = function(m, clustvar){
  require(MASS) # for ginv()
  mm = predict(m, type = 'lpmatrix') # Get the model X matrix
  # Cut out coefficients that get auto-dropped due to collinearity
  keep = rowSums(m$Vp) != 0
  mm = mm[, colnames(mm) %in% names(coefficients(m))[keep]]
  meat = matrix(0, ncol(mm), ncol(mm)) # Pre-fill the meat matrix
  for (g in unique(clustvar)){ # Sum X_j' e_j e_j' X_j over clusters
    X = mm[clustvar == g, , drop = FALSE]
    e = residuals(m)[clustvar == g]
    meat = meat + t(X) %*% e %*% t(e) %*% X
  }
  M = length(unique(clustvar))
  N = length(clustvar)
  K = sum(m$edf) # Use EDF instead of the raw parameter count
  dfc = (M/(M-1)) * ((N-1)/(N-K)) # Degrees-of-freedom correction
  Vp = vcov(m, dispersion = 1)
  Vp = Vp[keep, keep]
  # Recover the penalty matrix: S = Vp^{-1} - X'X
  S = tryCatch(solve(Vp) - t(mm) %*% mm, error = function(e) e)
  if (inherits(S, "error")){
    S = ginv(Vp) - t(mm) %*% mm
    warning("Generalized inverse used to extract the penalty matrix")
  }
  # Sandwich: bread %*% meat %*% bread, with bread = (X'X + S)^{-1}
  vcv = tryCatch(dfc * solve(t(mm) %*% mm + S) %*% meat %*% solve(t(mm) %*% mm + S),
                 error = function(e) e)
  if (inherits(vcv, "error")){
    vcv = dfc * ginv(t(mm) %*% mm + S) %*% meat %*% ginv(t(mm) %*% mm + S)
    warning("Generalized inverse used to calculate the Vp matrix")
  }
  if (any(rowSums(m$Vp) == 0)){ # If any zero rows got knocked out, put them back
    zeros = which(rowSums(m$Vp) == 0)
    for (z in zeros){
      vcv = rbind(vcv[0:(z-1), , drop = FALSE], 0, vcv[z:nrow(vcv), , drop = FALSE])
      vcv = cbind(vcv[, 0:(z-1), drop = FALSE], 0, vcv[, z:ncol(vcv), drop = FALSE])
    }
  }
  m$Vp = vcv # Stuff the new vcv where plot() and summary() will find it
  m
}
*Edited to handle tensor product smooths.
Some data to run it with:
load(broken link)
head(sdat)
N = nrow(sdat)
#set.seed(417)
sdat$x = with(sdat,log(y+1+runif(N)*T)+rexp(N)) #Making a fake smooth covariate
plot(sdat$x,sdat$y)
m = gam(y~ID+G:t+T+s(x)-1,data=sdat)#LSDV panel model with time x group FEs and one smooth function
par(mfrow=c(1,2))
plot(m,ylim=c(-.75,2))
summary(m)
m = clusterGAM(m,sdat$G)
plot(m,ylim=c(-.75,2)) #the confidence intervals shrink a bit. This seems counterintuitive.
summary(m) #SE on T gets a little bit bigger, as it probably should if you account for repeated observations
#Now try it with a tensor
sdat$z = with(sdat,y+x+rnorm(N))
m = gam(y~ID+G:t+T+te(x,z)-1,data=sdat)
summary(m)
vis.gam(m,view=c('x','z'),se=2,theta=150,phi=20)
m = clusterGAM(m,sdat$G)
vis.gam(m,view=c('x','z'),se=2,theta=150,phi=20)
It could be broken either by my imperfect understanding of the math involved, or by the internal workings of mgcv. As it stands, the standard errors on the parametric dummies are inflated in ways that seem reasonable, but the confidence intervals on the smooth terms seem to shrink, which seems wrong.
Any comments towards improving this would be greatly appreciated.
As a final note: I'm not a stats PhD. I'm a grad student doing applied work, and probably spending far more time than I should letting myself get bothered by functional form assumptions.