15
$\begingroup$

This is my first post here, but I've benefited a lot from this forum's results popping up in google search results.

I've been teaching myself semi-parametric regression using penalized splines. Basically, $$ \hat{y} = X(X'X + \lambda D)^{-1}X'y $$ where X is full of knots in addition to my original variables, $\lambda$ is a penalty term applied to the knots. (See e.g. Wood 2006 or Ruppert, Wand & Carrol 2003 if you want to learn this stuff).

My data has cluster-wise autocorrelation, and probably heteroskedasticity as well. My training is in econometrics, so naturally I want to represent the variance-covariance matrix as $$ X(X'X + \lambda D)^{-1}\left(\displaystyle\sum_j X'_j \epsilon_j \epsilon_j'X\right)(X'X + \lambda D)^{-1}X' $$ The above assumes independence between clusters but not within them, and doesn't assume homoskedasticity. This should give me correct se's.

My problem is that I don't just care about se's. With semi-parametrics, I need confidence intervals on plots.

EDIT some time later:
I've developed a quasi-brute-force way of doing this by hacking mgcv. The general approach is to calculate my own parameter covariance matrix and stuff it into mgcv's. It is then used by other functions to calculate CI's and give CI's on plots This is very much a work in progress and I'm not confident in it yet, but I'm putting it up in hopes it might be a first step towards something that works, invite feedback, and perhaps eventually be useful to others. Here it is as it stands:

Another EDIT: The function now tries to get the full penalty matrix by inverting the parameter covariance matrix and the subtracting off $X'X$. Weirdly, the parameter covariance matrix supplied by mgcv is often not invertible. I have no idea why this would be the case, after removing auto-dropped redundant dummies. Anybody have any ideas here? I'm using generalized inverses as a hack for the moment, and I'm not sure exactly what sort of distortion that will add to my results.

clusterGAM = function(m,clustvar){
    require(MASS)
    mm = predict(m,type='lpmatrix') #Get model X matrix
    mm = mm[,colnames(mm) %in% names(coefficients(m))[rowSums(m$Vp)!=0]] #Cut out coefficients from model matrix that get auto-dropped due to collinearity
	meat = matrix(rep(0,ncol(mm)*ncol(mm)),ncol(mm)) #Pre-fill a meat matrix
	for (i in 1:length(unique(clustvar))){#This sums over the clusters
		X = mm[clustvar == unique(clustvar)[i],]
		e = residuals(m)[clustvar == unique(clustvar)[i]]
		meat = meat+t(X) %*% e %*% t(e) %*% X
		}
	M <- length(unique(clustvar))
	N <- length(clustvar)
	K <- sum(m$edf) #use EDF instead of K
    dfc <- (M/(M-1))*((N-1)/(N-K))#Degrees of freedom correction
    Vp = vcov(m,dispersion=1)
    Vp = Vp[rowSums(m$Vp)!=0,rowSums(m$Vp)!=0]

    S = tryCatch(solve(Vp) - t(mm)%*%mm, error=function(e) e, finally=NA)
    if(inherits(S, "error")) {
        S = ginv(Vp) - t(mm)%*%mm
        print("Warning:  Generalized inverse used to extract the penalty matrix")
        }   
    vcv = tryCatch(dfc*solve(t(mm)%*%mm+S)%*%meat %*% solve(t(mm)%*%mm+S), error=function(e) e, finally=NA)
    if(inherits(vcv, "error")) {
        vcv = dfc*ginv(t(mm)%*%mm+S)%*%meat %*% ginv(t(mm)%*%mm+S)
        print("Warning:  Generalized inverse used to calculate the Vp matrix")
        } #Get the vcv
    if (length(which(rowSums(m$Vp)==0)!=0)){	#If any zero rows got knocked out, put them back in
		zeros = which(rowSums(m$Vp)==0)
		for (i in 1:length(zeros)){
			vcv = rbind(vcv[0:(zeros[i]-1),],0,vcv[(zeros[i]):nrow(vcv),])
			vcv = cbind(vcv[,0:(zeros[i]-1)],0,vcv[,(zeros[i]):ncol(vcv)])
			}
		}
	m$Vp = vcv #stuff the new vcv into the place where the plots and the summary gets made from
    m
    }

*edited to be able to handle tensors

Some data to run it with

load(broken link)
head(sdat)
N = nrow(sdat)
#set.seed(417)
sdat$x = with(sdat,log(y+1+runif(N)*T)+rexp(N)) #Making a fake smooth covariate
    plot(sdat$x,sdat$y)
    m = gam(y~ID+G:t+T+s(x)-1,data=sdat)#LSDV panel model with time x group FEs and one smooth function
    par(mfrow=c(1,2))
    plot(m,ylim=c(-.75,2))
    summary(m)
    m = clusterGAM(m,sdat$G)
plot(m,ylim=c(-.75,2)) #the confidence intervals shrink a bit.  This seems counterintuitive.  
summary(m) #SE on T gets a little bit bigger, as it probably should if you account for repeated observations

#Now try it with a tensor
sdat$z = with(sdat,y+x+rnorm(N))
m = gam(y~ID+G:t+T+te(x,z)-1,data=sdat)
summary(m)
vis.gam(m,view=c('x','z'),se=2,theta=150,phi=20)
m = clusterGAM(m,sdat$G)
vis.gam(m,view=c('x','z'),se=2,theta=150,phi=20)

It could be broken either by my imperfect understanding of the math involved, or the internal workings of mgcv. As it stands, the standard errors on parametric dummies are inflated in ways that seem reasonable, but confidence intervals on smooth terms seem like they get shrunk, which seems wrong.

Any comments towards improving this would be greatly appreciated.

As a final note: I'm not a stats PhD. I'm a grad student doing applied work, and probably spending far more time than I should letting myself get bothered by functional form assumptions.

$\endgroup$
5
  • $\begingroup$ You state that "there are no residuals over data points where I'd like to make a prediction". Is your ultimate goal to make a predictive model / to use this in the future to make guesses about $y$ when you know $X$? Splines are very flexible, & can be quite useful, but extra caution should be taken if you are going to want to interpolate or (worse) extrapolate when you are using splines. $\endgroup$ Commented Nov 30, 2012 at 20:42
  • $\begingroup$ Use a wild bootstrap. Let me know if you'd like me to elaborate. $\endgroup$ Commented Nov 30, 2012 at 20:57
  • $\begingroup$ gung: no, not a predictive model as such. ultimately I just want to plot the estimated functions and add confidence bands around them. the functions are straightforward to estimate, but my problem is with the confidence intervals. $\endgroup$ Commented Nov 30, 2012 at 21:32
  • $\begingroup$ One problem, as I see it, is that your "plot" varies with all the non-spline(d) variables. You don't really have a curve to plot, as it were, you have many curves depending upon the values of the other variables. Does this make sense? (Not sure if I've understood the issue correctly...) $\endgroup$
    – jbowman
    Commented Dec 1, 2012 at 0:39
  • $\begingroup$ yes, indeed I have many plots. Basically what one does is to take the chunk of the vcv that corresponds to the variable in question to construct the confidence intervals for that variable. That chunk is obviously influenced by other variables during its estimation. The Ruppert et al book does a good job of making this stuff accessible -- I'd recommend it for a more thorough explanation. $\endgroup$ Commented Dec 3, 2012 at 5:15

0