$\begingroup$

I am fitting a quantile GAM (qGAM) with the qgam package in R, modelling the median of my response Y as a function of several covariates.

It is of the form:

library(qgam)
b <- qgam(Y ~ te(X1, X2) + s(X3) + s(X4) + factor1 + factor2, qu = 0.5, data = anydata)

How would this be translated into a mathematical formula? Is this correct?

$$ q_{0.5}(Y_{i}) = f(X1,X2)_{i}+f(X3)_{i}+f(X4)_{i}+factor1_{i,n}+factor2_{i,s}+b_{n0,s0}+e_{i} $$

factor1 represents a factor with n levels and factor2 a factor with s levels, both modelled as fixed effects; b0 is the "reference level" model intercept (both reference groups combined? Do I have to include this anyway?) and e is the error term. i is the data point index. I am very unsure about the indices. Do I have to add group indices to Y, e and the other terms as well, and do I have to combine factor1 and factor2 into one interaction term?

$\endgroup$

1 Answer

$\begingroup$

The way to represent models can vary a lot with the application, and I'm not sure there is any name for it other than "vector" form or "observation-level" form. We need to know whether we are numbering the individuals within groups or not. Numbering within groups only makes sense for a designed study with blocks (one categorical variable) and a treatment factor (a second categorical variable), or some similar formulation, such as a repeated-measurements study.

But here is one way I would write it, together with the rationale behind it:

First, we note that we have an intercept in the formulation. Second, there are two categorical variables, factor1 and factor2, meaning we will need two indexed parameters to represent them, and two indices are needed to correctly identify an observation. Third, we have two measurements at the individual level that are combined via splines.

Let us assume we have numbered the individuals/observations regardless of group, so the index of individuals is $i = 1,...,n$. Also let $j = 1, ..., m$ index the levels of the first categorical variable and $k = 1, ..., l$ index the levels of the second categorical variable. Lastly, let $x_i$ and $z_i$ be the values of covariates $x$ and $z$ for individual $i$. If we initially model the continuous variables linearly, with their interaction taken into account, we can write:

$$ \text{q}_{0.5}(Y_{ijk}) = \alpha + \gamma_{j} + \delta_{k} + \beta_1 x_i + \beta_2 z_i + \beta_3 x_i z_i $$

Where $i = 1,...,n$, $k = 1, ..., l$ and $j = 1, ..., m$. To make the model identifiable, we need to choose one level of each categorical variable to be its reference level, so we set $\gamma_1 = \delta_1 = 0$.
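To see what the reference-level restriction looks like in practice, here is a minimal sketch in base R. The toy data frame and its level names are made up for illustration; the only assumption is R's default treatment contrasts, under which the first level of each factor is absorbed into the intercept.

```r
# Hypothetical toy data: factor1 with m = 3 levels, factor2 with l = 2 levels.
toy <- expand.grid(
  factor1 = factor(c("a", "b", "c")),
  factor2 = factor(c("u", "v"))
)

# With default treatment contrasts, the first level of each factor gets
# no column of its own, which corresponds to gamma_1 = delta_1 = 0.
X <- model.matrix(~ factor1 + factor2, data = toy)

# Columns: (Intercept), factor1b, factor1c, factor2v
print(colnames(X))
```

The design matrix has $1 + (m - 1) + (l - 1) = 4$ columns: the intercept plays the role of $\alpha$, and the remaining columns correspond to $\gamma_2, \gamma_3$ and $\delta_2$.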

If we model the continuous variables as splines, assuming we have only one function shared by all individuals (it doesn't vary or "interact" with any group), we can have the following formulation, as proposed in your question:

$$ \text{q}_{0.5}(Y_{ijk}) = \alpha + \gamma_{j} + \delta_{k} + \text{f}_1(x_i) + \text{f}_2(z_i) + \text{t}(x_i, z_i) $$

We still use the same restrictions $\gamma_1 = \delta_1 = 0$. To add an interaction between the first and second categorical variables, we are essentially adding another parameter representing it, ending up with:

$$ \text{q}_{0.5}(Y_{ijk}) = \alpha + \gamma_{j} + \delta_{k} + \Delta_{jk} + \text{f}_1(x_i) + \text{f}_2(z_i) + \text{t}(x_i, z_i) $$
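As a sketch of how this maps back onto the call in the question, which has four continuous covariates (a tensor smooth `te(X1,X2)` plus `s(X3)` and `s(X4)`) and no factor interaction, one could write the model with a single observation index. The notation $j(i)$ and $k(i)$ for the levels of factor1 and factor2 that observation $i$ belongs to, and the symbols $x_{1i}, \dots, x_{4i}$, are my own assumptions for illustration:

$$ \text{q}_{0.5}(Y_{i}) = \alpha + \gamma_{j(i)} + \delta_{k(i)} + \text{t}(x_{1i}, x_{2i}) + \text{f}_1(x_{3i}) + \text{f}_2(x_{4i}) $$

with $\gamma_1 = \delta_1 = 0$. If the factor interaction is included, the term $\Delta_{j(i)k(i)}$ is added, and it needs restrictions of the same kind, for example $\Delta_{1k} = \Delta_{j1} = 0$ for all $j$ and $k$. Counting free parameters for the categorical part then gives

$$ 1 + (m - 1) + (l - 1) + (m - 1)(l - 1) = ml $$

that is, one parameter per combination of levels, so the interaction model is the same as fitting a separate intercept for each cell.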

$\endgroup$
  • 1
    $\begingroup$ Thanks for that extensive and clear answer! Just one addition: do I have to add an error term, and if so, would it be $+e_i$ or $+e_{ijk}$? $\endgroup$
    – MriRo
    Commented Jan 19, 2021 at 23:43
  • 1
    $\begingroup$ I did forget about the error term! So, it depends on the probabilistic structure of the model. In this case, the error term is $e_i$ since I don't think qgams allow for group-based variance terms. If any of the categorical variables could be modeled as random effects, instead of fixed effects, then we'd have an extra index in the error term. $\endgroup$ Commented Jan 20, 2021 at 12:20
  • 3
    $\begingroup$ In addition to Guilherme's excellent answers, my opinion is that GAMs don't bring enough advantages over flexible parametric models that use restricted cubic splines (natural splines). You can use the parametric approach with any model; in this case, embed it in quantile regression. Note that by using either GAM or regular quantile regression you are sacrificing a good deal of precision and power, so also consider semiparametric ordinal regression models, which can estimate quantiles, mean, and exceedance probabilities all in one, as exemplified in my RMS course notes. $\endgroup$ Commented Jan 20, 2021 at 13:40
