Today’s discussion: Calculating MSE for two models. Comparing two models. Interactions (Concurvity) in GAMs. Random effects and GAMMs.
Using ERP Data. There is some preliminary data wrangling that I don’t focus on here, but you can see the R code that takes the original files and merges everything into one data set in datify.R
It is incredibly important that we convert the subject to a factor (even though it is coded as a number). In lme4 notation, we can have (1|subject) as a random effect term even if subject is a numeric. In mgcv, the gam function will not automatically convert a numeric to a factor.
#gramCat is my condition of interest.
#time is the time variable
#We have several electrodes worth of data; CP and Pz are of particular interest
#[1] "time" "FP1" "FP2" "F7" "F3" "Fz" "F4" "F8"
#[9] "FC5" "FC1" "FC2" "FC6" "T7" "C3" "Cz" "C4"
#[17] "T8" "CP5" "CP1" "CP2" "CP6" "P7" "P3" "Pz"
#[25] "P4" "P8" "O1" "Oz" "O2" "LO1" "LO2" "IO1"
#[33] "M2" "VEOG" "HEOG" "LF" "RF" "LP" "RP" "CP"
library(mgcv)    # gam(), concurvity(), gamm()
library(ggplot2) # exploratory smooths
library(visreg)  # model visualization
df$time = as.numeric(df$time)
#Whichever electrodes we are interested in: convert from character to numeric
df$CP = as.numeric(df$CP)
df$Pz = as.numeric(df$Pz)
df$FP2 = as.numeric(df$FP2)
df$F7 = as.numeric(df$F7)
df$gramCat = as.factor(df$gramCat)
df$subject = as.factor(df$subject)
range(as.numeric(df$time))
## [1] -200 995
ggplot(df, aes(x=time, y=CP,color=gramCat)) + stat_smooth()
ggplot(df, aes(x=time, y=Pz,color=gramCat)) + stat_smooth()
ggplot(df, aes(x=time, y=FP2,color=gramCat)) + stat_smooth()
# A generalized additive model with simple interaction
gam1 = gam(CP ~ s(time) + s(time, by=gramCat) + gramCat, data=df)
summary(gam1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + s(time, by = gramCat) + gramCat
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70738 0.04145 41.19 <2e-16 ***
## gramCatUngrammatical 1.15962 0.05862 19.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.2041 8.5761 95.51 < 2e-16 ***
## s(time):gramCatGrammatical 0.6667 0.6667 18.04 0.000527 ***
## s(time):gramCatUngrammatical 8.0315 8.5255 34.47 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 28/29
## R-sq.(adj) = 0.309 Deviance explained = 31.1%
## GCV = 7.8523 Scale est. = 7.836 n = 9120
visreg(gam1,"time")
visreg(gam1,"time","gramCat")
visreg(gam1,"time","gramCat",overlay=TRUE)
Calculating MSE = \(\frac{1}{N} \sum_{i=1}^{N}(y_i-\hat{y}_i)^2\)
mean(resid(gam1)^2)
## [1] 7.81976
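As a sanity check, the same number can be computed directly from the stored response and fitted values (a one-line sketch; gam1$y holds the response values actually used in fitting):
mean((gam1$y - fitted(gam1))^2) # identical to mean(resid(gam1)^2) for a gaussian identity model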
MSE is a general measure of model predictiveness and can be used to assess different models of the same data. It is tempting, when using GAMs, to fit two models and compare either MSE or deviance explained.
gam2 = gam(CP ~ s(time),data=df)
gam3 = gam(CP ~ s(time) + gramCat,data=df)
#MSE
mean(resid(gam2)^2)
## [1] 8.978433
mean(resid(gam3)^2)
## [1] 8.642242
mean(resid(gam1)^2)
## [1] 7.81976
#Deviance Explained for both
summary(gam2)$dev.expl
## [1] 0.2083448
summary(gam3)$dev.expl
## [1] 0.2379878
summary(gam1)$dev.expl
## [1] 0.3105084
We see an increase in deviance explained for the model with grammatical category, and a further increase with the interaction.
summary(gam2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.28719 0.03139 72.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.792 8.987 265.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.208 Deviance explained = 20.8%
## GCV = 8.9977 Scale est. = 8.9881 n = 9120
summary(gam3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + gramCat
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70738 0.04356 39.20 <2e-16 ***
## gramCatUngrammatical 1.15962 0.06160 18.82 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.799 8.988 276.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.237 Deviance explained = 23.8%
## GCV = 8.6627 Scale est. = 8.6525 n = 9120
summary(gam1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + s(time, by = gramCat) + gramCat
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70738 0.04145 41.19 <2e-16 ***
## gramCatUngrammatical 1.15962 0.05862 19.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.2041 8.5761 95.51 < 2e-16 ***
## s(time):gramCatGrammatical 0.6667 0.6667 18.04 0.000527 ***
## s(time):gramCatUngrammatical 8.0315 8.5255 34.47 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 28/29
## R-sq.(adj) = 0.309 Deviance explained = 31.1%
## GCV = 7.8523 Scale est. = 7.836 n = 9120
anova(gam1,gam2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: CP ~ s(time) + s(time, by = gramCat) + gramCat
## Model 2: CP ~ s(time)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 9101.1 71316
## 2 9110.2 81883 -9.1101 -10567 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Wood states in several places online that it is best to use the p-values from the larger model, rather than p-values calculated through model comparison, to assess whether or not a term should be left in the model. The mgcv package has a more updated and optimized implementation than the one discussed in Wood (2006) and in older online forums. In other words, use the p-values from the fully saturated model to decide whether to keep a predictor, not some other model-comparison technique (e.g. anova or AIC). This isn’t what is typically recommended for regression analysis, but stepwise regression techniques do not carry over to GAMs in the same way. Even when googling for information on this topic, pay attention to when a posting was made – Wood has a 2012/2013 paper that updates the model assessment techniques for GAMs, and he makes different recommendations based on on-going research: recent posts > more historical posts.
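For completeness, the model comparison people often reach for is a one-liner (shown only to name the alternative; per the advice above, prefer the summary() p-values from the larger model):
AIC(gam1, gam2, gam3) # information-criterion comparison of the three fits; not the recommended route for GAM term selection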
\(\lambda\), the smoothing parameter, is a number in \([0,\infty)\), where \(\lambda = 0\) fits an unpenalized spline to the data (maximal wiggliness) and \(\lambda \to \infty\) implies a straight line. The penalization process finds the value of \(\lambda\) that doesn’t overfit the data but still captures the most information in your data.
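If you want to see the two extremes for yourself, gam() lets you fix the smoothing parameter by hand via the sp argument (a quick sketch; gam_wiggly and gam_flat are hypothetical names, and 0 and 1e6 are illustrative stand-ins for \(\lambda = 0\) and \(\lambda \to \infty\)):
gam_wiggly = gam(CP ~ s(time), sp=0, data=df)   # unpenalized: maximal wiggliness
gam_flat   = gam(CP ~ s(time), sp=1e6, data=df) # heavy penalty: essentially a straight line
visreg(gam_wiggly, "time")
visreg(gam_flat, "time")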
From Wood (2006:226, Table 5.1): the table of available smoothing bases and their properties (not reproduced here).
This is presented for completeness. You should pick one basis, based on the research question, the data, and the recommendations above. Do not try to optimize the process unless there is indication of misfit or non-convergence. In other words, don’t assume there is one best basis and try to find it by running all possible forms.
Tensor Product: The te() function in mgcv (rather than s()) does this. See the documentation; for linguistic (and psycholinguistic) data it doesn’t seem to be used much at all. It is useful when the interacting covariates are in different units.
For example, te(time,F7) would model \(f_1(time)+f_2(F7)+f_3(time,F7)\) as a single term, while ti(time,F7) would model only the interaction component \(f_3(time,F7)\). Without getting too far into the mathematical weeds here: if you absolutely must keep two covariates on different scales (which here might actually make sense), I would suggest fitting the full model, te(), first, and then proceeding to fit smaller ones if there are convergence errors or the visualization looks problematic.
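As a sketch with the variables above (gamte and gamti are hypothetical names; te() and ti() are the relevant mgcv functions):
# Full tensor product: main effects and interaction bundled into one term
gamte = gam(CP ~ te(time, F7) + gramCat, data=df)
# Decomposed form: separate main-effect smooths plus a pure ti() interaction
gamti = gam(CP ~ s(time) + s(F7) + ti(time, F7) + gramCat, data=df)
summary(gamte)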
The definition of concurvity is first found in Buja et al. (1989) and is essentially the non-linear form of collinearity. For collinearity, we say that two predictors, \(x_1\) and \(x_2\), are collinear when they are linear combinations (or near-linear combinations) of one another (e.g. \(x_1 = 2 + 4x_2\)). This is easily measured by the correlation between \(x_1\) and \(x_2\): the closer the correlation is to one or negative one, the more collinear the two predictors are. This causes a problem for regression analysis because the more correlated two predictors are, the harder it is to disentangle the effects of \(x_1\) from \(x_2\) on some \(y\). Sometimes this is handled by residualization (i.e. regressing \(x_2\) on \(x_1\) and then using the residuals from that regression, rather than \(x_2\) itself, as a predictor in the original model). Recently, this has been questioned as an appropriate method (see York, 2012). The non-linear form of collinearity is concurvity: when the smooths for two predictors are related to each other in non-linear ways.
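For the linear case, the correlation check is one line (note that a near-zero correlation does not rule out concurvity, which is non-linear):
cor(df$time, df$F7) # measures linear association only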
So, how important is concurvity between covariates? See the graphical illustration below.
There is conflicting advice in the literature. On the one hand, Wood (2006:180) states that if the model converges, concurvity isn’t as problematic. Yet the mgcv library provides a measure of concurvity, and its documentation indicates that measures approaching 1.0 are cause for concern. Also, concurvity is to be expected with spatial data.
If your research questions are particularly well suited to discussing 3-D and 4-D non-linear effects of predictors on the dependent variable of interest, then you should feel free to worry about concurvity and calibrate your model to account for it appropriately.
Otherwise, unless there is an issue with convergence that you think could be resolved by including an interaction, adding these interactions increases the likelihood that your model will produce spurious results unless you have a mountain of data. (This is the same issue as with higher-order interactions in regular regression models: unless you have the appropriate coverage or an a priori theoretical reason for doing so, it is best not to include interactions without lots of data, e.g. 10,000 or more observations.)
From the package documentation (which carries the most up-to-date advice on this), the concurvity() function produces a measure of this:
This routine computes three related indices of concurvity, all bounded between 0 and 1, with 0 indicating no problem, and 1 indicating total lack of identifiability. The three indices are all based on the idea that a smooth term, f, in the model can be decomposed into a part, g, that lies entirely in the space of one or more other terms in the model, and a remainder part that is completely within the term’s own space. If g makes up a large part of f then there is a concurvity problem. The indices used are all based on the square of ||g||/||f||, that is the ratio of the squared Euclidean norms of the vectors of f and g evaluated at the observed covariate values.
The three measures are as follows:
worst This is the largest value that the square of ||g||/||f|| could take for any coefficient vector. This is a fairly pessimistic measure, as it looks at the worst case irrespective of data. This is the only measure that is symmetric.
observed This just returns the value of the square of ||g||/||f|| according to the estimated coefficients. This could be a bit over-optimistic about the potential for a problem in some cases.
estimate This is the squared F-norm of the basis for g divided by the F-norm of the basis for f. It is a measure of the extent to which the f basis can be explained by the g basis. It does not suffer from the pessimism or potential for over-optimism of the previous two measures, but is less easy to understand.
Let’s take another measurement: F7. Is there a non-linear relation between F7 and time?
df$F7 = as.numeric(df$F7)
gamb1 = gam(F7~s(time),data=df)
summary(gamb1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## F7 ~ s(time)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.12968 0.02456 45.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.915 8.998 94.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0846 Deviance explained = 8.55%
## GCV = 5.5092 Scale est. = 5.5032 n = 9120
I know that if I include both time and F7 in a model, this should induce concurvity (we just saw that F7 is a non-linear function of time).
gamcc1 = gam(CP ~ s(time) + s(F7) + gramCat,data=df)
visreg(gam1,"time")
visreg(gamcc1,"time")
summary(gamcc1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + s(F7) + gramCat
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70889 0.04126 41.42 <2e-16 ***
## gramCatUngrammatical 1.15659 0.05851 19.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.864 8.994 213.3 <2e-16 ***
## s(F7) 7.428 8.432 130.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.319 Deviance explained = 32%
## GCV = 7.7376 Scale est. = 7.7221 n = 9120
concurvity(gamcc1)
## para s(time) s(F7)
## worst 0.5028127 0.17401591 0.17644007
## observed 0.5028127 0.10844890 0.08925496
## estimate 0.5028127 0.08570264 0.05582867
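For a term-by-term breakdown of which smooths overlap with which, concurvity() also takes full=FALSE and returns one matrix per measure (a brief sketch):
concurvity(gamcc1, full=FALSE)$worst # pairwise 'worst' indices between model terms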
We can incorporate random effects into a GAM: each random-effect term can have an intercept, a slope, and a smooth associated with it.
gamint = gam(CP ~ s(time) + gramCat + s(time, by=gramCat) + s(subject, bs="re"), data=df)
summary(gamint)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + gramCat + s(time, by = gramCat) + s(subject, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70738 0.34879 4.895 9.99e-07 ***
## gramCatUngrammatical 1.15962 0.04984 23.266 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.1934 8.5785 131.67 < 2e-16 ***
## s(time):gramCatGrammatical 0.6667 0.6667 23.64 7.23e-05 ***
## s(time):gramCatUngrammatical 8.1764 8.5741 47.94 < 2e-16 ***
## s(subject) 17.9076 18.0000 193.89 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 47/48
## R-sq.(adj) = 0.501 Deviance explained = 50.3%
## GCV = 5.6869 Scale est. = 5.6639 n = 9120
summary(gamint)$dev.expl
## [1] 0.5026216
summary(gam1)$dev.expl
## [1] 0.3105084
visreg(gamint,"time", by="subject", main="Subject Intercepts")
We can add a random slope for time, controlling for the trend over time within each subject (i.e. the average linear effect of time per subject).
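In lme4 terms, the two bs="re" terms in the model below correspond roughly to uncorrelated random intercepts and random slopes for time, i.e. something like the following (a sketch for comparison only, assuming the lme4 package is installed):
library(lme4)
# (1 | subject): random intercept; (0 + time | subject): random slope, uncorrelated with the intercept
lmer(CP ~ time + gramCat + (1 | subject) + (0 + time | subject), data=df)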
gamint2 = gam(CP ~ s(time) + gramCat + s(time, by=gramCat) + s(subject, bs="re") + s(subject, time, bs="re"), data=df)
summary(gamint2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + gramCat + s(time, by = gramCat) + s(subject, bs = "re") +
## s(subject, time, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.70738 0.24324 7.019 2.39e-12 ***
## gramCatUngrammatical 1.15962 0.04834 23.990 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 8.2129 8.5854 139.70 < 2e-16 ***
## s(time):gramCatGrammatical 0.6667 0.6667 24.84 4.75e-05 ***
## s(time):gramCatUngrammatical 8.2029 8.5829 50.81 < 2e-16 ***
## s(subject) 17.4314 18.0000 3192.74 < 2e-16 ***
## s(subject,time) 17.2556 18.0000 2884.63 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 66/67
## R-sq.(adj) = 0.53 Deviance explained = 53.3%
## GCV = 5.3589 Scale est. = 5.3273 n = 9120
summary(gamint2)$dev.expl
## [1] 0.5330451
summary(gam1)$dev.expl
## [1] 0.3105084
visreg(gamint2,"time", by="subject",main="Subject Slopes")
### Random Effect Smooths
This one is not too difficult to understand once you understand slopes and intercepts: you are controlling for the fact that each individual person has a separate smooth (non-linear function) of a continuous predictor. The factor-smooth basis (bs="fs") in mgcv sets up an interaction between the covariate (here time) and each level of subject, using a shared smoothing parameter (rather than estimating a separate smoothing parameter for each subject).
gamint3 = gam(CP ~ s(time) + gramCat + s(time, by=gramCat) + s(subject, time, bs="fs", m=1), data=df)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
## repeated 1-d smooths of same variable.
summary(gamint3)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## CP ~ s(time) + gramCat + s(time, by = gramCat) + s(subject, time,
## bs = "fs", m = 1)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.838e+03 1.034e+03 -1.778 0.0754 .
## gramCatUngrammatical 1.160e+00 3.868e-02 29.983 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(time) 7.878 7.941 103.256 < 2e-16 ***
## s(time):gramCatGrammatical 7.825 8.058 5.340 8.57e-07 ***
## s(time):gramCatUngrammatical 1.235 1.254 0.706 0.424
## s(subject,time) 166.903 93.000 128.521 0.784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 199/200
## R-sq.(adj) = 0.699 Deviance explained = 70.5%
## GCV = 3.4815 Scale est. = 3.4106 n = 9120
summary(gamint3)$dev.expl
## [1] 0.7054082
summary(gam1)$dev.expl
## [1] 0.3105084
visreg(gamint3,"time", by="subject",main="Subject Smooths")
NB: There is another function, gamm(), that fits generalized additive mixed-effects models. I find that gam() with bs="re" fits better and has fewer convergence issues. Basically, with gamm() you can’t specify random smooths (that I’m aware of), and it fits the random effects separately via the nlme package.
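For comparison, a gamm() version of the random-intercept model might look like the sketch below (gamm1 is a hypothetical name; gamm() returns separate $gam and $lme components):
gamm1 = gamm(CP ~ s(time) + gramCat + s(time, by=gramCat), random=list(subject=~1), data=df)
summary(gamm1$gam) # the smooth/fixed-effects part
summary(gamm1$lme) # the nlme random-effects part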
For the assignment, you can focus on one of two things: the random-effects structure or concurvity (depending on your research question of interest).
Submit a .pdf of the Rmd output. Email to jroy042@illinois.edu with a Subject line: “Nonlinear Assignment 3”. Please submit by Friday, October 14 at noon.
Buja, A., Hastie, T., & Tibshirani, R. (1989). Linear smoothers and additive models. The Annals of Statistics, 453-510.
York, R. (2012). Residualization is not the answer: Rethinking how to address multicollinearity. Social Science Research, 41(6), 1379-1386.
Wood, S. (2006). Generalized additive models: An introduction with R. CRC Press.