Week 3

Today’s discussion: Calculating MSE for two models. Comparing two models. Interactions (Concurvity) in GAMs. Random effects and GAMMs.

Using ERP data. There is some preliminary data wrangling that I don't focus on here, but you can see the R code that takes the original files and merges everything into one data set in datify.R

It is incredibly important that we convert subject to a factor (even though it is coded as a number). In lme4 notation, we can have (1|subject) as a random-effect term even if subject is numeric; in mgcv, the gam function will not automatically convert a numeric to a factor.

#gramCat is my condition of interest. 
#time is the time course variable (in ms)
#We have several electrodes worth of data, e.g. CP, Pz:
#[1] "time"    "FP1"     "FP2"     "F7"      "F3"      "Fz"      "F4"      "F8"     
#[9] "FC5"     "FC1"     "FC2"     "FC6"     "T7"      "C3"      "Cz"      "C4"     
#[17] "T8"      "CP5"     "CP1"     "CP2"     "CP6"     "P7"      "P3"      "Pz"     
#[25] "P4"      "P8"      "O1"      "Oz"      "O2"      "LO1"     "LO2"     "IO1"    
#[33] "M2"      "VEOG"    "HEOG"    "LF"      "RF"      "LP"      "RP"      "CP"     

library(mgcv)    # gam(), concurvity()
library(ggplot2) # stat_smooth() plots
library(visreg)  # visualizing model fits

df$time = as.numeric(df$time)

#Convert whichever electrodes we are interested in from chr to numeric
df$CP = as.numeric(df$CP)
df$Pz = as.numeric(df$Pz)
df$FP2 = as.numeric(df$FP2)
df$F7 = as.numeric(df$F7)
df$gramCat = as.factor(df$gramCat)
df$subject = as.factor(df$subject)

range(as.numeric(df$time))
## [1] -200  995
ggplot(df, aes(x=time, y=CP, color=gramCat)) + stat_smooth()

ggplot(df, aes(x=time, y=Pz, color=gramCat)) + stat_smooth()

ggplot(df, aes(x=time, y=FP2, color=gramCat)) + stat_smooth()

# A generalized additive model with simple interaction
gam1 = gam(CP ~ s(time) + s(time, by=gramCat) + gramCat, data=df)

summary(gam1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + s(time, by = gramCat) + gramCat
## 
## Parametric coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.70738    0.04145   41.19   <2e-16 ***
## gramCatUngrammatical  1.15962    0.05862   19.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf Ref.df     F  p-value    
## s(time)                      8.2041 8.5761 95.51  < 2e-16 ***
## s(time):gramCatGrammatical   0.6667 0.6667 18.04 0.000527 ***
## s(time):gramCatUngrammatical 8.0315 8.5255 34.47  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 28/29
## R-sq.(adj) =  0.309   Deviance explained = 31.1%
## GCV = 7.8523  Scale est. = 7.836     n = 9120
visreg(gam1,"time")

visreg(gam1,"time","gramCat")

visreg(gam1,"time","gramCat",overlay=TRUE)

Model Testing and Comparison

Calculating MSE = \(\frac{1}{N} \sum_{i=1}^N (y_i-\hat{y}_i)^2\)

mean(resid(gam1)^2)
## [1] 7.81976

MSE is used as a general measure of model predictiveness and can be used to compare different models of the same data. It is tempting when using GAMs to fit two models and compare either MSE or deviance explained.

gam2 = gam(CP ~ s(time), data=df)
gam3 = gam(CP ~ s(time) + gramCat, data=df)

#MSE

mean(resid(gam2)^2)
## [1] 8.978433
mean(resid(gam3)^2)
## [1] 8.642242
mean(resid(gam1)^2)
## [1] 7.81976
#Deviance Explained for both
summary(gam2)$dev.expl 
## [1] 0.2083448
summary(gam3)$dev.expl 
## [1] 0.2379878
summary(gam1)$dev.expl
## [1] 0.3105084

We see an increase in deviance explained for the model with grammatical category, and a larger one for the model with the interaction.
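Note that both the MSEs and the deviance explained above are computed on the training data, which always favors the more flexible model. As a minimal sketch of a fairer check (the 80/20 split and seed here are illustrative, not part of the original analysis), you could hold out data and compute out-of-sample MSE:

#Illustrative: held-out MSE for the interaction model
set.seed(1)                                          #arbitrary seed, for reproducibility
train_idx = sample(nrow(df), floor(0.8 * nrow(df)))  #80/20 split
train = df[train_idx, ]
test  = df[-train_idx, ]

fit = gam(CP ~ s(time) + s(time, by=gramCat) + gramCat, data=train)
mean((test$CP - predict(fit, newdata=test))^2)       #out-of-sample MSE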

summary(gam2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.28719    0.03139   72.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 8.792  8.987 265.7  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.208   Deviance explained = 20.8%
## GCV = 8.9977  Scale est. = 8.9881    n = 9120
summary(gam3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + gramCat
## 
## Parametric coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.70738    0.04356   39.20   <2e-16 ***
## gramCatUngrammatical  1.15962    0.06160   18.82   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 8.799  8.988 276.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.237   Deviance explained = 23.8%
## GCV = 8.6627  Scale est. = 8.6525    n = 9120
summary(gam1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + s(time, by = gramCat) + gramCat
## 
## Parametric coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.70738    0.04145   41.19   <2e-16 ***
## gramCatUngrammatical  1.15962    0.05862   19.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                 edf Ref.df     F  p-value    
## s(time)                      8.2041 8.5761 95.51  < 2e-16 ***
## s(time):gramCatGrammatical   0.6667 0.6667 18.04 0.000527 ***
## s(time):gramCatUngrammatical 8.0315 8.5255 34.47  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 28/29
## R-sq.(adj) =  0.309   Deviance explained = 31.1%
## GCV = 7.8523  Scale est. = 7.836     n = 9120
anova(gam1,gam2, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: CP ~ s(time) + s(time, by = gramCat) + gramCat
## Model 2: CP ~ s(time)
##   Resid. Df Resid. Dev      Df Deviance  Pr(>Chi)    
## 1    9101.1      71316                               
## 2    9110.2      81883 -9.1101   -10567 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Wood states in several places online that it is best to use the p-values from the larger model, rather than p-values calculated through model comparison, to assess whether a term should be left in the model. The mgcv package has a more updated and optimized p-value computation than the one discussed in Wood (2006) and in older online forums. In other words, use the p-values from the fully saturated model to decide whether to keep a predictor, not some other model-fitting technique (e.g. anova or AIC). This isn't what is typically recommended for regression analysis, but stepwise-regression techniques do not carry over to GAMs in the same way. Even when googling for information on this topic, pay attention to when a posting was made: Wood has a 2012/2013 paper that updates the model-assessment techniques for GAMs, and he makes different recommendations based on ongoing research. Recent posts > more historical posts.
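In practice, that means reading the approximate p-values straight off the saturated model's summary; the smooth-term table can be pulled out directly:

#Smooth-term table (edf, Ref.df, F, p-value) from the saturated model
summary(gam1)$s.table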

Types of Smooth

\(\lambda\), the smoothing parameter, is a number in \([0,\infty)\), where \(\lambda = 0\) fits an unpenalized spline to the data (maximal wiggliness) and \(\lambda \rightarrow \infty\) implies a straight line. The penalization process finds the value of \(\lambda\) that doesn't overfit the data but still captures the most information in your data.
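To see the two extremes for yourself, you can fix the smoothing parameter via the sp argument rather than letting gam() estimate it (the values below are purely illustrative):

#Illustrative: fixing lambda at the extremes vs. estimating it
gam_wiggly = gam(CP ~ s(time), data=df, sp=0)    #lambda = 0: unpenalized, maximally wiggly
gam_smooth = gam(CP ~ s(time), data=df, sp=1e10) #huge lambda: effectively a straight line
gam_auto   = gam(CP ~ s(time), data=df)          #lambda chosen by GCV (the default)
gam_auto$sp                                      #the estimated smoothing parameter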

From Wood (2006:226, Table 5.1); a short sketch of how the basis is specified follows the list.

  1. bs="tp" Thin plate regression spline (TPRS). Advantages: can smooth with respect to any number of covariates; invariant to rotation of covariate axes; can select penalty order; no knots, and some optimality properties. Disadvantages: computationally costly for large data sets; not invariant to covariate rescaling.
  2. bs="ts" TPRS with shrinkage. Advantages: as TPRS, but shrinkage can zero the term completely. Disadvantages: as TPRS.
  3. bs="cr" Cubic regression spline (CRS). Advantages: computationally cheap; directly interpretable parameters (JR: but seriously, don't interpret the parameters). Disadvantages: can only smooth with respect to 1 covariate; knot based; doesn't have TPRS optimality.
  4. bs="cs" CRS with shrinkage. Advantages: as CRS, but shrinkage can zero the term completely. Disadvantages: as CRS.
  5. bs="cc" Cyclic CRS. Advantages: as CRS, but the start point matches the end point. Disadvantages: as CRS.
  6. bs="ps" P-splines. Advantages: any combination of basis and penalty order available; perform well in tensor products. Disadvantages: based on equally spaced knots; penalties awkward to interpret; no optimality properties available.
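For reference, the basis for each term is chosen through the bs argument to s(); a minimal sketch:

#Illustrative: specifying the basis for a smooth via bs=
gam_tp = gam(CP ~ s(time, bs="tp") + gramCat, data=df) #thin plate (the default)
gam_cr = gam(CP ~ s(time, bs="cr") + gramCat, data=df) #cubic regression spline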

This is presented for completeness. You should pick one basis, based on the research question, the data, and the recommendations above. Do not try to optimize the process unless there is an indication of misfit or non-convergence. In other words, don't assume there is one best basis and try to find it by running all possible forms.

Tensor products: the te() function in mgcv (rather than s()) does this. See the documentation; for linguistic (and psycholinguistic) data it doesn't seem to be used much at all. It is useful when the interacting covariates are in different units.

Concurvity (Covariate Interaction in GAMs)

  1. s(): for single predictors, or for interactions when the variables are on the same scale. When used for an interaction, it does not include the main effects.
  2. te(): for modeling 2- or n-dimensional interaction surfaces of variables that are not on the same scale. Includes the main effects.
  3. ti(): for modeling 2- or n-dimensional interaction surfaces that do not include the main effects (i.e. it models only the interaction).

For example, te(time,F7) would model \(f_1(time)+f_2(F7)+f_3(time,F7)\), while ti(time,F7) would model only \(f_3(time,F7)\). Without getting too far into the mathematical weeds: if you absolutely must keep two covariates on different scales (which here might actually make sense), I would suggest fitting the full model with te() first, and then proceeding to fit smaller ones if there are convergence errors or the visualization looks problematic.
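As a sketch (these particular models are not fit in these notes), the two parameterizations would look like:

#Illustrative: full tensor-product surface vs. separated main effects + pure interaction
gam_te = gam(CP ~ te(time, F7) + gramCat, data=df)
gam_ti = gam(CP ~ s(time) + s(F7) + ti(time, F7) + gramCat, data=df)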

The definition of concurvity is first found in Buja et al. (1989) and is essentially the non-linear form of collinearity. For collinearity, we say that two predictors, \(x_1\) and \(x_2\), are collinear when they are linear combinations (or near-linear combinations) of one another (e.g. \(x_1 = 2 + 4x_2\)). This is easily measured by the correlation between \(x_1\) and \(x_2\): the closer the correlation is to one or negative one, the more collinear the two predictors are. This causes a problem for regression analysis because the more correlated two predictors are, the harder it is to disentangle the effects of \(x_1\) from \(x_2\) on some \(y\). Sometimes this is handled by residualization (i.e. regressing \(x_1\) on \(x_2\) and then using the residuals from that regression, rather than \(x_1\), as a predictor in the original model). Recently, this has been questioned as an appropriate method (see York, 2012). The non-linear form of collinearity is concurvity: when the smooths for two predictors are related to each other in non-linear ways.

So, how important is concurvity between covariates? See the graphical illustration below:

A True Description of Concurvity

There is conflicting advice in the literature. On the one hand, Wood (2006:180) states that if the model converges, concurvity isn't as problematic. Yet the mgcv library provides a measure of concurvity and states in the documentation that you should be concerned if such measures approach 1.0. Also, concurvity can be expected with spatial data.

If your research questions are particularly well suited to discussing 3-D and 4-D non-linear effects between predictors on the dependent variable of interest, then you should feel free to worry about concurvity and calibrate your model to account for it appropriately.

Otherwise, unless there is an issue with convergence that you think could be resolved by including an interaction, adding these interactions to the model increases the likelihood that your model will produce spurious results unless you have a mountain of data. (This is the same issue as higher-order interactions in regular regression models: unless you have the appropriate coverage or an a priori theoretical reason for doing so, it is best not to include interactions unless you have lots of data, e.g. 10,000 or more observations.)

From the package documentation (where the most up-to-date advice on this lives), the concurvity function produces a measure of this:

This routine computes three related indices of concurvity, all bounded between 0 and 1, with 0 indicating no problem, and 1 indicating total lack of identifiability. The three indices are all based on the idea that a smooth term, f, in the model can be decomposed into a part, g, that lies entirely in the space of one or more other terms in the model, and a remainder part that is completely within the term’s own space. If g makes up a large part of f then there is a concurvity problem. The indices used are all based on the square of ||g||/||f||, that is the ratio of the squared Euclidean norms of the vectors of f and g evaluated at the observed covariate values.

The three measures are as follows

worst This is the largest value that the square of ||g||/||f|| could take for any coefficient vector. This is a fairly pessimistic measure, as it looks at the worst case irrespective of data. This is the only measure that is symmetric.

observed This just returns the value of the square of ||g||/||f|| according to the estimated coefficients. This could be a bit over-optimistic about the potential for a problem in some cases.

estimate This is the squared F-norm of the basis for g divided by the F-norm of the basis for f. It is a measure of the extent to which the f basis can be explained by the g basis. It does not suffer from the pessimism or potential for over-optimism of the previous two measures, but is less easy to understand.

Let's take another measurement: F7. Is there a non-linear relation between F7 and time?

df$F7 = as.numeric(df$F7)

gamb1 = gam(F7 ~ s(time), data=df)
summary(gamb1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## F7 ~ s(time)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.12968    0.02456   45.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 8.915  8.998 94.04  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0846   Deviance explained = 8.55%
## GCV = 5.5092  Scale est. = 5.5032    n = 9120

I know that if I include both time and F7 in a model, this should induce concurvity.

gamcc1 = gam(CP ~ s(time) + s(F7) + gramCat, data=df)

visreg(gam1,"time")

visreg(gamcc1,"time")

summary(gamcc1)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + s(F7) + gramCat
## 
## Parametric coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.70889    0.04126   41.42   <2e-16 ***
## gramCatUngrammatical  1.15659    0.05851   19.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F p-value    
## s(time) 8.864  8.994 213.3  <2e-16 ***
## s(F7)   7.428  8.432 130.5  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.319   Deviance explained =   32%
## GCV = 7.7376  Scale est. = 7.7221    n = 9120
concurvity(gamcc1)
##               para    s(time)      s(F7)
## worst    0.5028127 0.17401591 0.17644007
## observed 0.5028127 0.10844890 0.08925496
## estimate 0.5028127 0.08570264 0.05582867
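To see which pair of terms drives any concurvity, the same function accepts full=FALSE, returning matrices of pairwise measures (one matrix per index):

#Pairwise (term-by-term) concurvity measures
concurvity(gamcc1, full=FALSE)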

Random Effects and Beyond

We can incorporate random effects into a GAM: each RE term can have an intercept, slope and smooth associated with it.

Random Effect Intercepts

gamint = gam(CP ~ s(time) + gramCat + s(time, by=gramCat) + s(subject, bs="re"), data=df)

summary(gamint)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + gramCat + s(time, by = gramCat) + s(subject, bs = "re")
## 
## Parametric coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.70738    0.34879   4.895 9.99e-07 ***
## gramCatUngrammatical  1.15962    0.04984  23.266  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf  Ref.df      F  p-value    
## s(time)                       8.1934  8.5785 131.67  < 2e-16 ***
## s(time):gramCatGrammatical    0.6667  0.6667  23.64 7.23e-05 ***
## s(time):gramCatUngrammatical  8.1764  8.5741  47.94  < 2e-16 ***
## s(subject)                   17.9076 18.0000 193.89  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 47/48
## R-sq.(adj) =  0.501   Deviance explained = 50.3%
## GCV = 5.6869  Scale est. = 5.6639    n = 9120
summary(gamint)$dev.expl
## [1] 0.5026216
summary(gam1)$dev.expl
## [1] 0.3105084
visreg(gamint,"time", by="subject", main="Subject Intercepts")

Random Effect Slopes

We can add a random slope for time, controlling for each subject's trend over time (i.e. the average linear effect of time for each subject).

gamint2 = gam(CP ~ s(time) + gramCat + s(time, by=gramCat) + s(subject, bs="re") + s(subject, time, bs="re"), data=df)
summary(gamint2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + gramCat + s(time, by = gramCat) + s(subject, bs = "re") + 
##     s(subject, time, bs = "re")
## 
## Parametric coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.70738    0.24324   7.019 2.39e-12 ***
## gramCatUngrammatical  1.15962    0.04834  23.990  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf  Ref.df       F  p-value    
## s(time)                       8.2129  8.5854  139.70  < 2e-16 ***
## s(time):gramCatGrammatical    0.6667  0.6667   24.84 4.75e-05 ***
## s(time):gramCatUngrammatical  8.2029  8.5829   50.81  < 2e-16 ***
## s(subject)                   17.4314 18.0000 3192.74  < 2e-16 ***
## s(subject,time)              17.2556 18.0000 2884.63  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 66/67
## R-sq.(adj) =   0.53   Deviance explained = 53.3%
## GCV = 5.3589  Scale est. = 5.3273    n = 9120
summary(gamint2)$dev.expl
## [1] 0.5330451
summary(gam1)$dev.expl
## [1] 0.3105084
visreg(gamint2,"time", by="subject",main="Subject Slopes")

Random Effect Smooths

This one is not too difficult to understand once you understand slopes and intercepts: you are controlling for the fact that each individual person has a separate smooth (non-linear function) of a continuous predictor. The bs="fs" smooth in mgcv sets up an interaction between the covariate (here, time) and each level of subject. We use a shared smoothing parameter (rather than estimating a separate smoothing parameter for each subject).

gamint3 = gam(CP ~ s(time) + gramCat + s(time, by=gramCat) + s(subject, time, bs="fs", m=1), data=df)
## Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has
## repeated 1-d smooths of same variable.
summary(gamint3)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## CP ~ s(time) + gramCat + s(time, by = gramCat) + s(subject, time, 
##     bs = "fs", m = 1)
## 
## Parametric coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.838e+03  1.034e+03  -1.778   0.0754 .  
## gramCatUngrammatical  1.160e+00  3.868e-02  29.983   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                  edf Ref.df       F  p-value    
## s(time)                        7.878  7.941 103.256  < 2e-16 ***
## s(time):gramCatGrammatical     7.825  8.058   5.340 8.57e-07 ***
## s(time):gramCatUngrammatical   1.235  1.254   0.706    0.424    
## s(subject,time)              166.903 93.000 128.521    0.784    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 199/200
## R-sq.(adj) =  0.699   Deviance explained = 70.5%
## GCV = 3.4815  Scale est. = 3.4106    n = 9120
summary(gamint3)$dev.expl
## [1] 0.7054082
summary(gam1)$dev.expl
## [1] 0.3105084
visreg(gamint3,"time", by="subject",main="Subject Smooths")

NB: There is another function, gamm, that fits generalized additive mixed-effects models. I find that gam() with bs="re" fits better and has fewer convergence issues. With gamm(), you can't specify random smooths (that I'm aware of), and it fits the random effects separately via the nlme package.
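For comparison, a random-intercept version of the model in gamm() would look roughly like this (a sketch, not fit in these notes); gamm() returns both a $gam and an $lme component:

#Illustrative: the gamm() analogue with a random intercept per subject
gamm1 = gamm(CP ~ s(time) + s(time, by=gramCat) + gramCat,
             random=list(subject=~1), data=df)
summary(gamm1$gam) #the GAM part; gamm1$lme holds the nlme fit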

Assignment 3

You can focus on one of two things: the random effects structure or concurvity (depending on your research question of interest).

  1. Explore models of covariates (or random effects, or both), including different interaction/RE structures and smooths.
  2. Assess the models: does one model perform better (i.e. higher deviance explained)? Does that conflict with the interpretability of the model (i.e. which model best addresses the scientific question you are interested in)?

Submit a .pdf of the Rmd output. Email to jroy042@illinois.edu with a Subject line: “Nonlinear Assignment 3”. Please submit by Friday, October 14 at noon.

References

Buja, A., Hastie, T., & Tibshirani, R. (1989). Linear smoothers and additive models. The Annals of Statistics, 17(2), 453-510.

Wood, S. (2006). Generalized Additive Models: An Introduction with R. CRC Press.

York, R. (2012). Residualization is not the answer: Rethinking how to address multicollinearity. Social Science Research, 41(6), 1379-1386.