Categories
Generalized Additive Model (GAM) R

Introduction to GAM

What are GAMs?

GAM stands for Generalized Additive Models. This is an statistical technique used to approximate smooth functions to predict a variable. In essence, smooth functions are obtained based on several cubic functions (in practice, more complex functions are utilized).

When should I use GAMs?

You could use GAMs when you do not know what is the prediction function but know the predictors (variables used to predict). Let’s take as an example the task to predict the variation fo the temperature through the day. In this example, the temperature could be predicted using the time of the day as the predictor and let GAM learn what is the the predictor function.

Example: Predicting temperature variation through the day

Let’s explore an artificial example in which we want to predict the temperature through the day. Let’s create our dataset

# creating time from 0 to 24 h with 1000 samples
hour = seq(0, 24, length.out = 1e3)
# creating the temperature as a fix value of 25 plus a 5 degrees oscillation
temp_real = 25 + 5*sin(hour/24*2*pi)

plot(hour, # x
     temp_real, # y
     type = 'l', # plot line
     xlab = "Time of the day", # label for x axis
     ylab = expression(paste("Temperature ("^{o},"C)", sep = "")), # label for y axis
     ylim = c(18, 32), # limits of the y axis
     lwd = 10, # line thickness
     lty = 2, # line type
     xaxs = "i", # x axis start at the lower limit of x and ends with the upper limit of x
     yaxs = "i", # y axis start at the lower limit of y and ends with the upper limit of y
     mgp=c(2.3,0.7,0), # changing the position of the labels in the axes
     cex.lab = 1.4, # increasing the label size
     cex.axis = 1.4) # increasing the number size
grid() # creating grid lines

# Let's add a measurement error
set.seed(1) # defining seed to be able to repeat the simulation
errors_measurement = rnorm(length(hour))
temp_measured = temp_real + errors_measurement

lines(hour, # x
      temp_measured, # y
      col = 'red', # line color
      lwd = 4) # line thickness

# re-ploting the real temperature so that this line stays on top of the temperature measured with errors
lines(hour, # x
      temp_real, # y
      lwd = 10, # line thickness
      lty = 2) # line type

# Creating the legend
legend('topright', # legend position
       c("Real", "Measured"), # legend text
       lwd = 7, # thickness of lines in the legend
       lty = c(2, 1), # types of lines in the legend
       col = c('black', 'red'), # color of the lines in the legend
       cex = 2) # increase legend size

Now that we have created the dataset, we can train our GAM model. To do that, we will need the R libraries mgcv and nlme.

library(mgcv) # loading the mgcv library, which also loads the nlme library

## Loading required package: nlme
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.

# training the gam model
model.gam = gam(temp_measured # what we want to predict
                ~ # prediction formula after the ~ 
                  s(hour) # s(.) defines a smooth function of hour
                )
# now we look at the summary of the model
# The model was able to explain 92.3% of the deviance
summary(model.gam)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## temp_measured ~ s(hour)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.98835    0.03277   762.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df    F p-value    
## s(hour) 7.049  8.112 1451  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.922   Deviance explained = 92.3%
## GCV = 1.0827  Scale est. = 1.074     n = 1000

# Let's us plot the function learned by the model and compare it with the real temperature function
# obtaining predicted temperatures
temp_pred = predict(model.gam, data.frame(hour = hour))

# Let's replot the previous figure
plot(hour, # x
     temp_real, # y
     type = 'l', # plot line
     xlab = "Time of the day", # label for x axis
     ylab = expression(paste("Temperature ("^{o},"C)", sep = "")), # label for y axis
     ylim = c(18, 32), # limits of the y axis
     lwd = 10, # line thickness
     lty = 2, # line type
     xaxs = "i", # x axis start at the lower limit of x and ends with the upper limit of x
     yaxs = "i", # y axis start at the lower limit of y and ends with the upper limit of y
     mgp=c(2.3,0.7,0), # changing the position of the labels in the axes
     cex.lab = 1.4, # increasing the label size
     cex.axis = 1.4) # increasing the number size
grid() # creating grid lines

lines(hour, # x
      temp_measured, # y
      col = 'red', # line color
      lwd = 4) # line thickness

# re-ploting the real temperature so that this line stays on top of the temperature measured with errors
lines(hour, # x
      temp_real, # y
      lwd = 10, # line thickness
      lty = 2) # line type

lines(hour, # x
      temp_pred, # y
      lwd = 10, # line thickness
      lty = 3, # line type
      col = 'gold') # line color

# creating the legend
legend('topright', # legend position
       c("Real", "Measured", "Predicted"), # legend text
       lwd = 7, # thickness of lines in the legend
       lty = c(2, 1, 3), # types of lines in the legend
       col = c('black', 'red', 'gold'), # color of the lines in the legend
       cex = 2) # increase legend size

As we can see in the figure above, the temperature predicted with the GAM model is very similar to the real temperature.

# let's plot the difference
plot(hour, # x
     temp_real - temp_pred, # y
     type = 'l', # plot line
     xlab = "Time of the day", # label for x axis
     ylab = expression(paste("Predicted temperature error ("^{o},"C)", sep = "")), # label for y axis
     ylim = c(-0.3, 0.3), # limits of the y axis
     lwd = 10, # line thickness
     lty = 2, # line type
     xaxs = "i", # x axis start at the lower limit of x and ends with the upper limit of x
     yaxs = "i", # y axis start at the lower limit of y and ends with the upper limit of y
     mgp=c(2.3,0.7,0), # changing the position of the labels in the axes
     cex.lab = 1.4, # increasing the label size
     cex.axis = 1.4) # increasing the number size
grid() # create grid lines

# I like to replot the line so that it stays above the grid:
lines(hour, # x
      temp_real - temp_pred, # y
      lwd = 10, # line thickness
      lty = 2) # line type

Where can I go to learn more?

I recommend the following:

  • Book that introduces several statistical techniques in R. Zuur AF, Ieno EN, Walker NJ, Saveliev AA, Smith GM. Mixed Effects Models and Extensions in Ecology with R. Springer, New York, 2009.
  • Book that describes GAM theory in R. Wood SN. Generalized Additive Models: An Introduction with R (2nd ed.). CRC Press, Boca Raton, 2017.
  • Book that describes several machine learning methods. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer, New York. 2008.
  • Book that describes several machine learning methods. Efron B, Hastie T. Computer Age Statistica Inference: Algorithms, Evidence, and Data Science. Cambridge University Press, Cambridge, 2016.

Leave a Reply

Your email address will not be published. Required fields are marked *