What does GLM mean?
GLM stands for Generalized Linear Models. These statistical models are used when one wants to predict a variable using known functions of the predictor variables. Under these conditions, GLM are a powerful statistical tool used by researchers throughout the world.
Example: Annual income increase
Let’s explore an example for a salary that increases every year and let’s predict future salaries.
# creating income date from 2010 to 2020
year = 2010:2020
# crating a base salary, which begins with 1000 and increases 100 yearly. In other words, we are supposing that the income increases 100 yearly.
income.base = seq(1000, # start salary
length.out = length(year), # vector size
by = 100) # yearly increase
# Let's define an income deviation, which will be added to the income base to obtain the real income. That is, we only know the real income; we do not know the base income that increases 100 yearly and we wish to find out this 100 yearly increase.
set.seed(2) # defining a seed so that we can repeat this simulation
income.deviation = rnorm(length(year),
sd = 50) # standard deviation for the income deviation
income.real = income.base + income.deviation
# let's plot the data and see how it looks
plot(year, # x
income.base, # y
type = 'l', # plot line
xlab = "Year", # x axis name
ylab = "Income", # y axis name
lwd = 10, # line thickness
mgp=c(2.3,0.7,0), # change the position of texts and axes
cex.lab = 1.4, # increase size of text of the axes
cex.axis = 1.4) # increase size of numbers in axes
grid() # create grid lines
lines(year, # x
income.base, # y
lwd = 10) # line thickness
points(year, # x
income.real, # y
cex = 2, # point size
pch = 16, # point type
col = "red") # point color
# creating legend
legend('topleft', # legend position
c("Real", "Measured"), # legend text
lwd = 7, # line thickness in legend
lty = c(1, 0), # line type in legend
col = c('black', 'red'), # line color in legend
pch = c(-1, 16), # point type in legend
cex = 2) # increase legend size
data:image/s3,"s3://crabby-images/fb9d2/fb9d29150d25ff1b5250e074ad2c1dbc1c8555aa" alt=""
Now that we have the dataset, we can train our GLM model. As the model we will train is simple, we will use the lm function in R.
# training the GLM model
model.GLM = lm (income.real # what we want to predict
~ # prediction formula comes after ~
year) # what we will use to predict
# let's look at the summary of the model
summary(model.GLM)
##
## Call:
## lm(formula = income.real ~ year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.80 -33.28 -11.15 16.99 80.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.083e+05 9.073e+03 -22.95 2.68e-09 ***
## year 1.041e+02 4.503e+00 23.12 2.52e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.22 on 9 degrees of freedom
## Multiple R-squared: 0.9834, Adjusted R-squared: 0.9816
## F-statistic: 534.6 on 1 and 9 DF, p-value: 2.517e-09
# the model found an annual increase of 104 +- 4, which inclues the 100 yearly increase of the base income
# let's plot the model and compare the GLM function
income.estimated = predict(model.GLM, data.frame(year = year))
plot(year, # x
income.base, # y
type = 'l', # plot line
xlab = "Year", # x axis name
ylab = "Income", # y axis name
lwd = 10, # line thickness
mgp=c(2.3,0.7,0), # change position of axes texts
cex.lab = 1.4, # increase axes text
cex.axis = 1.4) # increase number of axes
grid() # create grid lines
lines(year, # x
income.base, # y
lwd = 10) # line thickness
lines(year, # x
income.estimated, # y
lwd = 4, # line thickness
lty = 2, # line type
col = "red") # line color
# creating a legend
legend('topleft', # legend position
c("Real", "Estimated"), # legend text
lwd = 7, # legend line thickness
lty = c(1, 2), # legend line type
col = c('black', 'red'), # legend line color
cex = 2) # increase legend size
data:image/s3,"s3://crabby-images/4eb68/4eb68b268a836417471cc1fd005b3e93cdba042d" alt=""
to predict future salary, we can plot the estimated income with a 95% confidence interval.
years.future = 2010:2030
income.predicted = predict(model.GLM, data.frame(year = years.future), interval = "prediction")
plot(years.future, # x
income.predicted[,1], # y
type = 'l', # plot line
xlab = "Year", # x axis name
ylab = "Income", # y axis name
lwd = 10, # line thickness
mgp=c(2.3,0.7,0), # change position of axes texts
cex.lab = 1.4, # increase axes text
cex.axis = 1.4) # increase number of axes
grid() # create grid lines
lines(years.future, # x
income.predicted[,1], # y
lwd = 10) # line thickness
# lower confidence interval
lines(years.future, # x
income.predicted[,2], # y
lwd = 4, # line thickness
lty = 2, # line type
col = "red") # line color
# upper confidence interval
lines(years.future, # x
income.predicted[,3], # y
lwd = 4, # line thickness
lty = 2, # line type
col = "red") # line color
data:image/s3,"s3://crabby-images/41f72/41f72006ba71e50f5514eb8e838a6654423fbd1d" alt=""
Where can I go to learn more?
I recommend the following literature:
- Book that introduces several statistical techniques in the R environment . Zuur AF, Ieno EN, Walker NJ, Saveliev AA, Smith GM. Mixed Effects Models and Extensions in Ecology with R. Springer, New York, 2009.
- Book that discusses GLM theory in R. Pinheiro JC, Bates DM. Mixed-effects models in S and S-PLUS. Springer, New York, 2000.