One of the main uses of R is exploration and learning. Let’s say that I wanted to learn simple linear regression (the bread and butter of statistics) and see how the formulas work. I could simulate a simple example and fit the regression with R:
library(arm) # For display()
# Simulate 5 observations
set.seed(50)
x <- 1:5
y <- 2 + 3*x + rnorm(5, mean = 0, sd = 3)
# Fit regression
dat <- data.frame(x, y)
reg <- lm(y ~ x, dat)
display(reg)
# lm(formula = y ~ x, data = dat)
# coef.est coef.se
# (Intercept) 3.99 3.05
# x 2.04 0.92
# ---
# n = 5, k = 2
# residual sd = 2.91, R-Squared = 0.62
# Plot it
plot(y ~ x)
abline(coef(reg))
The formulas for the intercept (\(b_0\)) and the slope (\(b_1\)) are pretty simple, and I have been told that there is a generic expression that instead uses matrices.
\(b_1 = \frac{\sum{x y} - n \bar{x} \bar{y}}{\sum{x x} - n \bar{x}^2}\)
\(b_0 = \bar{y} - b_1 \bar{x}\)
\(\boldsymbol{b} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}'\boldsymbol{y}\)
How do the contents of the matrices and the simple formulas relate to each other?
# Formulas for slope and intercept
b1 <- (sum(x*y) - length(x)*mean(x)*mean(y))/(sum(x*x) - length(x)*mean(x)^2)
b0 <- mean(y) - b1*mean(x)
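Just to be safe, we can check that these hand-computed values match the coefficients estimated by lm():
# Compare the hand-computed values with lm()'s estimates
c(b0, b1)
coef(reg)
all.equal(c(b0, b1), unname(coef(reg)))
# [1] TRUE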
Funnily enough, looking at the matrices we can see similar sums of squares and crossproducts as in the formulas.
X <- model.matrix(reg)
X
# (Intercept) x
# 1 1 1
# 2 1 2
# 3 1 3
# 4 1 4
# 5 1 5
# attr(,"assign")
# [1] 0 1
t(X) %*% X
# (Intercept) x
# (Intercept) 5 15
# x 15 55
# So X`X contains bits and pieces of the previous formulas
length(x)
# [1] 5
sum(x)
# [1] 15
sum(x*x)
# [1] 55
# And so does X`y
t(X) %*% y
# [,1]
# (Intercept) 50.61283
# x 172.27210
sum(y)
# [1] 50.61283
sum(x*y)
# [1] 172.2721
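Just to make the correspondence explicit (this is only a check, not something lm() would do), we can rebuild X`X and X`y from those scalar sums and confirm they match the matrix products:
# Rebuild X`X and X`y from the scalar sums
n <- length(x)
XtX <- matrix(c(n, sum(x), sum(x), sum(x*x)), nrow = 2)
Xty <- c(sum(y), sum(x*y))
all.equal(XtX, unname(t(X) %*% X))
# [1] TRUE
all.equal(Xty, as.vector(t(X) %*% y))
# [1] TRUE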
# So if we combine the whole lot and remember that
# solve() calculates the inverse
solve(t(X) %*% X) %*% t(X) %*% y
# [,1]
# (Intercept) 3.992481
# x 2.043362
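As an aside, we don’t even need to form the inverse explicitly: solve() can take both sides of the normal equations and solve the linear system directly, giving the same coefficients:
# Solve the normal equations X`X b = X`y without forming the inverse
solve(crossprod(X), crossprod(X, y))
# Same coefficients as above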
But I have been told that R (like most statistical software) doesn’t use the inverse of the matrix to estimate the coefficients. So how does it work?
If I type lm, R will print the code of the lm() function. A quick look will reveal that there is a lot of code reading the arguments and checking that everything is OK before proceeding. However, the function then calls something else: lm.fit(). With some trepidation I type lm.fit, which again performs more checks and then calls something with a different notation:
z <- .Call(C_Cdqrls, x, y, tol, FALSE)
This denotes a call to a C-language function which, after some searching on Google, we can find in readable form in the lm.c file. Another quick look brings more checking and a call to Fortran code:
F77_CALL(dqrls)(REAL(qr), &n, &p, REAL(y), &ny, &rtol,
REAL(coefficients), REAL(residuals), REAL(effects),
&rank, INTEGER(pivot), REAL(qraux), work);
which is a highly tuned routine for QR decomposition in a linear algebra library. By now we know that the general matrix expression produces the same results as our initial formulas, and that R’s lm() function does not use a matrix inverse but a QR decomposition to solve the system of equations.
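We can mimic the idea from plain R code (a rough sketch of what goes on, not the compiled routine itself) with the built-in qr() functions:
# Sketch of the QR approach: decompose X, then solve the triangular system
QR <- qr(X)
qr.coef(QR, y)
# Equivalently, solve R b = Q`y by back-substitution
backsolve(qr.R(QR), crossprod(qr.Q(QR), y))
# Both give the same coefficients as lm()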
One of the beauties of R is that it brought the power of statistical computing to the masses, by not only letting you fit models but also letting you peek at how things are implemented. As a user, I don’t need to know that there is a chain of function calls initiated by my bread-and-butter linear regression. But it is comforting to the nerdy me that I can have a quick look at it.
All this for free, which sounds like a very good deal to me.