Multiple Regression Basics
Motivation
Introductory Example: tree dataset from R
using Statistics
using Plots
using RDatasets
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
    legend=false, xlabel="Girth", ylabel="Volume")
Aim: Find the relationship between the response variable volume and the explanatory variable (covariate) girth.
Question: Can we predict the volume of a tree given its girth?
scatter(trees.Girth, trees.Volume,
    legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
First Guess: There is a linear relation!
Simple Linear Regression
Main assumption: up to some error term, each measurement of the response variable \(y_i\) depends linearly on the corresponding value \(x_i\) of the covariate
\(\leadsto\) (Simple) Linear Model: \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,\] where \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\) are independent normally distributed errors with unknown variance \(\sigma^2\).
Aim: Find the straight line that fits best, i.e., find the optimal estimators for \(\beta_0\) and \(\beta_1\).
Typical choice: Least squares estimator (= maximum likelihood estimator for normal errors)
\[ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 \]
where \(\mathbf{y}\) is the vector of responses, \(\mathbf{x}\) is the vector of covariates and \(\mathbf{1}\) is a vector of ones.
Written in matrix style:
\[ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 \]
Note: There is a closed-form expression for \((\hat \beta_0, \hat \beta_1)\). We will not make use of it here, but rather use Julia to solve the problem.
using GLM
linmod1 = lm(@formula(Volume ~ Girth), trees)
Interpretation of the Julia output:
- Column Estimate: least squares estimates \(\hat \beta_0\) and \(\hat \beta_1\)
- Column Std. Error: estimated standard deviation \(\hat s_{\hat \beta_i}\) of the estimator \(\hat \beta_i\)
- Column t value: value of the \(t\)-statistic \[ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1. \] Under the hypothesis \(\beta_i=0\), the test statistic \(t_i\) follows a \(t\)-distribution.
- Column Pr(>|t|): \(p\)-values for the hypotheses \(\beta_i=0\) for \(i=0,1\)
The command rand(n) generates a sample of n "random" (i.e., uniformly distributed) numbers. If you want to sample from another distribution, use the Distributions package, define an object representing the distribution of interest, e.g. d = Normal(0.0, 2.0) for a normal distribution with mean 0.0 and standard deviation 2.0, and sample n times from this distribution via rand(d, n).
- Generate \(n=20\) covariates \(\mathbf{x}\) randomly.
- Given these covariates and the true parameters \(\beta_0=-3\), \(\beta_1=2\) and \(\sigma=0.5\), simulate responses from a linear model (with normally distributed errors) and estimate the coefficients \(\beta_0\) and \(\beta_1\).
- Play with different choices of the parameters above (including the sample size \(n\)) to see the effects on the parameter estimates and the \(p\)-values.
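A minimal sketch of such a simulation, assuming the DataFrames, Distributions and GLM packages are available (the seed is used only for reproducibility):

```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(42)                       # for reproducibility
n = 20
x = rand(n)                            # n uniformly distributed covariates
β0, β1, σ = -3.0, 2.0, 0.5             # true parameters from the exercise
ε = rand(Normal(0.0, σ), n)            # independent normal errors
y = β0 .+ β1 .* x .+ ε                 # responses from the linear model

sim = DataFrame(x = x, y = y)
lm(@formula(y ~ x), sim)               # compare the estimates with the true parameters
```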
Multiple Regression Model
Idea: Generalize the simple linear regression model to multiple covariates, e.g., predict volume using girth and height.
\(\leadsto\) Linear Model: \[y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,\] where
- \(y_i\): \(i\)-th measurement of the response,
- \(x_{i1}\): \(i\)-th value of the first covariate,
- …
- \(x_{ip}\): \(i\)-th value of the \(p\)-th covariate,
- \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\): independent normally distributed errors with unknown variance \(\sigma^2\).
Task: Find the optimal estimators for \(\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)\).
Our choice again: Least squares estimator (= maximum likelihood estimator for normal errors)
\[ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 \]
where \(\mathbf{y}\) is the vector of responses, \(\mathbf{x}_j\) is the vector of values of the \(j\)-th covariate and \(\mathbf{1}\) is a vector of ones.
Written in matrix style:
\[ \mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 \]
Defining the design matrix
\[ \mathbf{X} = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{array}\right) \qquad (\text{size } n \times (p+1)), \]
we get the short form
\[ \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]
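As an illustration, the closed-form expression can be evaluated directly in Julia; this is a sketch assuming the trees data frame loaded above and the model volume ~ girth + height:

```julia
# Design matrix X = (1, girth, height) and response vector y from the trees data
X = hcat(ones(length(trees.Volume)), trees.Girth, trees.Height)
y = trees.Volume

β̂ = (X' * X) \ (X' * y)   # solves the normal equations (X'X) β = X'y
# Equivalent and numerically preferable: β̂ = X \ y (QR factorization)
```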
Using the GLM package, the same model can be fitted via
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
The interpretation of the Julia output is similar to the simple linear regression model, but we provide explicit formulas now:
parameter estimates:
\[ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]
estimated standard errors:
\[ \hat s_{\hat \beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \, \frac{1}{n-p-1} \, \|\mathbf{y} - \mathbf{X} \hat \beta\|^2} \]
\(t\)-statistics:
\[ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. \]
\(p\)-values:
\[ p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p-1} \]
- Implement functions that estimate the \(\beta\)-parameters, the corresponding standard errors and the \(t\)-statistics.
- Test your functions with the trees data set and try to reproduce the output above.
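One possible sketch of such functions, following the formulas above (the function names are chosen here for illustration only):

```julia
using Distributions, LinearAlgebra

# Least squares estimate β̂ = (X'X)⁻¹ X'y
estimate_beta(X, y) = (X' * X) \ (X' * y)

# Estimated standard errors ŝ_{β̂ᵢ} based on the residual variance
function estimate_stderrors(X, y)
    n, k = size(X)                        # k = p + 1 columns including the intercept
    β̂ = estimate_beta(X, y)
    σ̂² = sum(abs2, y - X * β̂) / (n - k)   # ‖y - Xβ̂‖² / (n - p - 1)
    return sqrt.(diag(inv(X' * X)) .* σ̂²)
end

# t-statistics and two-sided p-values based on the t_{n-p-1} distribution
function t_statistics(X, y)
    n, k = size(X)
    t = estimate_beta(X, y) ./ estimate_stderrors(X, y)
    pvals = 2 .* ccdf.(TDist(n - k), abs.(t))
    return t, pvals
end

# Test on the trees data (design matrix as above)
X = hcat(ones(length(trees.Volume)), trees.Girth, trees.Height)
t_statistics(X, trees.Volume)
```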
Which model is the best? For linear models, one often uses the \(R^2\) characteristic. Roughly speaking, it gives the proportion (between 0 and 1) of the variance that can be explained by the linear model.
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
The more covariates you add, the more variance can be explained by the linear model, so \(R^2\) increases. In order to balance goodness of fit of a model against its complexity, information criteria such as aic are considered.
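For example, assuming the StatsBase package is available (it provides the aic function for fitted models), the three models can be compared as follows, with smaller values indicating a better trade-off:

```julia
using StatsBase   # provides aic for fitted statistical models

aic(linmod1), aic(linmod2), aic(linmod3)
```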
Generalized Linear Models
Classical linear model
\[ \mathbf{y} = \mathbf{X} \beta + \varepsilon \]
implies that \[ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).\]
In particular, the conditional expectation satisfies \(\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta\).
However, the assumption of a normal distribution is unrealistic for non-continuous data. Popular alternatives include:
for count data: \[ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) \]
Here, the components are considered to be independent and the exponential function is applied componentwise.
for binary data: \[ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \]
Again, the components are considered to be independent and all the operations are applied componentwise.
All these models are defined by the choice of a family of distributions and a function \(g\) (the so-called link function) such that
\[ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). \]
For the models above, these are:
| Type of Data | Distribution Family | Link Function |
|---|---|---|
| continuous | Normal | identity: \( g(x)=x \) |
| count | Poisson | log: \( g(x) = \log(x) \) |
| binary | Bernoulli | logit: \( g(x) = \log\left( \frac{x}{1-x} \right) \) |
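In GLM.jl, these combinations correspond to the distribution family and link arguments of glm. A sketch with a small artificial data set, used here only to illustrate the calls:

```julia
using DataFrames, GLM

# Tiny artificial data set (values chosen only for illustration)
df = DataFrame(x = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1],
               y_cont  = [0.2, 1.1, 1.9, 2.4, 3.6, 4.1],   # continuous response
               y_count = [0, 1, 1, 2, 4, 5],                # count response
               y_bin   = [0, 0, 1, 0, 1, 1])                # binary response

glm(@formula(y_cont ~ x), df, Normal(), IdentityLink())   # classical linear model
glm(@formula(y_count ~ x), df, Poisson(), LogLink())      # Poisson regression
glm(@formula(y_bin ~ x), df, Bernoulli(), LogitLink())    # logistic regression
```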
In general, the parameter vector \(\beta\) is estimated via maximizing the likelihood, i.e.,
\[ \hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i \cdot}), \]
which is equivalent to the maximization of the log-likelihood, i.e.,
\[ \hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i \cdot}). \]
In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above.
using CSV
using DataFrames
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!, "participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2), SwissLabor, Binomial(), ProbitLink())
- Reproduce the results of our data analysis of the trees data set using a generalized linear model with normal distribution family.
- Generate \(n=20\) random covariates \(\mathbf{x}\) and Poisson-distributed count data with parameters \(\beta_0 + \beta_1 x_i\). Re-estimate the parameters by a generalized linear model.
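A sketch for the second item, assuming (as in the Poisson model above) that the mean of the counts is \(\exp(\beta_0 + \beta_1 x_i)\) and using example values for the true parameters:

```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(1)
n = 20
x = rand(n)                                    # random covariates
β0, β1 = 0.5, 1.5                              # example values for the true parameters
y = rand.(Poisson.(exp.(β0 .+ β1 .* x)))       # Poisson counts with mean exp(β0 + β1 xᵢ)

sim = DataFrame(x = x, y = y)
glm(@formula(y ~ x), sim, Poisson(), LogLink())   # re-estimate β0 and β1
```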
Outlook: Linear Mixed Models
In the linear regression models so far, we assumed that the response variable \(\mathbf{y}\) equals the design matrix of covariates \(\mathbf{X}\), which is assumed to be given/fixed, multiplied by the so-called fixed-effects coefficients \(\beta\), plus independent errors \(\varepsilon\). However, in many situations there are also random effects acting on several components of the response variable. These can be included in the model by adding another design matrix \(\mathbf{Z}\) multiplied by a random vector \(u\), the so-called random-effects coefficients, which are assumed to be jointly normally distributed with mean vector \(0\) and variance-covariance matrix \(\Sigma\) (typically not a diagonal matrix). In matrix notation, the model has the following form:
\[ \mathbf{y} = \mathbf{X} \beta + \mathbf{Z}u + \varepsilon \]
Maximizing the likelihood, we can estimate \(\beta\) and optimally predict the random vector \(u\).
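In Julia, such models can be fitted with the MixedModels package. A minimal sketch, using the sleepstudy data that ships with that package rather than the trees data:

```julia
using MixedModels

# Reaction time explained by a fixed effect of days plus a random intercept
# and a random slope for days per subject
sleep = MixedModels.dataset(:sleepstudy)
fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)), sleep)
```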