Multiple Regression Basics
Motivation
Introductory Example: tree dataset from R
using Statistics
using Plots
using RDatasets
trees = dataset("datasets", "trees")
scatter(trees.Girth, trees.Volume,
    legend=false, xlabel="Girth", ylabel="Volume")
Aim: Find the relationship between the response variable volume and the explanatory variable (covariate) girth.
Question: Can we predict the volume of a tree given its girth?
scatter(trees.Girth, trees.Volume,
    legend=false, xlabel="Girth", ylabel="Volume")
plot!(x -> -37 + 5*x)
First Guess: There is a linear relation!
Simple Linear Regression
Main assumption: up to some error term, each measurement of the response variable \(y_i\) depends linearly on the corresponding value \(x_i\) of the covariate
\(\leadsto\) (Simple) Linear Model: \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i=1,...,n,\] where \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\) are independent normally distributed errors with unknown variance \(\sigma^2\).
Aim: Find the straight line that fits best, i.e., find the optimal estimators for \(\beta_0\) and \(\beta_1\).
Typical choice: Least squares estimator (= maximum likelihood estimator for normal errors)
\[ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x} \beta_1\|^2 \]
where \(\mathbf{y}\) is the vector of responses, \(\mathbf{x}\) is the vector of covariates and \(\mathbf{1}\) is a vector of ones.
Written in matrix style:
\[ (\hat \beta_0, \hat \beta_1) = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}) \left( \begin{array}{c} \beta_0\\ \beta_1\end{array}\right) \right\|^2 \]
Note: There is a closed-form expression for \((\hat \beta_0, \hat \beta_1)\). We will not make use of it here, but rather use Julia to solve the problem.
using GLM
linmod1 = lm(@formula(Volume ~ Girth), trees)
Interpretation of the Julia output:
- Column Estimate: least squares estimates \(\hat \beta_0\) and \(\hat \beta_1\)
- Column Std. Error: estimated standard deviation \(\hat s_{\hat \beta_i}\) of the estimator \(\hat \beta_i\)
- Column t value: value of the \(t\)-statistic \[ t_i = {\hat \beta_i \over \hat s_{\hat \beta_i}}, \quad i=0,1. \] Under the hypothesis \(\beta_i=0\), the test statistic \(t_i\) follows a \(t\)-distribution.
- Column Pr(>|t|): \(p\)-values for the hypotheses \(\beta_i=0\) for \(i=0,1\)
The command rand(n) generates a sample of n "random" (i.e., uniformly distributed) numbers. If you want to sample from another distribution, use the Distributions package, define an object representing the distribution of interest, e.g. d = Normal(0.0, 2.0) for a normal distribution with mean 0.0 and standard deviation 2.0, and sample n times from this distribution via rand(d, n).
- Generate \(n=20\) covariates \(\mathbf{x}\) randomly.
- Given these covariates and the true parameters \(\beta_0=-3\), \(\beta_1=2\) and \(\sigma=0.5\), simulate responses from a linear model (with normally distributed errors) and estimate the coefficients \(\beta_0\) and \(\beta_1\).
- Play with different choices of the parameters above (including the sample size \(n\)) to see the effects on the parameter estimates and the \(p\)-values.
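A minimal sketch of such a simulation, assuming the DataFrames, Distributions and GLM packages are available (the seed is used only for reproducibility):

```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(42)                       # for reproducibility
n = 20
x = rand(n)                            # n uniformly distributed covariates
β0, β1, σ = -3.0, 2.0, 0.5             # true parameters from the exercise
ε = rand(Normal(0.0, σ), n)            # independent normal errors
y = β0 .+ β1 .* x .+ ε                 # responses from the linear model

sim = DataFrame(x = x, y = y)
lm(@formula(y ~ x), sim)               # compare the estimates with the true parameters
```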
Multiple Regression Model
Idea: Generalize the simple linear regression model to multiple covariates, e.g., predict volume using girth and height.
\(\leadsto\) Linear Model: \[y_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \varepsilon_i, \qquad i=1,...,n,\] where
- \(y_i\): \(i\)-th measurement of the response,
- \(x_{i1}\): \(i\)-th value of the first covariate,
- …
- \(x_{ip}\): \(i\)-th value of the \(p\)-th covariate,
- \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\): independent normally distributed errors with unknown variance \(\sigma^2\).
Task: Find the optimal estimators for \(\mathbf{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)\).
Our choice again: Least squares estimator (= maximum likelihood estimator for normal errors)
\[ \hat \beta = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{1} \beta_0 - \mathbf{x}_1 \beta_1 - \ldots - \mathbf{x}_p \beta_p\|^2 \]
where \(\mathbf{y}\) is the vector of responses, \(\mathbf{x}_j\) is the vector of values of the \(j\)-th covariate and \(\mathbf{1}\) is a vector of ones.
Written in matrix style:
\[ \mathbf{\hat \beta} = \mathrm{argmin} \ \left\| \mathbf{y} - (\mathbf{1},\mathbf{x}_1,\ldots,\mathbf{x}_p) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p\end{array} \right) \right\|^2 \]
Defining the design matrix
\[ \mathbf{X} = \left( \begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{array}\right) \qquad (\text{size } n \times (p+1)), \]
we get the short form
\[ \mathbf{\hat \beta} = \mathrm{argmin} \ \| \mathbf{y} - \mathbf{X} \mathbf{\beta} \|^2 = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]
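As an illustration, the closed-form expression can be evaluated directly in Julia; this is a sketch assuming the trees data frame loaded above and the model volume ~ girth + height:

```julia
# Design matrix X = (1, girth, height) and response vector y from the trees data
X = hcat(ones(length(trees.Volume)), trees.Girth, trees.Height)
y = trees.Volume

β̂ = (X' * X) \ (X' * y)   # solves the normal equations (X'X) β = X'y
# Equivalent and numerically preferable: β̂ = X \ y (QR factorization)
```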
Using the GLM package, the same model can be fitted via
linmod2 = lm(@formula(Volume ~ Girth + Height), trees)
The interpretation of the Julia output is similar to the simple linear regression model, but we provide explicit formulas now:
parameter estimates:
\[ (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]
estimated standard errors:
\[ \hat s_{\hat \beta_i} = \sqrt{([\mathbf{X}^\top \mathbf{X}]^{-1})_{ii} \, \frac{1}{n-p-1} \, \|\mathbf{y} - \mathbf{X} \hat \beta\|^2} \]
\(t\)-statistics:
\[ t_i = \frac{\hat \beta_i}{\hat s_{\hat \beta_i}}, \qquad i=0,\ldots,p. \]
\(p\)-values:
\[ p\text{-value} = \mathbb{P}(|T| > t_i), \quad \text{where } T \sim t_{n-p-1} \]
- Implement functions that estimate the \(\beta\)-parameters, the corresponding standard errors and the \(t\)-statistics.
- Test your functions with the trees data set and try to reproduce the output above.
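One possible sketch of such functions, following the formulas above (the function names are chosen here for illustration only):

```julia
using Distributions, LinearAlgebra

# Least squares estimate β̂ = (X'X)⁻¹ X'y
estimate_beta(X, y) = (X' * X) \ (X' * y)

# Estimated standard errors ŝ_{β̂ᵢ} based on the residual variance
function estimate_stderrors(X, y)
    n, k = size(X)                        # k = p + 1 columns including the intercept
    β̂ = estimate_beta(X, y)
    σ̂² = sum(abs2, y - X * β̂) / (n - k)   # ‖y - Xβ̂‖² / (n - p - 1)
    return sqrt.(diag(inv(X' * X)) .* σ̂²)
end

# t-statistics and two-sided p-values based on the t_{n-p-1} distribution
function t_statistics(X, y)
    n, k = size(X)
    t = estimate_beta(X, y) ./ estimate_stderrors(X, y)
    pvals = 2 .* ccdf.(TDist(n - k), abs.(t))
    return t, pvals
end

# Test on the trees data (design matrix as above)
X = hcat(ones(length(trees.Volume)), trees.Girth, trees.Height)
t_statistics(X, trees.Volume)
```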
Which model is the best? For linear models, one often uses the \(R^2\) characteristic. Roughly speaking, it gives the proportion (between 0 and 1) of the variance that can be explained by the linear model.
r2(linmod1)
r2(linmod2)
linmod3 = lm(@formula(Volume ~ Girth + Height + Girth*Height), trees)
r2(linmod3)
The more covariates you add, the more variance can be explained by the linear model, so \(R^2\) increases. In order to balance goodness of fit of a model against its complexity, information criteria such as aic are considered.
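For example, assuming the StatsBase package is available (it provides the aic function for fitted models), the three models can be compared as follows, with smaller values indicating a better trade-off:

```julia
using StatsBase   # provides aic for fitted statistical models

aic(linmod1), aic(linmod2), aic(linmod3)
```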
Generalized Linear Models
Classical linear model
\[ \mathbf{y} = \mathbf{X} \beta + \varepsilon \]
implies that \[ \mathbf{y} \mid \mathbf{X} \sim \mathcal{N}(\mathbf{X} \mathbf{\beta}, \sigma^2\mathbf{I}).\]
In particular, the conditional expectation satisfies \(\mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \mathbf{X} \beta\).
However, the assumption of a normal distribution is unrealistic for non-continuous data. Popular alternatives include:
for count data: \[ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Poisson}(\exp(\mathbf{X}\mathbf{\beta})) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \exp(\mathbf{X} \beta) \]
Here, the components are considered to be independent and the exponential function is applied componentwise.
for binary data: \[ \mathbf{y} \mid \mathbf{X} \sim \mathrm{Bernoulli}\left( \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \right) \qquad \leadsto \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = \frac{\exp(\mathbf{X}\mathbf{\beta})}{1 + \exp(\mathbf{X}\mathbf{\beta})} \]
Again, the components are considered to be independent and all the operations are applied componentwise.
All these models are defined by the choice of a family of distributions and a function \(g\) (the so-called link function) such that
\[ \mathbb{E}(\mathbf{y} \mid \mathbf{X}) = g^{-1}(\mathbf{X} \beta). \]
For the models above, these are:
| Type of Data | Distribution Family | Link Function |
|---|---|---|
| continuous | Normal | identity: \( g(x)=x \) |
| count | Poisson | log: \( g(x) = \log(x) \) |
| binary | Bernoulli | logit: \( g(x) = \log\left( \frac{x}{1-x} \right) \) |
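In GLM.jl, these combinations correspond to the distribution family and link arguments of glm. A sketch with a small artificial data set, used here only to illustrate the calls:

```julia
using DataFrames, GLM

# Tiny artificial data set (values chosen only for illustration)
df = DataFrame(x = [0.1, 0.5, 0.9, 1.3, 1.7, 2.1],
               y_cont  = [0.2, 1.1, 1.9, 2.4, 3.6, 4.1],   # continuous response
               y_count = [0, 1, 1, 2, 4, 5],                # count response
               y_bin   = [0, 0, 1, 0, 1, 1])                # binary response

glm(@formula(y_cont ~ x), df, Normal(), IdentityLink())   # classical linear model
glm(@formula(y_count ~ x), df, Poisson(), LogLink())      # Poisson regression
glm(@formula(y_bin ~ x), df, Bernoulli(), LogitLink())    # logistic regression
```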
In general, the parameter vector \(\beta\) is estimated via maximizing the likelihood, i.e.,
\[ \hat \beta = \mathrm{argmax} \prod_{i=1}^n f(y_i \mid \mathbf{X}_{i \cdot}), \]
which is equivalent to the maximization of the log-likelihood, i.e.,
\[ \hat \beta = \mathrm{argmax} \sum_{i=1}^n \log f(y_i \mid \mathbf{X}_{i \cdot}). \]
In the Gaussian case, the maximum likelihood estimator is identical to the least squares estimator considered above.
using CSV
using DataFrames
using HTTP
http_response = HTTP.get("https://vincentarelbundock.github.io/Rdatasets/csv/AER/SwissLabor.csv")
SwissLabor = DataFrame(CSV.File(http_response.body))
SwissLabor[!, "participation"] .= (SwissLabor.participation .== "yes")
model = glm(@formula(participation ~ age^2), SwissLabor, Binomial(), ProbitLink())
- Reproduce the results of our data analysis of the trees data set using a generalized linear model with normal distribution family.
- Generate \(n=20\) random covariates \(\mathbf{x}\) and Poisson-distributed count data with parameters \(\beta_0 + \beta_1 x_i\). Re-estimate the parameters by a generalized linear model.
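A sketch for the second item, assuming (as in the Poisson model above) that the mean of the counts is \(\exp(\beta_0 + \beta_1 x_i)\) and using example values for the true parameters:

```julia
using DataFrames, Distributions, GLM, Random

Random.seed!(1)
n = 20
x = rand(n)                                    # random covariates
β0, β1 = 0.5, 1.5                              # example values for the true parameters
y = rand.(Poisson.(exp.(β0 .+ β1 .* x)))       # Poisson counts with mean exp(β0 + β1 xᵢ)

sim = DataFrame(x = x, y = y)
glm(@formula(y ~ x), sim, Poisson(), LogLink())   # re-estimate β0 and β1
```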
Outlook: Linear Mixed Models
In the linear regression models so far, we assumed that the response variable \(\mathbf{y}\) equals the design matrix of covariates \(\mathbf{X}\), which is assumed to be given/fixed, multiplied by the so-called fixed-effects coefficients \(\beta\), plus independent errors \(\varepsilon\). However, in many situations there are also random effects acting on several components of the response variable. These can be included in the model by adding another design matrix \(\mathbf{Z}\) multiplied by a random vector \(u\), the so-called random-effects coefficients, which are assumed to be jointly normally distributed with mean vector \(0\) and variance-covariance matrix \(\Sigma\) (typically not a diagonal matrix). In matrix notation, the model has the following form:
\[ \mathbf{y} = \mathbf{X} \beta + \mathbf{Z}u + \varepsilon \]
Maximizing the likelihood, we can estimate \(\beta\) and optimally predict the random vector \(u\).
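In Julia, such models can be fitted with the MixedModels package. A minimal sketch, using the sleepstudy data that ships with that package rather than the trees data:

```julia
using MixedModels

# Reaction time explained by a fixed effect of days plus a random intercept
# and a random slope for days per subject
sleep = MixedModels.dataset(:sleepstudy)
fit(MixedModel, @formula(reaction ~ 1 + days + (1 + days | subj)), sleep)
```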