Resampling-Based Statistics

Motivation: Fitting Models to Data

Situation considered yesterday: We have data and want to fit a model with certain parameters (e.g., a linear model) – we estimate the parameter.

Notation:

  • data: \(\mathbf{x} = (x_1, \ldots, x_n)\)

  • model with unknown parameter \(\theta\)

  • estimate \(\widehat \theta(\mathbf{x})\)

Working example: Fitting a normal distribution

  • data: \(\mathbf{x} = (x_1, \ldots, x_n)\)

  • model: \(\mathcal{N}(\theta, \sigma^2)\), i.e., a normal distribution with unknown mean \(\theta\) (that we want to estimate) and variance \(\sigma^2\) (that we are less interested in)

  • use empirical mean as estimator: \(\widehat \theta(\mathbf{x}) = \overline{x} = \frac 1 n \sum_{i=1}^n x_i\)


using Distributions
using Statistics
using StatsPlots

d = Normal(0.0, 1.0)
n = 100
x = rand(d, n)
ΞΈ = mean(x)

Problem: Estimator never gives the exact result – if you have random data, also the estimate is random.

Aim: Find the distribution (or at least the variance) of the estimator \(\widehat \theta\) in order to get standard errors, confidence intervals, etc.

In some easy examples, you can calculate the distribution of \(\widehat \theta\) theoretically. Example: If \(x_i\) is \(\mathcal{N}(\theta,\sigma^2)\) distributed, then the distribution of \(\widehat \theta(\mathbf{x})\) is \(\mathcal{N}(\theta, \sigma^2/n)\). Strategy: Estimate \(\sigma^2\), e.g. via the sample variance \[ \widehat \sigma^2 = \frac 1 {n-1} \sum_{i=1}^n (x_i - \overline{x})^2 \] and take the standard error, confidence intervals, etc. of the corresponding normal distribution.


Οƒ = std(x)
est_d = Normal(ΞΈ, Οƒ/sqrt(n))
plot(est_d, legend=false)

ci_bounds = quantile(est_d, [0.025,0.975])
vline!(ci_bounds)

Problem: In more complex examples, we cannot calculate the distribution.

The β€˜ideal’ solution: Generate new data

In theory, one would ideally do the following:

  1. Generate new independent data \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) (each sample of size \(n\))
  2. Apply the estimator separately to each sample \(\leadsto\) \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\)
  3. Use the empirical distribution \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\) as a proxy to the theoretical one.
B = 1000
est_vector_new = zeros(B)
for i in 1:B
  x_new = rand(d, n)
  est_vector_new[i] = mean(x_new)
end    
histogram(est_vector_new, legend=false)

ci_bounds_new = quantile(est_vector_new, [0.025, 0.975])
vline!(ci_bounds_new)

But: In most real world situation, we can not generate new data if the distribution is unknown. We have to work with the data we have …

The practical solution: Resampling / Bootstrap

Idea: Use samples \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) that are not completely new, but obtained from resampling the original data \(\mathbf{x}\).

Question: How can one obtained another sample of the same size \(n\)? \(\leadsto\) (re-)sampling with replacement

The overall procedure is as follows:

  1. Generate \(B\) samples \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) of size \(n\) by independently resampling from \(\mathbf{x}\) with replacement.
  2. Apply the estimator separately to each sample \(\leadsto\) \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\)
  3. Use the empirical distribution \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\) as a proxy to the theoretical one.

est_vector_bs = zeros(B)
for i in 1:B
  x_bs = rand(x, n)
  est_vector_bs[i] = mean(x_bs)
end
histogram(est_vector_bs, legend=false)

ci_bounds_bs = quantile(est_vector_bs, [0.025, 0.975])
vline!(ci_bounds_bs) 

If the sample \(\mathbf{x} = (x_1,\ldots,x_n)\) consists of independent and identically distributed data, the resampling procedure often provides a code proxy to the true (unknown) distribution of the estimator.

Note

The above resampling procedure is called bootstrap (from β€˜β€™To pull oneself up by one’s bootstraps.’’) as only data are used that are already available.

Task 1
  1. Reconsider the tree data set and the simple linear regression model Volume ~ Girth. Calculate a 95% confidence interval for \(\beta_1\) via bootstrap and compare to the Julia output of the linear model.
  2. Use bootstrap to estimate the standard error for the predicted volume of a tree with Girth=10 the output above.

Problem: If the data are not independent, the above (i.i.d.) bootstrap samples would have a misspecified dependence structure and therefore lead to a bad uncertainty estimate. For some situations, there are specific modifications of the bootstrap procedure (e.g. block bootstrap for time series), but they tend to work well only if dependence is sufficiently weak.

Parametric Bootstrap

There are situations where it is hardly possible to construct reasonable confidence intervals or estimate the standard error. But one could at least get a rough guess of the uncertainty by the following thought experiment:

Assume that the estimated parameter value \(\theta^*\) would be equal to the true one. How uncertain would an estimate be in that case?

The answer is given by the following procedure, called parametric bootstrap:

  1. Generate independent data \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) (each sample of size \(n\)) from the model with parameter \(\theta^*\)
  2. Apply the estimator separately to each sample \(\leadsto\) \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\)
  3. Use the empirical distribution \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\) as a proxy to the theoretical one.
Task 2
  1. Consider the following function that generate \(n\) correlated samples that are uniformly distributed on \([\mu-0-5,\mu+0.5]\).
myrand = function(mu, n) 
  rho = 0.9
  res = zeros(n)
  res[1] = rand(1)[1]
  if n > 1
    for i in 2:n
      res[i] = rho*res[i-1] .+ (1-rho)*rand(1)[1]
    end
  end
  res .= mu - 0.5 .+ res
  return(res)
end 

The additional parameter rho (between 0 and 1) controls the strength of dependence with 0 meaning independence and 1 meaning full dependence.

Write functions that estimate the standard deviation of the estimated mean via (a) generating new samples from the true unknown distribution, (b) i.i.d. bootstrap, (c) parametric bootstrap 2. Use the functions for different values of rho and compare the results.