Resampling-Based Statistics
Motivation: Fitting Models to Data
Situation considered yesterday: We have data and want to fit a model with certain parameters (e.g., a linear model) \(\leadsto\) we estimate the parameter.
Notation:
data: \(\mathbf{x} = (x_1, \ldots, x_n)\)
model with unknown parameter \(\theta\)
estimate \(\widehat \theta(\mathbf{x})\)
Working example: Fitting a normal distribution
data: \(\mathbf{x} = (x_1, \ldots, x_n)\)
model: \(\mathcal{N}(\theta, \sigma^2)\), i.e., a normal distribution with unknown mean \(\theta\) (that we want to estimate) and variance \(\sigma^2\) (that we are less interested in)
use empirical mean as estimator: \(\widehat \theta(\mathbf{x}) = \overline{x} = \frac 1 n \sum_{i=1}^n x_i\)
```julia
using Distributions
using Statistics
using StatsPlots

d = Normal(0.0, 1.0)   # true distribution
n = 100                # sample size
x = rand(d, n)         # observed data
θ = mean(x)            # estimate of the mean
```
Problem: An estimator never gives the exact result; if the data are random, the estimate is random as well.
Aim: Find the distribution (or at least the variance) of the estimator \(\widehat \theta\) in order to get standard errors, confidence intervals, etc.
In some easy examples, you can calculate the distribution of \(\widehat \theta\) theoretically. Example: If \(x_i\) is \(\mathcal{N}(\theta,\sigma^2)\) distributed, then the distribution of \(\widehat \theta(\mathbf{x})\) is \(\mathcal{N}(\theta, \sigma^2/n)\). Strategy: Estimate \(\sigma^2\), e.g. via the sample variance \[ \widehat \sigma^2 = \frac 1 {n-1} \sum_{i=1}^n (x_i - \overline{x})^2 \] and take the standard error, confidence intervals, etc. of the corresponding normal distribution.
```julia
σ = std(x)                     # estimate of σ via the sample standard deviation
est_d = Normal(θ, σ/sqrt(n))   # approximate distribution of the estimator
plot(est_d, legend=false)
ci_bounds = quantile(est_d, [0.025, 0.975])
vline!(ci_bounds)              # 95% confidence bounds
```
Problem: In more complex examples, we cannot calculate the distribution.
The “ideal” solution: Generate new data
In theory, one would ideally do the following:
- Generate new independent data \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) (each sample of size \(n\))
- Apply the estimator separately to each sample \(\leadsto\) \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\)
- Use the empirical distribution of \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\) as a proxy for the theoretical one.
```julia
B = 1000
est_vector_new = zeros(B)
for i in 1:B
    x_new = rand(d, n)               # draw a completely new sample
    est_vector_new[i] = mean(x_new)  # estimate on the new sample
end
histogram(est_vector_new, legend=false)
ci_bounds_new = quantile(est_vector_new, [0.025, 0.975])
vline!(ci_bounds_new)
```
But: In most real-world situations, we cannot generate new data because the true distribution is unknown. We have to work with the data we have …
The practical solution: Resampling / Bootstrap
Idea: Use samples \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) that are not completely new, but obtained by resampling the original data \(\mathbf{x}\).
Question: How can one obtain another sample of the same size \(n\)? \(\leadsto\) (re-)sampling with replacement
The overall procedure is as follows:
- Generate \(B\) samples \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) of size \(n\) by independently resampling from \(\mathbf{x}\) with replacement.
- Apply the estimator separately to each sample \(\leadsto\) \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\)
- Use the empirical distribution of \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\) as a proxy for the theoretical one.
```julia
est_vector_bs = zeros(B)
for i in 1:B
    x_bs = rand(x, n)                # resample from x with replacement
    est_vector_bs[i] = mean(x_bs)    # estimate on the bootstrap sample
end
histogram(est_vector_bs, legend=false)
ci_bounds_bs = quantile(est_vector_bs, [0.025, 0.975])
vline!(ci_bounds_bs)
```
If the sample \(\mathbf{x} = (x_1,\ldots,x_n)\) consists of independent and identically distributed data, the resampling procedure often provides a good proxy for the true (unknown) distribution of the estimator.
The above resampling procedure is called bootstrap (from “to pull oneself up by one’s bootstraps”), as it uses only data that are already available.
- Reconsider the `trees` data set and the simple linear regression model `Volume ~ Girth`. Calculate a 95% confidence interval for \(\beta_1\) via bootstrap and compare it to the Julia output of the linear model. A sketch of one possible approach is given below.
- Use the bootstrap to estimate the standard error for the predicted volume of a tree with `Girth=10` and compare it to the output above.
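For the first part, a minimal sketch could look as follows, assuming the `trees` data set is loaded via the RDatasets package and the model is fitted with GLM (both package choices and all variable names are illustrative assumptions, not fixed by the exercise):

```julia
using RDatasets, GLM, DataFrames, Statistics

trees = dataset("datasets", "trees")
n_t = nrow(trees)

B = 1000
beta1_bs = zeros(B)
for i in 1:B
    idx = rand(1:n_t, n_t)                      # resample row indices with replacement
    fit_bs = lm(@formula(Volume ~ Girth), trees[idx, :])
    beta1_bs[i] = coef(fit_bs)[2]               # slope coefficient
end
ci_beta1 = quantile(beta1_bs, [0.025, 0.975])   # bootstrap percentile interval
```

The resulting `ci_beta1` can then be compared to `confint` of the model fitted on the full data.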
Problem: If the data are not independent, the above (i.i.d.) bootstrap samples would have a misspecified dependence structure and therefore lead to a bad uncertainty estimate. For some situations, there are specific modifications of the bootstrap procedure (e.g. block bootstrap for time series), but they tend to work well only if dependence is sufficiently weak.
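For illustration, here is a minimal sketch of a moving-block bootstrap for the mean; the block length `l` is a tuning parameter, and the function name and default value are assumptions for illustration only:

```julia
using Statistics

# Moving-block bootstrap: resample overlapping blocks of consecutive
# observations so that short-range dependence is preserved within blocks.
function block_bootstrap_means(x, B; l=10)
    n = length(x)
    est = zeros(B)
    for b in 1:B
        x_bs = Float64[]
        while length(x_bs) < n
            s = rand(1:(n - l + 1))           # random block start
            append!(x_bs, x[s:(s + l - 1)])   # copy one block of length l
        end
        est[b] = mean(x_bs[1:n])              # trim to original sample size
    end
    return est
end
```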
Parametric Bootstrap
There are situations where it is hardly possible to construct reasonable confidence intervals or estimate the standard error. But one could at least get a rough guess of the uncertainty by the following thought experiment:
Assume that the estimated parameter value \(\theta^*\) were equal to the true one. How uncertain would an estimate be in that case?
The answer is given by the following procedure, called parametric bootstrap:
- Generate independent data \(\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(B)}\) (each sample of size \(n\)) from the model with parameter \(\theta^*\)
- Apply the estimator separately to each sample \(\leadsto\) \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\)
- Use the empirical distribution of \(\widehat \theta(\mathbf{x}^{(1)}), \ldots, \widehat \theta(\mathbf{x}^{(B)})\) as a proxy for the theoretical one.
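For the working example above, a parametric bootstrap could look like the following sketch (reusing `x`, `n`, and `B` from before, and using the sample mean and standard deviation as the fitted parameters; variable names are illustrative):

```julia
θ_star = mean(x)                  # fitted parameter θ*
σ_star = std(x)                   # fitted nuisance parameter
d_star = Normal(θ_star, σ_star)   # model with the estimated parameters

est_vector_pb = zeros(B)
for i in 1:B
    x_pb = rand(d_star, n)        # simulate new data from the fitted model
    est_vector_pb[i] = mean(x_pb)
end
histogram(est_vector_pb, legend=false)
ci_bounds_pb = quantile(est_vector_pb, [0.025, 0.975])
vline!(ci_bounds_pb)
```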
- Consider the following function that generates \(n\) correlated samples that are uniformly distributed on \([\mu-0.5, \mu+0.5]\).
```julia
# Generate n correlated samples, uniformly distributed on [mu-0.5, mu+0.5].
# rho is passed as an argument so that different dependence strengths
# can be tried (cf. the description and exercise below).
function myrand(mu, n, rho=0.9)
    res = zeros(n)
    res[1] = rand()
    if n > 1
        for i in 2:n
            res[i] = rho*res[i-1] + (1-rho)*rand()
        end
    end
    res .= mu .- 0.5 .+ res
    return res
end
```
The additional parameter `rho` (between 0 and 1) controls the strength of dependence, with 0 meaning independence and 1 meaning full dependence.
1. Write functions that estimate the standard deviation of the estimated mean via (a) generating new samples from the true (unknown) distribution, (b) i.i.d. bootstrap, and (c) parametric bootstrap; a possible sketch is given below.
2. Use the functions for different values of `rho` and compare the results.
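A possible sketch for the three variants, assuming the parametric bootstrap fits an i.i.d. uniform model \(\mathcal{U}[\hat\mu-0.5, \hat\mu+0.5]\) (which deliberately ignores the dependence); all function names and the number of replications are illustrative assumptions:

```julia
using Distributions, Statistics

# (a) generate completely new samples from the true data-generating process
se_new(mu, n, rho; B=1000) = std([mean(myrand(mu, n, rho)) for _ in 1:B])

# (b) i.i.d. bootstrap: resample the observed data with replacement
se_bootstrap(x; B=1000) = std([mean(rand(x, length(x))) for _ in 1:B])

# (c) parametric bootstrap: simulate from the fitted i.i.d. uniform model
function se_parametric(x; B=1000)
    d_star = Uniform(mean(x) - 0.5, mean(x) + 0.5)
    return std([mean(rand(d_star, length(x))) for _ in 1:B])
end

# compare the three estimates for a given dependence strength
x_dep = myrand(0.0, 100, 0.9)
se_new(0.0, 100, 0.9), se_bootstrap(x_dep), se_parametric(x_dep)
```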