Bootstrapping (statistics)

In statistics, bootstrapping is a modern, computer-intensive, general-purpose approach to statistical inference that belongs to the broader class of resampling methods.

Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples of the observed dataset (and of equal size to the observed dataset), each of which is obtained by random sampling with replacement from the original dataset.
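As a minimal sketch of the resampling step just described, the following Python fragment estimates the standard error of a sample median. The data values and the function name `bootstrap_replicates` are invented for illustration, and the observations are assumed to be independent and identically distributed.

```python
import random
import statistics


def bootstrap_replicates(data, statistic, n_resamples=2000, seed=0):
    """Evaluate `statistic` on resamples drawn from the empirical distribution."""
    rng = random.Random(seed)
    n = len(data)
    # Each resample has the same size as the data and is drawn with replacement.
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]


# Hypothetical observations, made up for this illustration only.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]

replicates = bootstrap_replicates(data, statistics.median)
# The spread of the replicates approximates the standard error of the median.
standard_error = statistics.stdev(replicates)
print(f"bootstrap standard error of the median: {standard_error:.3f}")
```

Any other statistic that accepts a list of numbers could be passed in place of `statistics.median`.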

It may also be used for constructing hypothesis tests. It is often used as an alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.

The advantage of bootstrapping over analytical methods is its simplicity: it is straightforward to apply the bootstrap to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients.

The disadvantage of bootstrapping is that while (under some conditions) it is asymptotically consistent, it does not provide general finite sample guarantees, and has a tendency to be overly optimistic. The apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples) where these would be more formally stated in other approaches.

Example: Fisher's iris data

To introduce the basic ideas and the value of the method, Fisher's famous Iris flower data set will be used, where only the species "virginica" and "versicolor" are considered. The analysis was performed in R.

The species can be modelled by logistic regression as a function of sepal length (that is, the other variables available are ignored). Fitting the logistic regression model by maximum likelihood gives the following parameter estimates and their standard errors:

It is known that maximum likelihood estimators are asymptotically normally distributed. We can check this assumption using a bootstrap procedure as follows:

1. Sample n observations with replacement from the original data, where n is the number of observations.
2. Fit the logistic regression model by maximum likelihood.
3. Repeat this bootstrap sampling many times (B rounds).
4. Use the distribution of the estimates thus computed as an approximation to the 'true' population sampling distribution.

The plot below contains kernel density plots of the two parameters in the model, as estimated from 10 000 bootstrap samples.

The distributions of the parameter estimates are clearly not normal. That is, the asymptotic assumptions about the maximum likelihood estimates cannot be relied on, and quantities such as confidence intervals and hypothesis tests that rely on those assumptions will be suspect.

One way to estimate confidence intervals from bootstrap samples is to take the $\alpha$ and $1-\alpha$ quantiles of the estimated values (for a 95% interval, $\alpha = 0.025$). These are called bootstrap percentile intervals. In this case, for the intercept and for sepal length, the bootstrap 95% percentile intervals are (-20.02, -7.08) and (1.26, 3.20) respectively. These can be contrasted with the asymptotic intervals derived from the maximum likelihood estimates plus or minus 1.96 standard errors: (-18.26, -6.87) and (1.10, 2.93). The intervals from the asymptotic theory are apparently too narrow (as well as being symmetric).
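The contrast between a percentile interval and a symmetric normal-theory interval can be sketched in Python. This is not the iris analysis: the sample is a made-up, right-skewed data set, the statistic is the mean rather than a logistic regression coefficient, and the symmetric interval uses the bootstrap standard error as a stand-in for a maximum likelihood standard error. The quantile rule is a simple order-statistic convention; other conventions differ slightly.

```python
import math
import random
import statistics


def bootstrap_means(data, n_resamples=4000, seed=0):
    """Bootstrap replicates of the sample mean (resampling with replacement)."""
    rng = random.Random(seed)
    n = len(data)
    return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]


def percentile_interval(replicates, alpha=0.025):
    """Take the alpha and 1 - alpha empirical quantiles of the replicates."""
    ordered = sorted(replicates)
    b = len(ordered)
    lower = ordered[math.floor(alpha * b)]
    upper = ordered[math.ceil((1 - alpha) * b) - 1]
    return lower, upper


def normal_interval(estimate, replicates, z=1.96):
    """Symmetric interval: estimate plus or minus z standard errors."""
    se = statistics.stdev(replicates)
    return estimate - z * se, estimate + z * se


# Hypothetical right-skewed sample, invented for illustration.
data = [0.2, 0.4, 0.5, 0.7, 0.9, 1.1, 1.4, 1.8, 2.6, 3.9, 6.3, 9.8]
estimate = sum(data) / len(data)
replicates = bootstrap_means(data)

pct = percentile_interval(replicates)
nrm = normal_interval(estimate, replicates)
print("percentile interval:", pct)
print("normal interval:    ", nrm)
```

The normal interval is symmetric about the estimate by construction, whereas the percentile interval is free to follow any skewness in the bootstrap distribution.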

This simple bootstrap method is not the only way of making improved inferences over the asymptotic approach. Other bootstrap schemes are available, as are approaches based on likelihood or Bayesian considerations. (In fact, the simple bootstrap scheme used here can be quite easily criticized.)

There are more complicated bootstraps for sampling without replacement, two-sample problems, regression, time series, hierarchical sampling, mediation analyses, and other statistical problems.

How many bootstrap samples is enough?

In the example with Fisher's iris data, 10 000 bootstrap samples were used. However, no explanation was given for this number. It seems that the number of bootstrap samples recommended in the literature has increased as available computing power has increased. Whereas a few years ago, 10 000 samples would have seemed excessive, the above example ran in just a few minutes.

As a general guideline, 1000 samples is often enough for a first look. However, if the results really matter, as many samples as is reasonable given available computing power and time should be used.
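One rough way to judge whether a given number of resamples is enough is to repeat the whole bootstrap several times with different random seeds and see how much the answer moves. The Python sketch below does this for the standard error of a mean; the data, the candidate values of B, and the helper names are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_se(data, n_resamples, seed):
    """Bootstrap standard error of the mean from n_resamples resamples."""
    rng = random.Random(seed)
    n = len(data)
    means = [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]
    return statistics.stdev(means)


def spread_of_se(data, n_resamples, n_repeats=20):
    """Standard deviation of the SE estimate across independent reruns."""
    estimates = [bootstrap_se(data, n_resamples, seed) for seed in range(n_repeats)]
    return statistics.stdev(estimates)


# Hypothetical observations, invented for illustration.
data = [4.1, 5.3, 2.8, 6.0, 3.7, 5.1, 4.4, 7.2, 3.9, 4.8,
        5.6, 2.5, 6.4, 4.0, 5.9, 3.3, 4.7, 5.2, 6.8, 3.6]

spread = {b: spread_of_se(data, b) for b in (50, 200, 1000)}
for b, s in spread.items():
    print(f"B = {b:5d}: rerun-to-rerun spread of the SE estimate = {s:.4f}")
```

If the spread across reruns is small relative to the precision required, the chosen B is probably adequate for that purpose.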

Types of bootstrap scheme

In univariate problems, it is usually acceptable to resample the individual observations with replacement. However, in small samples, a parametric bootstrap approach might be preferred, and for some problems a "smooth bootstrap" will likely be preferred.

For regression problems, various other alternatives are available.

Smooth bootstrap

Under this scheme, a small amount of (usually normally distributed) zero-centered random noise is added on to each resampled observation. This is equivalent to sampling from a kernel density estimate of the data.
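A minimal Python sketch of this scheme follows. The data are invented, and the noise standard deviation (`bandwidth`) is set to $1/\sqrt{n}$ only because that is the value used in the speed-of-light example later in this article; other choices are possible.

```python
import math
import random
import statistics


def median_replicates(data, n_resamples=1000, bandwidth=0.0, seed=0):
    """Bootstrap the median; if bandwidth > 0, jitter every resampled value."""
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=n)
        if bandwidth > 0:
            # Zero-centred normal noise added to each resampled observation.
            resample = [x + rng.gauss(0.0, bandwidth) for x in resample]
        replicates.append(statistics.median(resample))
    return replicates


# Hypothetical observations recorded as whole numbers.
data = [24, 25, 26, 26, 27, 27, 28, 28, 29, 30, 32]

plain = median_replicates(data)
smooth = median_replicates(data, bandwidth=1 / math.sqrt(len(data)))
print("distinct medians, plain bootstrap: ", len(set(plain)))
print("distinct medians, smooth bootstrap:", len(set(smooth)))
```

The plain bootstrap can only return medians that are values in the data, while the smoothed version spreads the replicates over a continuum.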

Parametric bootstrap

In this case, a parametric model is fit to the data, often by maximum likelihood, and samples of random numbers are drawn from this parametric model. Then, the quantity, or estimate, of interest is calculated from these samples.
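The following Python sketch illustrates the idea under the assumption that a normal model is appropriate. The data and function name are hypothetical; the model is fitted by maximum likelihood (sample mean, and standard deviation with divisor n) and the statistic of interest is the median.

```python
import random
import statistics


def parametric_bootstrap(data, statistic, n_resamples=2000, seed=0):
    """Simulate from a fitted normal model and recompute the statistic."""
    rng = random.Random(seed)
    n = len(data)
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # maximum likelihood estimate (divisor n)
    replicates = []
    for _ in range(n_resamples):
        # Draw a synthetic sample from the fitted model, not from the data.
        simulated = [rng.gauss(mu, sigma) for _ in range(n)]
        replicates.append(statistic(simulated))
    return replicates


# Hypothetical small sample, invented for illustration.
data = [9.8, 10.4, 11.1, 9.5, 10.9, 10.2, 9.9, 10.7]

replicates = parametric_bootstrap(data, statistics.median)
print(f"parametric bootstrap SE of the median: {statistics.stdev(replicates):.3f}")
```

Unlike the nonparametric scheme, the simulated samples can contain values never observed in the data, which is why the approach depends on the fitted model being reasonable.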

Case resampling

In regression problems, "case resampling" refers to the simple scheme of resampling individual cases, which are often rows of a data set. For regression problems, so long as the data set is fairly large, this simple scheme is often acceptable (and this is the method used in the iris example above). However, the method is open to criticism.

In regression problems, the explanatory variables are often fixed, or at least observed with more control than the response variable. Also, the range of the explanatory variables defines the information available from them. Therefore, to resample cases means that each bootstrap sample will lose some information. As such, alternative bootstrap procedures should be considered.
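Case resampling can be sketched in Python as follows. For brevity the sketch uses a simple least-squares line rather than the logistic regression of the iris example, and the data and helper names (`fit_line`, `case_resample_slopes`) are invented for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def case_resample_slopes(xs, ys, n_resamples=2000, seed=0):
    """Resample whole (x, y) cases with replacement and refit each time."""
    rng = random.Random(seed)
    cases = list(zip(xs, ys))
    slopes = []
    while len(slopes) < n_resamples:
        bx, by = zip(*rng.choices(cases, k=len(cases)))
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is undefined
        slopes.append(fit_line(bx, by)[1])
    return slopes


# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.8, 7.2, 9.1, 10.8, 13.3, 14.9, 17.2, 19.1, 20.7]

slope_hat = fit_line(xs, ys)[1]
slopes = case_resample_slopes(xs, ys)
print(f"slope = {slope_hat:.3f}, case-resampling SE = {statistics.stdev(slopes):.3f}")
```

Each resample typically omits some of the original x values and repeats others, which is the loss of information about the explanatory variables discussed above.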

Resampling residuals

Another approach to bootstrapping in regression problems is to resample residuals. The method proceeds as follows.

1. Fit the model and retain the fitted values $\hat{y}_i$ and the residual errors $\epsilon_i = y_i - \hat{y}_i$, $(i = 1, \dots, n)$.
2. For each pair $(x_i, y_i)$, in which $x_i$ is the (possibly multivariate) explanatory variable, add a randomly resampled residual error, $\epsilon_j$, to the fitted value $\hat{y}_i$. In other words, create synthetic response variables $y^*_i = \hat{y}_i + \epsilon_j$, where $j$ is selected randomly from the list $(1, \dots, n)$ for every $i$.
3. Refit the model using the synthetic response variables $y^*_i$, and retain the quantities of interest (often the parameters, $\hat{\mu}^*_i$, estimated from the synthetic $y^*_i$).
4. Repeat steps 2 and 3 many times.

This scheme has the advantage that it retains the information in the explanatory variables. However, a question arises as to which residuals to resample. Raw residuals are one option; another is studentized residuals (in linear regression). Whilst there are arguments in favour of using studentized residuals, in practice it often makes little difference, and it is easy to run both schemes and compare the results.
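A minimal Python sketch of residual resampling for a simple least-squares line is given below. It uses raw residuals, adds each resampled residual to a fitted value, and keeps the explanatory variable fixed; the data and helper names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_bootstrap_slopes(xs, ys, n_resamples=2000, seed=0):
    """Keep xs fixed; build synthetic responses from fitted values plus residuals."""
    rng = random.Random(seed)
    intercept, slope = fit_line(xs, ys)
    fitted = [intercept + slope * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]  # raw residuals
    slopes = []
    for _ in range(n_resamples):
        drawn = rng.choices(residuals, k=len(residuals))
        y_star = [f + e for f, e in zip(fitted, drawn)]
        slopes.append(fit_line(xs, y_star)[1])
    return slopes


# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.8, 7.2, 9.1, 10.8, 13.3, 14.9, 17.2, 19.1, 20.7]

slope_hat = fit_line(xs, ys)[1]
slopes = residual_bootstrap_slopes(xs, ys)
print(f"slope = {slope_hat:.3f}, residual-bootstrap SE = {statistics.stdev(slopes):.3f}")
```

Swapping in studentized residuals would only require changing how the `residuals` list is computed.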

Wild bootstrap

This is similar to resampling residuals, except that each residual stays attached to its own observation and is randomly multiplied by 1 or -1 with equal probability (more generally, by a random variable with mean 0 and variance 1). With random signs, this method assumes that the 'true' residual distribution is symmetric, and it can offer advantages over simple residual sampling for smaller sample sizes.
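The Python sketch below illustrates a wild bootstrap for a simple least-squares line. It keeps each residual attached to its own fitted value and only flips its sign at random (Rademacher weights), which is the formulation usually given in the literature. The data and helper names are invented for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def wild_response(fitted, residuals, rng):
    """Synthetic responses: fitted value plus own residual times a random sign."""
    return [f + e * rng.choice((-1, 1)) for f, e in zip(fitted, residuals)]


# Hypothetical data, invented for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.8, 7.2, 9.1, 10.8, 13.3, 14.9, 17.2, 19.1, 20.7]

intercept, slope_hat = fit_line(xs, ys)
fitted = [intercept + slope_hat * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]

rng = random.Random(0)
slopes = [fit_line(xs, wild_response(fitted, residuals, rng))[1] for _ in range(2000)]
print(f"slope = {slope_hat:.3f}, wild-bootstrap SE = {statistics.stdev(slopes):.3f}")
```

Because the magnitude of each residual stays with its observation, the scheme does not mix residuals from regions of the data with different variability.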

Choice of statistic - pivoting

In situations where it is essential to extract as much information as possible from a data-set, consideration needs to be given to exactly what estimate or statistic should be the subject of the bootstrapping. Suppose inference is required about the mean of some observations. Then two possibilities are:
* generate bootstrap samples of the sample mean to construct a confidence interval for the mean;
* generate bootstrap samples of a new statistic (the mean divided by the sample standard deviation), construct a confidence interval for this, then derive the final confidence interval for the mean by multiplying the end-points of the initial interval by the sample standard deviation of the original sample.

The results will be different, and simulation results suggest that the second approach is better. The approach may derive partly from the standard parametric approach for Normal distributions, but is rather more general. The idea is to try to make use of a pivotal quantity, or to find a derived statistic that is approximately pivotal. See also ancillary statistic.

Example applications

Application to testing for mediation

Bootstrapping is becoming the most popular method of testing mediation [http://www.comm.ohio-state.edu/ahayes/sobel.htm] [http://www.psych.ku.edu/preacher/sobel/sobel.htm] because it does not require the normality assumption to be met, and because it can be effectively utilized with smaller sample sizes (N < 20). However, mediation continues to be (perhaps inappropriately) most frequently determined using (1) the logic of Baron and Kenny [http://davidakenny.net/cm/mediate.htm] or (2) the Sobel test.

Example: smoothed bootstrap

Newcomb's speed of light data are used in the book "Bayesian Data Analysis" by Gelman et al. and can be found via the classic data sets page. Some analysis of these data appears on the robust statistics page.

The data set contains two obvious outliers so that, as an estimate of location, the median is to be preferred over the mean. Bootstrapping is a method often employed for estimating confidence intervals for medians. However, the median is a discrete statistic, and this fact shows up in the bootstrap distribution.

In order to smooth over the discreteness of the median, we can add a small amount of $N(0, \sigma^2)$ random noise to each bootstrap sample. We choose $\sigma = 1/\sqrt{n}$ for sample size $n$.

Histograms of the bootstrap distribution and the smooth bootstrap distribution appear below. The bootstrap distribution is very jagged because there are only a small number of values that the median can take. The smoothed bootstrap distribution overcomes this jaggedness.

Although the bootstrap distribution of the median looks ugly and intuitively wrong, confidence intervals from it are not bad in this example. The simple 95% percentile interval is (26, 28.5) for the simple bootstrap and (25.98, 28.46) for the smoothed bootstrap.

Relationship to other resampling methods

The bootstrap is distinguished from:
* the jackknife procedure, used to estimate biases of sample statistics and to estimate variances, and
* cross-validation, used when the outcome of the basic analysis is the result of a search for the best of many possibilities, with the judgement being based on the sample of data available.
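For comparison with the bootstrap, the jackknife estimate of a standard error can be sketched in Python as follows. Instead of random resampling, it recomputes the statistic on each of the n leave-one-out subsets. The data and the function name `jackknife_se` are hypothetical.

```python
import math
import statistics


def jackknife_se(data, statistic):
    """Jackknife standard error from the n leave-one-out estimates."""
    n = len(data)
    # Recompute the statistic with each observation removed in turn.
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    centre = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - centre) ** 2 for t in leave_one_out)
    return math.sqrt(variance)


# Hypothetical observations, invented for illustration.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 3.6]

se_mean = jackknife_se(data, statistics.mean)
print(f"jackknife SE of the mean: {se_mean:.4f}")
print(f"textbook s / sqrt(n):     {statistics.stdev(data) / math.sqrt(len(data)):.4f}")
```

For the sample mean the jackknife standard error coincides with the usual formula $s/\sqrt{n}$, which makes the mean a convenient check on an implementation.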

For more details see bootstrap resampling.

Bootstrap aggregating (bagging) is a meta-algorithm based on averaging the results of multiple bootstrap samples.


Wikimedia Foundation. 2010.
