Bootstrap and Monte Carlo Simulation

Methods: Bootstrap, Monte Carlo Simulation

I was perplexed by the difference and the relationship between the bootstrap and Monte Carlo simulation. Then I read Comparing Groups: Randomization and Bootstrap Methods Using R, which explains both methods clearly. The book uses simple language to explain intricate statistical concepts and offers concrete examples with well-organized code. It also covers how to present statistical findings effectively in research papers. I found it highly useful for anyone doing statistical analysis in the social sciences. Most of this post draws on the book.

The bootstrap methodology uses Monte Carlo simulation to resample many replicate data sets from a probability model assumed to underlie the population, or from a model that can be estimated from the data. (p.140)

I think both the bootstrap and Monte Carlo simulation are resampling methods. In the example that follows, the Monte Carlo simulation reshuffles the observed values at random (sampling without replacement), while the bootstrap draws with replacement from the empirical distribution of the data.
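To make the distinction concrete, here is a minimal sketch in base R. The vector `x` is made up for illustration and is not data from the book. Shuffling keeps every value exactly once, whereas a bootstrap draw samples with replacement, so some values can repeat and others can drop out.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical values, for illustration only
x <- c(10, 20, 30, 40, 50)

# Random reshuffle: every value appears exactly once, in a random order
perm <- sample(x)

# Bootstrap draw: same length, sampled with replacement
# from the empirical distribution of x
boot_draw <- sample(x, replace = TRUE)

sort(perm)   # always the original values again
boot_draw    # may contain repeated values
```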

Example

The dataset Latino has 150 observations and two columns. Column 1, “Mex”, indicates whether the person is from Mexico, and Column 2, “Achieve”, is the level of the person’s fluency in English. Of the 150 people, 116 are from Mexico and the other 34 are not.

Null hypothesis (H0): People from Mexico and people not from Mexico have the same level of English. In other words, the difference between the two group means is zero.
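The test statistic for this hypothesis is the observed difference in group means. Below is a rough sketch of how it could be computed. Since the original data are not reproduced in this post, `latino_sim` is a simulated stand-in with the same group sizes. The 0/1 coding of `Mex`, the row ordering, and the score distribution are all my assumptions.

```r
set.seed(42)

# Simulated stand-in for the Latino data (hypothetical values):
# 34 rows with Mex = 0 (assumed: not from Mexico),
# followed by 116 rows with Mex = 1 (assumed: from Mexico)
latino_sim <- data.frame(
  Mex = rep(c(0, 1), times = c(34, 116)),
  Achieve = rnorm(150, mean = 60, sd = 15)
)

# Mean of Achieve within each group
group_means <- tapply(latino_sim$Achieve, latino_sim$Mex, mean)

# Observed difference in means (Mex = 0 minus Mex = 1)
obs_diff <- unname(group_means["0"] - group_means["1"])
obs_diff
```

With the real data, this is the quantity that would be compared against the resampled differences.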

Monte Carlo Simulation

Step 1: Resample 4999 times (together with the observed data, this gives 5000 datasets)

permuted <- replicate(n = 4999, expr = sample(latino$Achieve))

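Step 2: For each resampled dataset, calculate the difference in means of the two groups

# Difference in means between the first 34 values and the remaining 116
# (the two group sizes)
mean.diff <- function(data) {
  mean(data[1:34]) - mean(data[35:150])
}
# Each column of permuted is one reshuffled dataset
diffs <- apply(X = permuted, MARGIN = 2, FUN = mean.diff)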

Step 3: Calculate the p-value

# 0.39 is the difference in group means in the original dataset.
# The +1 counts the observed data as one of the 5000 datasets.
(sum(abs(diffs) >= 0.39) + 1) / 5000
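The three steps can also be run end to end without the original file. The sketch below uses simulated scores (the group means of 64 and 58 and the standard deviation of 15 are arbitrary assumptions) and computes the observed difference rather than hard-coding it.

```r
set.seed(123)

# Simulated stand-in scores (hypothetical): 34 values for one group,
# then 116 values for the other
achieve <- c(rnorm(34, mean = 64, sd = 15), rnorm(116, mean = 58, sd = 15))

# Difference in means between the first 34 and the remaining 116 values
mean_diff <- function(y) mean(y[1:34]) - mean(y[35:150])

obs_diff <- mean_diff(achieve)  # observed statistic
n_perm <- 4999                  # number of random reshuffles

# Step 1: each column of `permuted` is one reshuffle of the 150 scores
permuted <- replicate(n = n_perm, expr = sample(achieve))

# Step 2: difference in means for every reshuffled dataset
diffs <- apply(X = permuted, MARGIN = 2, FUN = mean_diff)

# Step 3: two-sided Monte Carlo p-value; the +1 in numerator and
# denominator counts the observed data as one of the datasets
p_value <- (sum(abs(diffs) >= abs(obs_diff)) + 1) / (n_perm + 1)
p_value
```

Because of the +1 terms, the p-value can never be exactly zero. Its smallest possible value here is 1/5000.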

Bootstrap

Step 1: Resample 4999 times, with replacement, from the empirical distribution of the data

Step 2: For each resampled dataset, calculate the difference in means of the two groups

Step 3: Calculate the p-value

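library(boot)
# Statistic for boot(): resample rows via indices, then take the difference
# in means between the first 34 and the remaining 116 resampled rows
mean.diff.np <- function(data, indices) {
  d <- data[indices, ]
  mean(d$Achieve[1:34]) - mean(d$Achieve[35:150])
}
nonpar.boot <- boot(data = latino, statistic = mean.diff.np, R = 4999)
# p-value from the bootstrap replicates stored in nonpar.boot$t
(length(nonpar.boot$t[abs(nonpar.boot$t) >= 0.39]) + 1) / 5000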
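For comparison, here is my own base R sketch of the same idea without the boot package. It is not code from the book, and the scores are simulated with arbitrary means and standard deviation. The only change from a random reshuffle is `replace = TRUE`, so each replicate is drawn with replacement from the pooled scores before being split into groups of 34 and 116.

```r
set.seed(2024)

# Simulated stand-in scores (hypothetical): 34 values, then 116 values
achieve <- c(rnorm(34, mean = 64, sd = 15), rnorm(116, mean = 58, sd = 15))

# Difference in means between the first 34 and the remaining 116 values
mean_diff <- function(y) mean(y[1:34]) - mean(y[35:150])

obs_diff <- mean_diff(achieve)  # observed statistic
n_boot <- 4999                  # number of bootstrap replicates

# Draw 150 scores with replacement from the pooled data, split them
# into groups of 34 and 116, and record the difference in means
boot_diffs <- replicate(
  n_boot,
  mean_diff(sample(achieve, replace = TRUE))
)

# Two-sided p-value, counting the observed data as one replicate
p_value <- (sum(abs(boot_diffs) >= abs(obs_diff)) + 1) / (n_boot + 1)
p_value
```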

References

Zieffler, Harring, Long (2011). Comparing Groups: Randomization and Bootstrap Methods Using R. Wiley.