How Good is the Standard Deviation = Range Divided by Four Rule?
Too lazy to calculate the standard deviation for some variable? A recent blog post suggests it might be a good idea to divide the range of the variable by four and use that as an approximation. In simulations, this heuristic worked well for moderate sample sizes (25-70) drawn from a Beta distribution, with errors under 15%. But if you work with real data, you know that conclusions from simulated data sometimes don't apply. And besides, data are often not Beta distributed (the normal distribution is more common). The graphic below shows how this heuristic pans out for some real, freely available datasets.
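As a rough illustration of what that kind of simulation looks like, here is a minimal sketch (my own reconstruction, not the original post's code; the Beta(2, 5) shape parameters and the number of repetitions are arbitrary assumptions):

```python
# Sketch of the simulation described above: draw Beta samples of moderate
# size and measure the percent error of range/4 against the sample SD.
# Beta(2, 5) and 10,000 repetitions are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)

def pct_error(sample):
    """Percent error of the range/4 estimate relative to the sample SD."""
    estimate = (sample.max() - sample.min()) / 4
    sd = sample.std(ddof=1)
    return 100 * abs(estimate - sd) / sd

for n in (25, 50, 70):
    errors = [pct_error(rng.beta(2, 5, size=n)) for _ in range(10_000)]
    print(f"n = {n}: median percent error {np.median(errors):.1f}%")
```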
The heuristic works great for the first dataset. When there's only a slight skew and a small sample, it still does pretty well, with errors around 3-18% of the standard deviation. This is consistent with the simulations.
However, data are often more strongly skewed, and in that case the heuristic can fall apart. For example, the permeability-of-rocks dataset, which is positively skewed, had 26% error (although the miles-per-gallon of 1974 cars dataset, which was also skewed, had only 3% error). So if there's skew in your data, you don't quite know what you're going to get with this heuristic.
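For real data the calculation is the same: plug the observed minimum, maximum, and standard deviation into the percent-error formula. Since the datasets from the graphic aren't bundled here, the sketch below uses a lognormal sample as a stand-in for a positively skewed variable (the parameters and sample size are arbitrary assumptions):

```python
# Stand-in for a positively skewed variable (not the actual rock-permeability
# data): the long right tail makes the range/4 error hard to predict.
import numpy as np

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=48)  # arbitrary parameters

sd = skewed.std(ddof=1)
estimate = (skewed.max() - skewed.min()) / 4
print(f"sample SD = {sd:.2f}, range/4 = {estimate:.2f}, "
      f"percent error = {100 * abs(estimate - sd) / sd:.0f}%")
```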
What happens for datasets with a larger sample size? One dataset, which has temperatures in Nottingham, England, shows that as long as the data are well bounded (resembling a Beta distribution), the heuristic can do quite well. However, for the speed-of-light dataset, which is normally distributed, the percent error is 42% at a sample size of 100 (not far from the sample sizes that worked for the Beta simulations).
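One way to see why the distribution matters more than the sample size: for normal data the expected range keeps growing with n (at n = 100 it is roughly five standard deviations), while the range of a bounded, Beta-like variable levels off near the width of its support. A quick check, with Beta(2, 2) as an assumed stand-in for well-bounded data:

```python
# Compare the heuristic at the same sample size (n = 100) for unbounded
# normal data versus bounded Beta(2, 2) data. The choice of Beta(2, 2) and
# the number of repetitions are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 10_000

def pct_error(sample):
    estimate = (sample.max() - sample.min()) / 4
    sd = sample.std(ddof=1)
    return 100 * abs(estimate - sd) / sd

normal_err = np.median([pct_error(rng.normal(size=n)) for _ in range(reps)])
beta_err = np.median([pct_error(rng.beta(2, 2, size=n)) for _ in range(reps)])
print(f"median percent error, normal, n={n}:     {normal_err:.0f}%")
print(f"median percent error, Beta(2, 2), n={n}: {beta_err:.0f}%")
```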
The heuristic is completely ruined by the inclusion of outliers. The last few datasets, which were not cleaned of outliers, all had very high errors (35-81%).
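A single extreme point is enough to show why: it stretches the range much faster than it inflates the standard deviation. The values below are made up for illustration, not taken from those datasets:

```python
# One arbitrary extreme reading is enough to wreck the range/4 estimate,
# because the range grows with the outlier much faster than the SD does.
import numpy as np

rng = np.random.default_rng(3)
clean = rng.normal(loc=50, scale=5, size=60)      # made-up "clean" data
with_outlier = np.append(clean, 120.0)            # one made-up outlier

for label, x in (("clean", clean), ("with outlier", with_outlier)):
    sd = x.std(ddof=1)
    estimate = (x.max() - x.min()) / 4
    print(f"{label}: SD = {sd:.1f}, range/4 = {estimate:.1f}, "
          f"percent error = {100 * abs(estimate - sd) / sd:.0f}%")
```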
So the conclusion is: regardless of the sample size, be wary of simple heuristics unless you know they work for the distribution of your variable.