Rules of thumb for minimum sample size for multiple regression

Within the context of a research proposal in the social sciences, I was asked the following question:

I have always gone by 100 + m (where m is the number of predictors) when determining minimum sample size for multiple regression. Is this appropriate?

I get similar questions a lot, often with different rules of thumb. I've also read such rules of thumb quite a lot in various textbooks. I sometimes wonder whether a rule's popularity, in terms of citations, is based on how low it sets the standard. However, I'm also aware of the value of good heuristics in simplifying decision making.

Questions:

asked Apr 28, 2011 at 6:40 by Jeromy Anglim

7 Answers


I'm not a fan of simple formulas for generating minimum sample sizes. At the very least, any formula should consider effect size and the questions of interest. And the difference between either side of a cut-off is minimal.

Sample size as optimisation problem

A Rough Rule of Thumb

In terms of very rough rules of thumb within the typical context of observational psychological studies involving things like ability tests, attitude scales, personality measures, and so forth, I sometimes think of:

These rules of thumb are grounded in the 95% confidence intervals associated with correlations at these respective levels and the degree of precision with which I'd like to understand the relations of interest. However, it is only a heuristic.

G*Power 3

Multiple Regression tests multiple hypotheses

Accuracy in Parameter Estimation

I also like Ken Kelley and colleagues' discussion of Accuracy in Parameter Estimation.

answered Apr 28, 2011 at 16:03 by Jeromy Anglim

I prefer not to think of this as a power issue, but rather to ask "how large should $n$ be so that the apparent $R^2$ can be trusted?" One way to approach that is to consider the ratio or difference between $R^2$ and $R^2_{adj}$, the latter being the adjusted $R^2$ given by $1 - (1 - R^2)\frac{n-1}{n-p-1}$, which forms a less biased estimate of the "true" $R^2$.

Some R code can be used to solve for the factor of $p$ that $n-1$ should be, such that $R^2_{adj}$ is only a factor $k$ smaller than $R^2$ (relative drop) or is smaller than $R^2$ by $k$ (absolute drop).

require(Hmisc)
# dop(k, type) plots the required multiple of p against R^2;
# its definition was lost from this copy of the post.
par(mfrow=c(1,2))
dop(c(.9, .95, .975), 'relative')
dop(c(.075, .05, .04, .025, .02, .01), 'absolute')

[Figure: two-panel plot produced by the code above]

Legend: Multiple of $p$ that $n-1$ must be so that the drop from $R^2$ to $R^2_{adj}$ equals the indicated relative factor (left panel, 3 factors) or absolute difference (right panel, 6 decrements).
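The relationship behind these curves has a closed form: rearranging the adjusted-$R^2$ formula above gives the required multiple of $p$ directly. A minimal Python sketch (my own rearrangement, not the original dop code; the function name is mine):

```python
def n_multiplier(R2, k, mode="relative"):
    """Multiple of p that n - 1 must be so that the adjusted R^2,
    1 - (1 - R2) * (n - 1) / (n - p - 1), equals k * R2 ("relative")
    or R2 - k ("absolute")."""
    if mode == "relative":
        # Solve 1 - (1 - R2) * m / (m - 1) ... for m = (n - 1) / p
        return (1 - k * R2) / (R2 * (1 - k))
    return (1 - R2 + k) / k

# Example: for R^2 = 0.5, keeping adjusted R^2 within 90% of R^2
# requires n - 1 to be about 11 times the number of predictors.
print(n_multiplier(0.5, 0.9))               # relative drop to 0.45
print(n_multiplier(0.5, 0.05, "absolute"))  # absolute drop of 0.05
```

As a sanity check: with $p=10$ predictors this gives $n \approx 111$, and plugging $n=111$, $p=10$, $R^2=0.5$ into the adjusted-$R^2$ formula indeed returns $0.45$.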

If anyone has seen this already in print please let me know.

answered May 15, 2013 at 14:34 by Frank Harrell

+1. I suspect I'm missing something rather fundamental and obvious, but why should we use the ability of $\hat R^2$ to estimate $R^2$ as the criterion? We already have access to $R^2_{adj}$, even if $N$ is low. Is there a way to explain why this is the right way to think about the minimally adequate $N$, beyond the fact that it makes $\hat R^2$ a better estimate of $R^2$?

Commented May 15, 2013 at 14:53

@FrankHarrell: look here: the author seems to be using the plots on pp. 260-263 in much the same way as the ones in your post above.

Commented May 15, 2013 at 17:35

Thanks for the reference. @gung that's a good question. One (weak) answer is that in some types of models we don't have an $R^2_{adj}$, and we also don't have an adjusted index if any variable selection has been done. But the main idea is that if $R^2$ is unbiased, other indexes of predictive discrimination, such as rank correlation measures, are likely to be unbiased also, due to adequacy of the sample size and minimal overfitting.

Commented May 16, 2013 at 13:46

(+1) for what is, in my opinion, a crucial question.

In macro-econometrics you usually have much smaller sample sizes than in micro, financial, or sociological experiments. A researcher feels quite comfortable when one can provide at least feasible estimates. My personal minimal rule of thumb is $4\cdot m$ ($4$ degrees of freedom per estimated parameter). In other applied fields you are usually luckier with data (if it is not too expensive, just collect more data points), and you may ask what the optimal size of a sample is (not just a minimum value). The latter issue comes from the fact that a larger sample of low-quality (noisy) data is not better than a smaller sample of high-quality data.
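For concreteness, the questioner's $100+m$ rule and the $4\cdot m$ floor diverge sharply as $m$ grows; a quick illustrative sketch (function names are mine):

```python
def rule_100_plus_m(m):
    """The questioner's rule: 100 observations plus one per predictor."""
    return 100 + m

def rule_4m(m):
    """The macro-econometrics floor above: 4 observations per parameter."""
    return 4 * m

for m in (3, 10, 33, 100):
    print(m, rule_100_plus_m(m), rule_4m(m))
```

The $4\cdot m$ floor is the smaller of the two until $m$ exceeds about 33, after which it demands the larger sample.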

Most sample-size recommendations are linked to the power of the tests for the hypotheses you are going to test after you fit the multiple regression model.

There is a nice calculator that could be useful for multiple regression models, with some formulas behind the scenes. I think such an a priori calculator could easily be applied by a non-statistician.
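I can't reproduce that calculator here, but the standard computation behind such tools, namely the power of the overall $F$ test via the noncentral $F$ distribution with Cohen's $f^2$ as the effect size, can be sketched with SciPy (the function name and defaults are my own, not the calculator's):

```python
from scipy import stats

def required_n(f2, m, alpha=0.05, power=0.80, max_n=100000):
    """Smallest n giving the target power for the overall F test of a
    multiple regression with m predictors and effect size Cohen's f^2."""
    for n in range(m + 2, max_n):
        v = n - m - 1                        # denominator degrees of freedom
        ncp = f2 * n                         # noncentrality parameter
        f_crit = stats.f.ppf(1 - alpha, m, v)
        if stats.ncf.sf(f_crit, m, v, ncp) >= power:
            return n
    return None

# A "medium" effect (f^2 = 0.15) with 5 predictors needs roughly 90 cases.
print(required_n(0.15, 5))
```

This is the same calculation G*Power 3 performs for the "linear multiple regression: fixed model, $R^2$ deviation from zero" test.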

Probably the K. Kelley and S. E. Maxwell article may be useful for answering the other questions, but I need more time to study the problem first.

answered Apr 28, 2011 at 12:11 by Dmitrij Celov

Commented Sep 7, 2022 at 18:44

Your rule of thumb is not particularly good if $m$ is very large. Take $m=500$: your rule says it's OK to fit $500$ variables with only $600$ observations. I hardly think so!

For multiple regression, you have some theory to suggest a minimum sample size. If you are going to use ordinary least squares, then one of the assumptions you require is that the "true residuals" be independent. Now when you fit a least squares model with $m$ variables, you are imposing $m+1$ linear constraints on your empirical residuals (given by the least squares or "normal" equations). This implies that the empirical residuals are not independent: once we know $n-m-1$ of them, the remaining $m+1$ can be deduced, where $n$ is the sample size. So we have a violation of this assumption. Now the order of the dependence is $O\left(\frac{m+1}{n}\right)$. Hence if you choose $n=k(m+1)$ for some number $k$, then the order is given by $O\left(\frac{1}{k}\right)$. So by choosing $k$, you are choosing how much dependence you are willing to tolerate. I choose $k$ in much the same way as for applying the "central limit theorem": $10$-$20$ is good, and we have the "stats counting" rule $30\equiv\infty$ (i.e. the statistician's counting system is $1,2,\dots,26,27,28,29,\infty$).
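The $m+1$ constraints are easy to see numerically: the normal equations force $X^\top e = 0$ exactly, where $e$ is the vector of empirical residuals. A small NumPy sketch on simulated data of my own:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
# Design matrix: intercept plus m predictors, so m + 1 columns.
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ rng.normal(size=m + 1) + rng.normal(size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# The least squares (normal) equations impose X'e = 0:
# m + 1 exact linear constraints on the n empirical residuals.
print(np.max(np.abs(X.T @ resid)))  # numerically zero
```

So only $n - m - 1$ of the residuals carry free information, which is why the dependence among them is of order $(m+1)/n$.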