The Four Assumptions of Linear Regression (And What Happens When You Violate Them)
Linear regression is one of the most powerful tools in applied statistics — but only when its assumptions hold. Here is how to check them, and what to do when they don't.
Linear regression is one of the oldest and most widely used tools in applied statistics. It is also one of the most frequently misused. The misuse is rarely in the fitting, which modern software makes trivial. It is in the interpretation. A regression model produces coefficients, standard errors, and p-values whose validity depends on certain assumptions. When those assumptions are violated, the software still reports the numbers, but they can be badly misleading.
The Four Classical Assumptions (LINE)
- Linearity: The relationship between predictors and the outcome is linear. Check with residual vs. fitted plots.
- Independence: Observations are independent of each other. Violated by time series data, clustered data, or repeated measures.
- Normality: The residuals are approximately normally distributed. Check with a Q-Q plot. Less critical for large samples (CLT).
- Equal variance (Homoscedasticity): The variance of residuals is constant across all fitted values. Check with a scale-location plot.
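A residual vs. fitted plot is built from two lists: the fitted values and the residuals. The sketch below is a minimal, standard-library illustration for a single predictor, not a substitute for a real plotting workflow. The helper names (`fit_line`, `fitted_and_residuals`, `mean_residual_by_third`) and the simulated data are assumptions made for this example. Instead of drawing the plot, it averages the residuals in the low, middle, and high thirds of the fitted values, which is a rough numeric stand-in for "is there a curve in the residuals?".

```python
import random
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def fitted_and_residuals(x, y):
    """Return (fitted, residual) pairs: the raw material for a residual vs. fitted plot."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    return [(f, yi - f) for f, yi in zip(fitted, y)]

def mean_residual_by_third(pairs):
    """Average residual in the low, middle, and high thirds of the fitted values."""
    ordered = sorted(pairs)  # tuples sort by fitted value first
    k = len(ordered) // 3
    groups = [ordered[:k], ordered[k:-k], ordered[-k:]]
    return [fmean(r for _, r in g) for g in groups]

# Simulated data (hypothetical): one truly linear outcome, one with a quadratic term.
rng = random.Random(0)
x = list(range(50))
noise = [rng.gauss(0, 1) for _ in x]
y_linear = [2 + 3 * xi + e for xi, e in zip(x, noise)]
y_curved = [2 + 3 * xi + 0.5 * xi ** 2 + e for xi, e in zip(x, noise)]

thirds_linear = mean_residual_by_third(fitted_and_residuals(x, y_linear))
thirds_curved = mean_residual_by_third(fitted_and_residuals(x, y_curved))

print("linear:", [round(m, 2) for m in thirds_linear])
print("curved:", [round(m, 2) for m in thirds_curved])
```

When the linear model is adequate, all three averages should sit close to zero. For the curved outcome, the outer thirds come out positive and the middle third negative, which is the U-shape a residual vs. fitted plot would show.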
| Assumption | Diagnostic | Remedy if Violated |
|---|---|---|
| Linearity | Residual vs. fitted plot | Add polynomial terms or use a non-linear model |
| Independence | Durbin-Watson test, ACF plot | Use mixed models, GEE, or time-series models |
| Normality | Q-Q plot, Shapiro-Wilk test | Transform outcome, use robust regression |
| Homoscedasticity | Scale-location plot, Breusch-Pagan test | Transform outcome, use WLS or robust standard errors |
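As one concrete instance of the diagnostics in the table, here is a hedged sketch of a Breusch-Pagan style check for a single predictor. It computes the studentized (Koenker) form of the statistic, n times the R-squared from regressing the squared residuals on the predictor, and compares it with 3.841, the 5% critical value of a chi-squared distribution with one degree of freedom. The function names and the simulated data are assumptions for illustration. In real work a statistics library's implementation would be the usual choice.

```python
import random
from statistics import fmean

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """Share of the variance in y explained by a straight line in x."""
    intercept, slope = fit_line(x, y)
    my = fmean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def breusch_pagan_lm(x, y):
    """Studentized Breusch-Pagan statistic: n * R^2 of squared residuals regressed on x."""
    intercept, slope = fit_line(x, y)
    sq_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    return len(x) * r_squared(x, sq_resid)

CHI2_1DF_5PCT = 3.841  # 5% critical value, chi-squared with 1 degree of freedom

# Simulated data (hypothetical): constant noise vs. noise that grows with x.
rng = random.Random(1)
x = list(range(1, 201))
y_const = [1 + 2 * xi + rng.gauss(0, 10) for xi in x]
y_fan = [1 + 2 * xi + rng.gauss(0, 0.1 * xi) for xi in x]

lm_const = breusch_pagan_lm(x, y_const)
lm_fan = breusch_pagan_lm(x, y_fan)

print(f"constant variance: LM = {lm_const:.2f}")
print(f"fanning variance:  LM = {lm_fan:.2f} (reject at 5% if above {CHI2_1DF_5PCT})")
```

A statistic above the critical value is evidence against constant variance. It says nothing about which remedy from the table is appropriate. That still depends on why the variance changes.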
"R-squared tells you how much variance your model explains. It tells you nothing about whether your model is correctly specified."
The good news is that linear regression is fairly robust to mild violations of these assumptions, especially with large samples. The bad news is that severe violations can do real damage. Non-linearity biases the coefficients themselves. Heteroscedasticity leaves the coefficients unbiased but makes the usual standard errors, and therefore the p-values and confidence intervals, unreliable. Building the habit of running diagnostics after every regression is one of the highest-return practices in applied statistics.

