The Four Assumptions of Linear Regression (And What Happens When You Violate Them)

Linear regression is one of the most powerful tools in applied statistics — but only when its assumptions hold. Here is how to check them, and what to do when they don't.

Joshua
Editor-in-Chief, Datum Daily
Feb 28, 2026
10 min read

Linear regression is one of the oldest and most widely used tools in applied statistics. It is also one of the most frequently misused. The misuse is not usually in the fitting — modern software makes that trivial. It is in the interpretation. A regression model produces coefficients, standard errors, and p-values that are only valid if certain assumptions hold. When those assumptions are violated, the numbers are still produced — they are just wrong.

The Four Classical Assumptions (LINE)

  • Linearity: The relationship between predictors and the outcome is linear. Check with residual vs. fitted plots.
  • Independence: Observations are independent of each other. Violated by time series data, clustered data, or repeated measures.
  • Normality: The residuals are approximately normally distributed. Check with a Q-Q plot. Less critical for large samples (CLT).
  • Equal variance (Homoscedasticity): The variance of residuals is constant across all fitted values. Check with a scale-location plot.
| Assumption | Diagnostic | Remedy if Violated |
| --- | --- | --- |
| Linearity | Residual vs. fitted plot | Add polynomial terms or use a non-linear model |
| Independence | Durbin-Watson test, ACF plot | Use mixed models, GEE, or time-series models |
| Normality | Q-Q plot, Shapiro-Wilk test | Transform outcome, use robust regression |
| Homoscedasticity | Scale-location plot, Breusch-Pagan test | Transform outcome, use WLS or robust standard errors |

"R-squared tells you how much variance your model explains. It tells you nothing about whether your model is correctly specified."

The good news is that linear regression is remarkably robust to mild violations of these assumptions, especially with large samples. The bad news is that severe violations — particularly non-linearity and heteroscedasticity — can produce coefficients and standard errors that are substantially wrong. Building the habit of running diagnostics after every regression is one of the highest-return practices in applied statistics.
