Normality of Residuals
In linear regression, a standard assumption is that the error terms are normally distributed. We cannot observe the errors directly, so we check the assumption on the residuals (the differences between the observed and predicted values). The assumption matters most for inference: the exact validity of hypothesis tests and confidence intervals depends on it, especially in small samples. In this section, we will explore the concept of the normality of residuals, its importance, how to check for it, and what to do if the assumption is violated.
What are Residuals?
Residuals are calculated as follows:
$$ \text{Residual} = \text{Observed Value} - \text{Predicted Value} $$
The residuals help us understand how well our model fits the data. Analyzing their distribution can provide insights into potential issues with our regression model.
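As a minimal sketch of the formula above, the snippet below fits a straight line to a small made-up data set and computes the residuals by hand. The data values and variable names are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical toy data: five observations of one predictor and one response.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Residual = observed value - predicted value.
predicted = slope * x + intercept
residuals = y - predicted

print('Residuals:', residuals)
print('Sum of residuals:', residuals.sum())
```

When the model includes an intercept, the least-squares residuals sum to zero up to floating-point error, which is a quick sanity check on the calculation.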
Importance of Normality of Residuals
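1. Statistical Inference: The t-tests for coefficients and the F-test for overall model fit are exact only when the errors are normally distributed. With large samples these tests remain approximately valid, but with small samples non-normal residuals can make p-values and confidence intervals misleading.
2. Model Accuracy: Markedly non-normal residuals (strong skew, heavy tails, or several modes) often indicate that the model is not capturing the underlying data patterns, for example because of a non-linear relationship, an omitted variable, or outliers.
3. Valid Predictions: Non-normal residuals do not by themselves bias least-squares predictions, but prediction intervals and confidence intervals that rely on normality may have the wrong coverage, making the reported uncertainty unreliable.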
Checking Normality of Residuals
There are several methods to check for the normality of residuals:
1. Visual Inspection
- Histogram: A histogram of the residuals should resemble a bell curve, roughly symmetric around zero, if they are normally distributed.
- Q-Q Plot: A Quantile-Quantile plot compares the quantiles of the residuals with the quantiles of a normal distribution. If the points fall approximately along a straight line, the residuals are approximately normal. Systematic curvature, or points that drift away from the line at the ends, suggests skewness or heavy tails.
Example: Creating a Q-Q Plot in Python
```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulated data
np.random.seed(0)
X = np.random.rand(100)
y = 2 * X + np.random.normal(0, 0.1, 100)

# Fit linear regression model (add_constant adds the intercept column)
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# Get residuals
residuals = model.resid

# Q-Q plot; line='s' draws a reference line scaled to the sample
sm.qqplot(residuals, line='s')
plt.title('Q-Q Plot of Residuals')
plt.show()
```
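The histogram check mentioned above can be sketched in a similar way. The helper name `plot_residual_histogram`, the bin count, and the stand-in residuals are assumptions for illustration. The overlaid curve is a normal density with the same mean and standard deviation as the residuals, which gives the eye something to compare the bars against.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_residual_histogram(residuals, bins=20):
    """Draw a density histogram of residuals with a matching normal curve."""
    residuals = np.asarray(residuals, dtype=float)
    mu = residuals.mean()
    sigma = residuals.std(ddof=1)

    fig, ax = plt.subplots()
    # density=True rescales bar heights so the total bar area is 1,
    # which makes the bars comparable to a probability density curve.
    ax.hist(residuals, bins=bins, density=True, alpha=0.6)

    # Normal density with the same mean and standard deviation as the residuals.
    grid = np.linspace(residuals.min(), residuals.max(), 200)
    pdf = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    ax.plot(grid, pdf)

    ax.set_title('Histogram of Residuals')
    ax.set_xlabel('Residual')
    ax.set_ylabel('Density')
    return fig, ax


# Stand-in residuals so the snippet runs on its own (an assumption);
# in practice, pass model.resid from a fitted model instead.
rng = np.random.default_rng(0)
fig, ax = plot_residual_histogram(rng.normal(0, 0.1, 100))
```

Call `plt.show()` afterwards to display the figure. With small samples the shape of a histogram depends heavily on the bin count, so it is best read alongside the Q-Q plot rather than on its own.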
2. Statistical Tests
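- Shapiro-Wilk Test: This test assesses whether the residuals come from a normal distribution. A p-value below the chosen significance level (e.g., 0.05) is evidence that the residuals are not normally distributed. A larger p-value means the test found no evidence against normality, not that normality is proven.
- Kolmogorov-Smirnov Test: This test compares the sample distribution of residuals to a normal distribution. Because the mean and standard deviation of that normal distribution are estimated from the residuals themselves, the standard p-value is only approximate and tends to be too large. A corrected variant such as the Lilliefors test is preferable.
Example: Performing the Shapiro-Wilk Test in Python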
```python
from scipy import stats

# Shapiro-Wilk test on the residuals from the fitted model above
shapiro_test = stats.shapiro(residuals)
print('Shapiro-Wilk Test Statistic:', shapiro_test.statistic)
print('p-value:', shapiro_test.pvalue)
```
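The Kolmogorov-Smirnov test can be run along the same lines. The sketch below is one possible approach, not the only one: it uses `scipy.stats.kstest` on standardized residuals and, for comparison, the `lilliefors` function from `statsmodels.stats.diagnostic`. The stand-in residuals are an assumption so that the snippet runs on its own.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.diagnostic import lilliefors

# Stand-in residuals so the snippet runs on its own (an assumption);
# in practice, use model.resid from the fitted model.
rng = np.random.default_rng(0)
residuals = rng.normal(0, 0.1, 100)

# Standardize so the comparison is against the standard normal N(0, 1).
standardized = (residuals - residuals.mean()) / residuals.std(ddof=1)

# Plain Kolmogorov-Smirnov test. Its p-value assumes the reference
# distribution was fixed in advance, so it tends to be too large when
# the mean and standard deviation are estimated from the same data.
ks_result = stats.kstest(standardized, 'norm')

# Lilliefors variant, which adjusts the p-value for the estimated parameters.
lf_stat, lf_pvalue = lilliefors(residuals, dist='norm')

print('KS statistic:', ks_result.statistic, 'p-value:', ks_result.pvalue)
print('Lilliefors statistic:', lf_stat, 'p-value:', lf_pvalue)
```

As with the Shapiro-Wilk test, a small p-value is evidence against normality. With very large samples these tests can flag deviations too small to matter in practice, so it helps to look at the plots as well.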
What to Do if Residuals are Not Normally Distributed?
If you find that the residuals are not normally distributed, consider the following options:
1. Transformation: Apply transformations (e.g., log, square root) to the dependent variable to stabilize variance and achieve normality.
2. Add Polynomial Terms: If the relationship is non-linear, adding polynomial terms or interaction terms may help.
3. Use Advanced Models: Consider using models that do not assume normality, such as generalized linear models (GLMs) or non-parametric methods.
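To illustrate the transformation option, the sketch below simulates a response with multiplicative, right-skewed noise and compares the Shapiro-Wilk p-value of the residuals before and after a log transform. The data-generating process, the helper name `ols_residuals`, and the sample size are assumptions chosen so that the effect is easy to see. Real data will rarely be this clean.

```python
import numpy as np
from scipy import stats

# Simulated data (an assumption for illustration): the response has
# multiplicative, right-skewed noise, so it is linear only on the log scale.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.exp(1.0 + 2.0 * x + rng.normal(0, 0.5, 200))


def ols_residuals(x, y):
    """Residuals from a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (slope * x + intercept)


# Fit once on the raw response and once on the log-transformed response.
raw_resid = ols_residuals(x, y)
log_resid = ols_residuals(x, np.log(y))

raw_p = stats.shapiro(raw_resid).pvalue
log_p = stats.shapiro(log_resid).pvalue

print('Shapiro-Wilk p-value, raw response:', raw_p)
print('Shapiro-Wilk p-value, log response:', log_p)
```

A log transform only applies to strictly positive responses, and a model fitted to the log response predicts on the log scale, so predictions need to be transformed back before they are interpreted in the original units.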
Conclusion
Checking the normality of residuals is a vital step in regression diagnostics. Ensuring this assumption holds allows for valid statistical inference and reliable predictions. By using visual and statistical methods, you can assess the normality of residuals and take corrective actions if necessary.