Homoscedasticity vs. Heteroscedasticity
Homoscedasticity is one of the standard assumptions of linear regression, and heteroscedasticity is its violation. Checking which of the two describes your data helps ensure that the model's inference (standard errors, confidence intervals, and hypothesis tests) is reliable.
What is Homoscedasticity?
Homoscedasticity refers to a situation in regression analysis where the variance of the residuals (errors) is constant across all levels of the independent variable(s). This means that the scatter of the residuals does not change as the value of the independent variable changes.
Example of Homoscedasticity
Imagine you are modeling the relationship between hours studied (independent variable) and exam scores (dependent variable). If the spread of the residuals (differences between observed and predicted scores) stays roughly the same across all levels of hours studied, then the data exhibits homoscedasticity.
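The same idea can be written in symbols. This is a sketch under assumed notation that does not appear in the text above: `y_i` is the exam score of student `i`, `x_i` the hours studied, `\varepsilon_i` the error term, and `\sigma^2` a single fixed variance.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
\operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2 \quad \text{for all } i
```

The point to notice is that `\sigma^2` carries no subscript `i`: the error variance is the same whether a student studied for one hour or ten.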
What is Heteroscedasticity?
In contrast, heteroscedasticity occurs when the variance of the residuals varies across levels of the independent variable(s). Under heteroscedasticity, ordinary least squares coefficient estimates remain unbiased but are no longer efficient, and the usual standard errors are biased, so t-tests, F-tests, and confidence intervals that assume constant variance can be misleading.
Example of Heteroscedasticity
Continuing with the previous example, if the variance of the residuals increases as the number of hours studied increases (e.g., students who study more have increasingly varied exam scores), then the data is heteroscedastic. This could be visually represented by a fan-shaped spread of residuals when plotted against the independent variable.
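To make the hours-studied example concrete, here is a minimal simulation sketch. All numbers in it are assumptions chosen for illustration (the intercept of 50, the slope of 4, the noise levels, and the seed); the helper `spread_ratio` is hypothetical and not part of any library. It uses `np.polyfit` for the line fit purely to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
n = 500
hours = rng.uniform(1, 10, n)    # hypothetical hours studied

# Assumed true relationship: score = 50 + 4 * hours + noise
scores_homo = 50 + 4 * hours + rng.normal(0, 3, n)           # constant noise sd
scores_hetero = 50 + 4 * hours + rng.normal(0, 0.5 * hours)  # noise sd grows with hours


def spread_ratio(x, y):
    """Fit a straight line by least squares and return the ratio of the
    residual standard deviation in the upper half of x to the lower half."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    upper = x > np.median(x)
    return residuals[upper].std() / residuals[~upper].std()


ratio_homo = spread_ratio(hours, scores_homo)
ratio_hetero = spread_ratio(hours, scores_hetero)
print(f"homoscedastic spread ratio:   {ratio_homo:.2f}")
print(f"heteroscedastic spread ratio: {ratio_hetero:.2f}")
```

A ratio close to 1 means the residuals are about as spread out for students who studied a lot as for those who studied little. A ratio well above 1 is the numeric counterpart of the fan shape.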
Diagnosing Homoscedasticity and Heteroscedasticity
There are several ways to diagnose whether your data exhibits homoscedasticity or heteroscedasticity:
1. Residual Plots
- Residual vs. Fitted Plot: Plot the residuals against the fitted values of the model. If the spread of residuals remains constant, you have homoscedasticity. If the spread increases or decreases, you have heteroscedasticity.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

# Sample Data
np.random.seed(0)
X = np.random.normal(0, 1, 100)
Y = 2 * X + np.random.normal(0, 1, 100)

# Fit Model
model = sm.OLS(Y, sm.add_constant(X)).fit()
residuals = model.resid
fitted = model.fittedvalues

# Residual Plot
plt.scatter(fitted, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()
```
2. Statistical Tests
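- Breusch-Pagan Test: This test checks for heteroscedasticity. Its null hypothesis is that the residual variance is constant, so a small p-value (for example, below 0.05) indicates that heteroscedasticity is present. Here's how you can run it in Python:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Reuses `residuals` and `model` from the residual plot example above.
# het_breuschpagan returns (LM statistic, LM p-value, F statistic, F p-value).
bp_test = het_breuschpagan(residuals, model.model.exog)
print('Lagrange Multiplier statistic:', bp_test[0])
print('p-value:', bp_test[1])
```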