Overfitting and Underfitting
Understanding overfitting and underfitting is crucial for building effective machine learning models. Both phenomena describe how well a model generalizes to unseen data.
What is Overfitting?
Overfitting occurs when a model learns the training data too well, capturing noise along with the underlying patterns. This results in a model that performs exceptionally on the training data but poorly on new, unseen data.
Characteristics of Overfitting:

- Complex Models: Overly complex models with too many parameters.
- High Variance: Models that are sensitive to fluctuations in the training data.
- Low Training Error / High Validation Error: The model has a significantly lower error on training data compared to validation/test data.

Example of Overfitting
Consider a polynomial regression model. If we fit a high-degree polynomial to a set of noisy points, we can get a curve that bends to follow individual training points rather than the underlying trend:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate some data: a quadratic signal plus Gaussian noise
np.random.seed(0)
X = 6 * np.random.rand(100, 1) - 3
y = X**2 + np.random.randn(100, 1)

# Fit a high-degree polynomial
poly_features = PolynomialFeatures(degree=15)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

# Predictions on an evenly spaced grid
X_new = np.linspace(-3, 3, 100).reshape(100, 1)
X_new_poly = poly_features.transform(X_new)
y_new = lin_reg.predict(X_new_poly)

# Plotting
plt.scatter(X, y, s=10)
plt.plot(X_new, y_new, color='red', linewidth=2)
plt.title('Overfitting Example: High-Degree Polynomial')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
```

In this example, the red curve represents a high-degree polynomial that fits the training data closely, but it likely won’t perform well on new data.
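The gap between training and validation error described above can also be measured rather than judged from a plot. The following is a minimal sketch, assuming the same data recipe as the example; the 70/30 split, the `random_state` value, the comparison against a degree-2 model, and the helper name `train_and_val_mse` are illustrative choices that do not come from the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic signal plus Gaussian noise, following the data recipe used above.
rng = np.random.RandomState(0)
X = 6 * rng.rand(100, 1) - 3
y = X**2 + rng.randn(100, 1)

# Hold out 30% of the points for validation (an arbitrary choice for this sketch).
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def train_and_val_mse(degree):
    """Fit a polynomial of the given degree; return (training MSE, validation MSE)."""
    # Hypothetical helper, introduced only for this illustration.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    return train_mse, val_mse


# Compare a model that matches the true relationship with a much more flexible one.
for degree in (2, 15):
    train_mse, val_mse = train_and_val_mse(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  validation MSE={val_mse:.3f}")
```

Because the degree-2 features are a subset of the degree-15 features, the degree-15 model should not do meaningfully worse on the training set. The number to watch is the difference between its training and validation error, which will vary with the split.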
What is Underfitting?
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. It fails to learn the training data sufficiently and consequently performs poorly on both training and unseen data.
Characteristics of Underfitting:

- Simple Models: Models that are too simplistic for the data.
- High Bias: Models that make strong assumptions and are rigid in their predictions.
- High Training Error: The model has a high error on training data, indicating it cannot capture the data's complexity.

Example of Underfitting
Using the same kind of data, if we fit a linear model to a quadratic relationship, it will not capture the data well:

```python
from sklearn.linear_model import LinearRegression

# Reuses X, y, X_new, and plt from the overfitting example

# Fit a linear model
lin_reg_simple = LinearRegression()
lin_reg_simple.fit(X, y)

# Predictions
y_simple_new = lin_reg_simple.predict(X_new)

# Plotting
plt.scatter(X, y, s=10)
plt.plot(X_new, y_simple_new, color='green', linewidth=2)
plt.title('Underfitting Example: Linear Model')
plt.xlabel('X')
plt.ylabel('y')
plt.show()
```

In this example, the green line represents a linear model that fails to capture the quadratic nature of the data, leading to poor performance.
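One way to check for underfitting numerically is to confirm that the error is high on the training data itself, not only on held-out data. The sketch below assumes the same quadratic-plus-noise data recipe; the 70/30 split, the `random_state` value, and the `models` and `errors` dictionaries are illustrative choices that do not come from the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Quadratic signal plus Gaussian noise, following the data recipe used earlier.
rng = np.random.RandomState(0)
X = 6 * rng.rand(100, 1) - 3
y = X**2 + rng.randn(100, 1)

# The 70/30 split and random_state are arbitrary choices for this sketch.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A straight line (too simple for this data) versus a quadratic model.
models = {
    "linear": LinearRegression(),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

# Record (training MSE, validation MSE) for each model.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_val, model.predict(X_val)),
    )
    print(
        f"{name:9s}  train MSE={errors[name][0]:.3f}  "
        f"validation MSE={errors[name][1]:.3f}"
    )
```

Under these assumptions the linear model's error should be high on both splits, which is the signature of underfitting, while the quadratic model's errors should sit much closer to the noise variance of 1.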