What is Overfitting?
Overfitting is a common problem in machine learning and statistical modeling where a model learns not only the underlying patterns in the training data but also the noise and fluctuations. This leads to a model that performs exceptionally well on the training data but fails to generalize to unseen data, resulting in poor performance on validation or test datasets.
Understanding Overfitting
When a model is overfitted, it captures the random noise in the training data instead of the actual signal. This typically happens when the model is too complex, with too many parameters relative to the number of observations. It can be likened to memorizing the answers to a test rather than understanding the subject material.
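To see the "memorization" effect concretely, here is a minimal sketch (the dataset and model choice are illustrative assumptions, not part of this article): an unconstrained decision tree fit on noisy labels will score perfectly on the data it has seen, because it memorizes the noise, while scoring noticeably worse on held-out data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the true signal is simply "x0 > 0", plus 20% label noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(400) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize every training point, noise included
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # perfect on seen data
test_acc = tree.score(X_test, y_test)    # noticeably lower on unseen data
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The gap between the two scores is the signature of overfitting: the training accuracy reflects memorization, not understanding.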
Visualizing Overfitting
To illustrate overfitting, consider a simple example with a polynomial regression model:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate sample data: a noisy sine wave
np.random.seed(0)
x = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(x) + np.random.normal(0, 0.2, x.shape)

# Create degree-15 polynomial features
poly = PolynomialFeatures(degree=15)
x_poly = poly.fit_transform(x)

# Fit model
model = LinearRegression()
model.fit(x_poly, y)

# Predict on the training data
y_pred = model.predict(x_poly)
print(f"Training MSE: {mean_squared_error(y, y_pred):.4f}")  # deceptively low

# Plot
plt.scatter(x, y, color='blue', label='Data points')
plt.plot(x, y_pred, color='red', label='Overfitted model')
plt.title('Overfitting Example')
plt.legend()
plt.show()
```
In this example, despite the underlying relationship being a sine wave, the polynomial model of degree 15 captures all the fluctuations in the training data, leading to a highly complex model that does not generalize well.
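One way to expose this numerically is to hold out part of the data and compare errors across polynomial degrees. The sketch below rebuilds the same noisy-sine setup with a train/test split (the split and the `fit_mse` helper are my own additions for illustration, not part of the original snippet): the high-degree model achieves a lower training error but a worse test error than a modest degree.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Same noisy sine wave as above, flattened to a 1-D target
np.random.seed(0)
x = np.sort(5 * np.random.rand(80, 1), axis=0)
y = np.sin(x).ravel() + np.random.normal(0, 0.2, 80)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

def fit_mse(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    poly = PolynomialFeatures(degree=degree)
    x_tr = poly.fit_transform(x_train)
    model = LinearRegression().fit(x_tr, y_train)
    train = mean_squared_error(y_train, model.predict(x_tr))
    test = mean_squared_error(y_test, model.predict(poly.transform(x_test)))
    return train, test

for degree in (3, 15):
    tr, te = fit_mse(degree)
    print(f"degree {degree:>2}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The degree-15 model drives its training error down by chasing noise, but pays for it on the held-out points; the widening train/test gap is exactly the failure to generalize described above.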