What is Density-Based Clustering?
Density-based clustering identifies groups (clusters) of data points that are closely packed together, while marking points that lie in low-density regions as outliers or noise. Because it can find clusters of arbitrary shape and tolerates noise, it is widely used in applications such as spatial data analysis, image processing, and anomaly detection.
Key Concepts
1. Density
In the context of clustering, density is the number of points within a given region of space. Density-based clustering algorithms assume that clusters are dense regions of points separated by regions of lower density.
2. Core Points, Border Points, and Noise
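As a rough illustration, the density around each point can be estimated by counting the points that fall within a fixed radius of it; this count is what the core-point definition relies on. The helper name `neighbor_counts`, the radius of 3, and the sample coordinates are assumptions chosen for this sketch.

```python
import numpy as np

def neighbor_counts(X, radius):
    """Count, for each point, how many points (itself included) lie within `radius`."""
    # Pairwise Euclidean distances via broadcasting: result has shape (n, n)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    return (dists <= radius).sum(axis=1)

# Illustrative 2D points: two tight groups and one isolated point
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
counts = neighbor_counts(X, radius=3.0)
print(counts.tolist())  # [3, 3, 3, 2, 2, 1]
```

The isolated point has a count of 1 (only itself), which is the kind of low-density location a density-based method treats as noise.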
- Core Points: A point is a core point if at least a minimum number of points (MinPts) lie within a specified radius (Eps) of it. Core points form the backbone of a cluster.
- Border Points: A point that is not a core point but lies within the Eps radius of a core point. Border points sit on the edge of a cluster and help define its shape.
- Noise Points: Points that are neither core points nor border points. They are labeled as noise and are not assigned to any cluster.
3. Parameters: Eps and MinPts
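Both parameters appear directly in the point-type definitions, so a short sketch can show how they decide each point's role. This is a minimal illustration rather than a reference implementation: `classify_points` is a hypothetical helper, the coordinates are made up for the example, and the neighbor count is assumed to include the point itself (the convention scikit-learn's `min_samples` follows).

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Return 'core', 'border', or 'noise' for each row of X."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    within = dists <= eps                    # within[i, j]: j lies in i's Eps-neighborhood
    is_core = within.sum(axis=1) >= min_pts  # count includes the point itself (assumption)
    # A non-core point is a border point if some core point lies within Eps of it
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Made-up points: a small square, one point hanging off its edge, one far away
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 1], [6, 6]], dtype=float)
kinds = classify_points(X, eps=1.0, min_pts=3)
print(kinds.tolist())
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (2, 1) has only two points in its neighborhood, so it is not a core point, but it lies within Eps of the core point (1, 1) and is therefore a border point.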
- Eps (Epsilon): The radius within which to search for neighboring points. A larger Eps tends to merge nearby groups into bigger clusters, while a smaller Eps splits clusters and leaves more points as noise.
- MinPts: The minimum number of points required to form a dense region. A common rule of thumb is to set MinPts to at least the dimensionality of the data plus one.
DBSCAN: A Popular Density-Based Clustering Algorithm
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is one of the best-known density-based clustering algorithms. It uses the concepts of core points, border points, and noise to form clusters.
How DBSCAN Works
1. Select an arbitrary unvisited point from the dataset.
2. Retrieve all points within the Eps radius of the selected point.
3. If the number of points retrieved is at least MinPts, the selected point is a core point and a new cluster is formed around it. Otherwise, the point is provisionally marked as noise; it may later join a cluster as a border point.
4. If the point is a core point, expand the cluster by recursively retrieving all points within the Eps radius of each newly found core point and adding them to the cluster.
5. Repeat the process with the next unvisited point until all points have been processed.
Example
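The steps listed under How DBSCAN Works can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not scikit-learn's implementation: `dbscan_sketch` is a hypothetical name, distances are Euclidean and computed all at once (reasonable only for small datasets), the neighbor count includes the point itself, and the recursion in step 4 is replaced by an explicit queue.

```python
from collections import deque
import numpy as np

NOISE = -1
UNVISITED = -2

def dbscan_sketch(X, eps, min_pts):
    """Label each row of X with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(X)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Step 2, precomputed: indices within Eps of each point (itself included)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]

    labels = np.full(n, UNVISITED)
    cluster_id = 0
    for i in range(n):                      # steps 1 and 5: visit every point once
        if labels[i] != UNVISITED:
            continue
        if len(neighbors[i]) < min_pts:     # step 3: not dense enough
            labels[i] = NOISE               # may be relabeled as a border point later
            continue
        labels[i] = cluster_id              # step 3: i is a core point, start a cluster
        queue = deque(neighbors[i])
        while queue:                        # step 4: expand through core points
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster_id
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is also core: keep expanding
        cluster_id += 1
    return labels

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
labels = dbscan_sketch(X, eps=3, min_pts=2)
print(labels.tolist())  # [0, 0, 0, 1, 1, -1]
```

The queue avoids deep recursion on large clusters; a production implementation would also replace the full distance matrix with a spatial index.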
Consider a dataset of geographical locations represented as points in 2D space. Using DBSCAN, you can identify clusters of densely populated areas, such as cities, while labeling isolated points in sparse rural areas as noise.
```python
from sklearn.cluster import DBSCAN
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# Applying DBSCAN
model = DBSCAN(eps=3, min_samples=2)
model.fit(X)

# Results: one cluster label per input point
labels = model.labels_

# Cluster labels
print(labels)
# Output: [ 0  0  0  1  1 -1]  (-1 indicates noise)
```
In this example, the first three points are clustered together (label 0) and the next two points are clustered together (label 1), while the last point is classified as noise (label -1).
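To see how strongly the result depends on Eps, the same data can be clustered with a few different radii. The values 0.5 and 10 below are arbitrary choices for illustration; only `eps=3` comes from the example above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

results = {}
for eps in (0.5, 3, 10):
    # fit_predict fits the model and returns the label of each point
    results[eps] = DBSCAN(eps=eps, min_samples=2).fit_predict(X).tolist()
    print(f"eps={eps}: {results[eps]}")
# eps=0.5: [-1, -1, -1, -1, -1, -1]
# eps=3: [0, 0, 0, 1, 1, -1]
# eps=10: [0, 0, 0, 0, 0, -1]
```

With a very small radius no point has a neighbor, so everything is noise; with a large radius the two groups merge into a single cluster, while the distant point remains noise.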