Clustering Techniques
Clustering is a fundamental technique in data analysis that involves grouping a set of objects in such a way that objects in the same group (or cluster) are more similar to each other than to those in other groups. This technique is widely used in various fields such as marketing, biology, and social sciences to identify patterns and structures in data.
1. Understanding Clustering
Clustering is an unsupervised learning technique, meaning that it does not rely on labeled outcomes. Instead, it seeks to infer the natural structure present within a set of data points. The main objective is to organize data into meaningful groups without prior knowledge of the group definitions.
1.1 Key Concepts
- Centroid: The center point of a cluster, often used in algorithms like K-Means.
- Distance Metric: A method for measuring how far apart two data points are (e.g., Euclidean distance, Manhattan distance).
- Inertia: The sum of squared distances between each data point and the centroid of its cluster; lower inertia means tighter clusters.

2. Common Clustering Techniques
2.1 K-Means Clustering
K-Means is one of the most popular clustering algorithms. It partitions the dataset into K clusters, where each data point belongs to the cluster with the nearest mean.

Steps:
1. Choose the number of clusters K.
2. Initialize K centroids randomly.
3. Assign each data point to the nearest centroid.
4. Recalculate the centroids based on the current cluster assignments.
5. Repeat steps 3 and 4 until convergence (i.e., the centroids no longer change significantly).

Example Code (Python):
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# K-Means clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)

# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=50, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
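To make steps 2 to 5 above concrete, here is a minimal from-scratch sketch of the K-Means loop in NumPy. The function name kmeans_from_scratch, the random initialization, and the convergence check are illustrative choices rather than part of scikit-learn; the sketch uses the Euclidean distance metric and computes the inertia described in section 1.1.

```python
import numpy as np

def kmeans_from_scratch(X, k, n_iters=100, seed=0):
    """Minimal K-Means loop: assign points, update centroids, stop at convergence."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Inertia (section 1.1): sum of squared distances to the assigned centroids
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
labels, centroids, inertia = kmeans_from_scratch(X, k=2)
print(labels, centroids, inertia)
```

The fitted scikit-learn model exposes the same quantity as kmeans.inertia_, and kmeans.predict can assign new points to the learned clusters.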
2.2 Hierarchical Clustering
Hierarchical clustering builds a hierarchy of clusters either in a bottom-up (agglomerative) or top-down (divisive) manner. This technique is useful for understanding data at different levels of granularity.

Agglomerative Approach:
1. Treat each data point as a single cluster.
2. Merge the closest pair of clusters.
3. Repeat until only one cluster remains.

Example Code (Python):
```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# Sample data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Create the dendrogram
linkage_matrix = sch.linkage(X, 'ward')
plt.figure(figsize=(10, 7))
sch.dendrogram(linkage_matrix)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
```
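The dendrogram visualizes the full merge hierarchy; to obtain flat cluster labels from it, one common approach is to cut the tree at a chosen distance with SciPy's fcluster. Below is a minimal sketch; the threshold t=5 is an illustrative value read off the dendrogram for this toy data, not a general default.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.cluster.hierarchy import fcluster

# Same sample data and Ward linkage as above
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
linkage_matrix = sch.linkage(X, 'ward')

# Cut the tree: merges above the distance threshold are not applied,
# so each remaining subtree becomes one flat cluster.
labels = fcluster(linkage_matrix, t=5, criterion='distance')
print(labels)  # two flat clusters, e.g. [1 1 1 2 2 2]
```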
2.3 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a clustering algorithm that groups together points that are closely packed, marking as outliers points that lie alone in low-density regions. It is particularly useful for identifying clusters of varying shapes and sizes.

Key Parameters:
- Epsilon (eps): The maximum distance between two samples for them to be considered part of the same neighborhood.
- MinPts (min_samples in scikit-learn): The minimum number of samples in a point's neighborhood for that point to be considered a core point.

Example Code (Python):
```python
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

# Sample data
X = np.array([[1, 2], [1, 2.5], [1, 3], [5, 8], [8, 8], [8, 9]])

# DBSCAN clustering
dbscan = DBSCAN(eps=1, min_samples=2).fit(X)

# Plotting the clusters
plt.scatter(X[:, 0], X[:, 1], c=dbscan.labels_, s=50, cmap='viridis')
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
```
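Unlike K-Means, DBSCAN can leave points unassigned: in scikit-learn, noise points receive the label -1 and the indices of core points are exposed via core_sample_indices_. A short sketch of inspecting them on the same toy data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [1, 2.5], [1, 3], [5, 8], [8, 8], [8, 9]])
dbscan = DBSCAN(eps=1, min_samples=2).fit(X)

# Label -1 marks noise: points not density-reachable from any core point
print("labels:", dbscan.labels_)
print("noise points:", X[dbscan.labels_ == -1])   # here the isolated point [5, 8]
# Core points have at least min_samples neighbours (including themselves) within eps
print("core sample indices:", dbscan.core_sample_indices_)
```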