"Clustering Together": A Visual Guide to the K-Means Algorithm

K-Means Clustering And How It Helps Uncover Hidden Patterns In Data

May 16, 2024


What is the K-Means Clustering Algorithm?

0. Why K-Means Clustering?

Imagine you're organizing a party and need to group people into different tables based on their preferences. You want each table to have people with similar tastes, so everyone has a good time. This is similar to what K-Means clustering does in the world of machine learning.

K-Means [1] is a simple yet powerful algorithm used to divide a dataset into groups, or clusters, based on similarity. It's widely used in various fields, from market segmentation to image compression.

1. What is K-Means Clustering?

K-Means clustering aims to partition data into K clusters, where each data point belongs to the cluster with the nearest mean. The algorithm iteratively assigns data points to clusters and updates the cluster centers (means) to minimize the variance within each cluster.

Motivation for K-Means: Simplicity and Effectiveness

Understanding the structure of data can reveal hidden patterns and relationships. K-Means clustering has seen a variety of applications across domains:

Customer segmentation in marketing: K-Means clustering is widely used in marketing to segment customers based on various attributes such as demographics, behavior, and purchase history. For example, in e-commerce, K-Means helps in identifying different customer segments to tailor marketing strategies and improve customer satisfaction. By grouping customers based on their purchasing behavior, businesses can create targeted marketing campaigns and optimize their product offerings to meet the needs of specific segments.
Image segmentation in computer vision [5]: In computer vision, K-Means clustering is used for image segmentation, where the goal is to partition an image into segments that are homogeneous in terms of color, intensity, or texture. This technique is useful in applications such as object detection and image analysis. By grouping pixels with similar characteristics, K-Means helps in isolating different objects within an image, making it easier to analyze and interpret visual data.
Document clustering in natural language processing [8]: In NLP, K-Means clustering is used to group similar documents together. This is particularly useful for organizing large collections of text data, such as news articles, research papers, or social media posts. Clustering documents by content makes information easier to manage and retrieve. For instance, K-Means can help categorize news articles into topics such as politics, sports, and technology, so users can quickly find articles on a specific subject.
Finding similar images through reverse image search [6]: K-Means clustering can be used in reverse image search to find similar images by grouping images based on their feature vectors. This process involves using a pre-trained convolutional neural network (CNN) to extract feature vectors from images, which are then clustered using K-Means. When a query image is provided, its feature vector is compared to the cluster centroids, and images from the closest cluster are retrieved as candidates.
Color quantization using k-means [7]: Color quantization is a process that reduces the number of distinct colors in an image while preserving the visual appearance. K-Means clustering is commonly used for this purpose by grouping pixels into clusters based on their colors and then replacing the colors in each cluster with the centroid color. This technique helps in compressing images and reducing their size.
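
To make the color quantization idea concrete, here is a minimal sketch using scikit-learn. The image is a randomly generated array standing in for a real photo, and the palette size of 8 colors is arbitrary; both are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for a real image: a 32x32 RGB array of random values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))

# Treat every pixel as a point in 3D color space.
pixels = image.reshape(-1, 3)

# Cluster the pixel colors into 8 groups (an arbitrary palette size).
n_colors = 8
kmeans = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with the centroid color of its cluster.
quantized = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)

print("Distinct colors before:", len(np.unique(pixels, axis=0)))
print("Distinct colors after: ", len(np.unique(quantized.reshape(-1, 3), axis=0)))
```

With a real photo you would load the pixel array from an image file instead of generating it; the clustering and replacement steps stay the same.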

The objective of K-Means is to group points into clusters.

2. How K-Means Clustering Works

Here is a step-by-step explanation:

Initialize Centroids: Choose K random points as initial cluster centers (centroids).
Assign Points to Clusters: Assign each data point to the nearest centroid.
Update Centroids: Calculate the new centroids by taking the mean of all points assigned to each cluster.
Repeat: Repeat the assignment and update steps until the centroids no longer change significantly.

K-Means finds the cluster centers and the cluster membership iteratively.
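
The four steps above can be sketched in a few lines of NumPy. This is a bare-bones illustration, not how scikit-learn implements it: the function name `kmeans_sketch`, the tolerance, and the handling of empty clusters are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points.
        # An empty cluster simply keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids have effectively stopped moving.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Two synthetic blobs, around (0, 0) and (5, 5).
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
])
centroids, labels = kmeans_sketch(X, k=2)
print(centroids)
```

Note that the loop returns to the assignment step, not to initialization: the centroids are chosen once and then refined.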

3. Example: Clustering Animals by Size and Speed

Imagine we have data on various animals, including their size and speed. We want to group them into clusters to see if we can find any patterns.

After clustering, each group should ideally represent a set of animals with similar size and speed characteristics. This can help us understand which animals share common traits and how they might relate to each other.
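
Here is a small sketch of that idea. The animals and their size and speed scores (on a 1 to 10 scale) are made up for illustration and are not real measurements.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical animals with made-up (size, speed) scores on a 1-10 scale.
animals = ["snail", "tortoise", "hedgehog",
           "hare", "fox", "gazelle",
           "elephant", "hippo", "rhino"]
features = np.array([
    [1.0, 1.0], [2.0, 1.5], [1.5, 2.0],   # small and slow
    [2.5, 8.5], [3.5, 8.0], [4.0, 9.5],   # smaller and fast
    [9.5, 4.0], [8.5, 3.0], [9.0, 4.5],   # large, moderate speed
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Print the members of each cluster.
for label in range(3):
    members = [name for name, l in zip(animals, kmeans.labels_) if l == label]
    print(f"Cluster {label}: {members}")
```

Both features here share the same scale. With real measurements in different units (say kilograms and km/h), you would usually scale the features first so that one of them does not dominate the distance.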

4. Evaluating Clusters: How good are your clusters?

Cluster quality can be measured in several ways:

Inertia: Sum of squared distances between each point and its assigned centroid. Lower inertia indicates tighter clusters.
Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters. Scores range from -1 to 1, and higher scores indicate better-defined clusters.
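
Both metrics are available in scikit-learn. The sketch below, on synthetic blobs whose centers are chosen arbitrarily for illustration, also recomputes inertia by hand to show what the number means.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: 3 blobs with hand-picked centers (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.8, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Inertia as reported by scikit-learn.
print("Inertia:", kmeans.inertia_)

# The same quantity by hand: squared distance of each point to the
# centroid of its assigned cluster, summed over all points.
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - assigned_centroids) ** 2)
print("Manual inertia:", manual_inertia)

# Mean silhouette coefficient over all points.
score = silhouette_score(X, kmeans.labels_)
print("Silhouette score:", score)
```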

5. Choosing K: The Right Number of Clusters

Selecting the optimal number of clusters K is crucial. It is a hyper-parameter optimization problem. Common methods include:

Elbow Method: Plot inertia against different values of K and look for the "elbow" point where the rate of decrease slows down. [2]

For each value of K, the squared distance from each point to its assigned centroid is summed. Standard K-Means uses Euclidean distance; variants of the algorithm use other measures, such as cosine distance.
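
A minimal sketch of the elbow method, assuming synthetic data with four well-separated blobs so that the elbow is easy to spot. The range of K values tried is arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated blobs (an assumption for illustration).
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=0)

# Fit K-Means for each candidate K and record the resulting inertia.
k_values = range(1, 9)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

for k, inertia in zip(k_values, inertias):
    print(f"K={k}: inertia={inertia:.1f}")
```

Plotting `inertias` against `k_values` (for example with matplotlib's `plt.plot`) gives the usual elbow chart. On real data the elbow is often less sharp than on blobs like these, so treat it as a heuristic.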

Silhouette Method: Choose K that maximizes the silhouette score. [3]
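
The silhouette method can be sketched the same way: fit one model per candidate K and keep the K with the highest score. The synthetic data and the range of K values are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated blobs (an assumption for illustration).
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=0)

# The silhouette score needs at least 2 clusters, so start at K=2.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("Best K by silhouette:", best_k)
```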


6. Limitations of K-Means

Sensitivity to Initialization: Different initial centroids can lead to different results. K-Means++ [9] addresses this by choosing the K initial cluster centers more carefully.
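Assumption of Spherical Clusters: K-Means implicitly assumes clusters are roughly spherical and similar in size, so it can perform poorly on elongated or unevenly sized clusters.
Scalability: For very large datasets, K-Means can be computationally intensive, since every iteration computes the distance from each point to each centroid.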
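
To see the sensitivity to initialization for yourself, you can run K-Means once per random seed and compare the final inertia under the two initialization schemes scikit-learn offers. This is a rough sketch on synthetic data; the helper `final_inertias`, the dataset, and the number of seeds are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 6 blobs (an assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=6, cluster_std=0.7, random_state=0)

def final_inertias(init, seeds=range(5)):
    """Run K-Means once per seed (n_init=1) and collect the final inertia."""
    return [
        KMeans(n_clusters=6, init=init, n_init=1, random_state=seed).fit(X).inertia_
        for seed in seeds
    ]

random_inertias = final_inertias("random")
kpp_inertias = final_inertias("k-means++")

print("random init:   ", [round(v, 1) for v in random_inertias])
print("k-means++ init:", [round(v, 1) for v in kpp_inertias])
```

If the numbers within a row differ, the runs converged to different solutions. Setting `n_init` above 1 reruns the algorithm several times and keeps the best result, which is the usual way to reduce this effect.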

7. Implementing K-Means in Python

Here's a simple implementation of K-Means clustering using Python's favorite machine-learning library, scikit-learn [4]:

It would be nice if Substack had better rendering for Python code :)

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# Generate sample data
np.random.seed(0)
X = np.random.rand(100, 2)  # 100 points in 2D

# Fit K-Means model
kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

# Plot results
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Outro

K-Means clustering is a powerful tool for discovering patterns in data. It groups data points into clusters based on similarity, making it easier to analyze and understand complex datasets. Despite its limitations, K-Means remains a popular choice for many clustering tasks due to its simplicity and effectiveness.
