Chapter 9: Unsupervised Learning and Pattern Discovery
K-Means Clustering: Grouping Similar Data Points

So far in this book, every algorithm we have mastered, from Linear Regression to Decision Trees, has relied on a teacher. In data science terms, we call these Supervised Learning algorithms. We fed the model a dataset containing both the inputs (features) and the correct answers (labels), such as "House Price," "Did Churn," or "Is Fraud." The model's job was simply to learn the mapping between the two.

But what happens when there is no teacher? What if you have a massive database of customer transactions, but no column that says "High Value" or "At Risk"? What if you have thousands of server logs but no label indicating "Error" or "Normal"?

Welcome to Unsupervised Learning. In this domain, we don't predict a target variable. Instead, we ask the machine to explore the data and discover hidden structures, patterns, and groupings that we humans might miss. Our first stop in this new landscape is K-Means Clustering, one of the most popular and intuitive algorithms for finding order in chaos.

The Intuition: Organizing the Unknown

Imagine you have just been hired as a marketing manager for a retail chain. You are handed a spreadsheet containing data on 10,000 customers, specifically their Annual Income and Spending Score (a metric of how often they buy). Your boss asks: "How many distinct types of customers do we have?"

You can't run a Logistic Regression because you don't have a target variable to predict. You don't know if there are three types of customers or ten. You simply want to group similar customers together.
A side-by-side comparison. The left panel shows a scatter plot of raw data points (Income vs Spending) all in the same color (gray), looking like a disorganized cloud. The right panel shows the same dots clustered into five distinct colors, revealing specific groups like 'Low Income/High Spend' and 'High Income/High Spend'.

This is exactly what K-Means does. It looks at the distance between data points and attempts to group them into $K$ distinct clusters, where points in the same cluster are similar to each other, and points in different clusters are dissimilar.

How the Algorithm Works

The "K" in K-Means represents the number of clusters you want to find. The "Means" refers to the average position (the center) of the data points in that cluster. Here is the algorithm in plain English:
1. Initialization: You choose $K$ (e.g., 3). The algorithm places 3 points randomly on your data plot. These are called Centroids.
2. Assignment: Every single data point looks at the 3 centroids and "joins" the team of the centroid closest to it.
3. Update: Once all points have joined a team, the cluster center (centroid) is recalculated. The centroid moves to the mathematical average position of all the points in its team.
4. Repeat: Because the centroids moved, some points might now be closer to a different centroid. Steps 2 and 3 are repeated until the centroids stop moving (convergence). (A minimal from-scratch sketch of this loop follows the figure below.)

A 4-step diagram showing the K-Means iteration. Step 1: Random centroids appear on a scatterplot. Step 2: Points are color-coded based on the nearest centroid. Step 3: Centroids move to the geometric center of their new color groups. Step 4: The final stable state where centroids no longer move.
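To make these four steps concrete, here is a minimal from-scratch sketch of the assign-and-update loop in plain NumPy. It is meant for intuition only, not production use; scikit-learn's KMeans (which we use next) adds smarter initialization and many optimizations. The function name kmeans_sketch and the toy demo data are illustrative, not part of any library.

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """A bare-bones K-Means loop: assign points, then move centroids."""
    rng = np.random.default_rng(seed)
    # Step 1 - Initialization: pick k existing points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2 - Assignment: each point joins its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3 - Update: each centroid moves to the average position of its team
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4 - Repeat until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Quick demo on a small random 2D dataset
X_demo = np.random.default_rng(1).normal(size=(60, 2))
demo_labels, demo_centroids = kmeans_sketch(X_demo, k=3)
print(demo_centroids.round(2))

In practice, you will always reach for the library implementation, which is exactly what we do now.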
Implementing K-Means in Python

Let's apply this to our hypothetical customer dataset. We will use scikit-learn, the same library we used for regression and classification. First, we generate some synthetic data to represent our customers.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# 1. Generate synthetic customer data
# We create 300 samples with 2 features (Income, Spending Score)
# centers=4 implies there are naturally 4 groups in this data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
# Convert to DataFrame for easier viewing
df = pd.DataFrame(X, columns=['Annual_Income', 'Spending_Score'])
# 2. Preprocessing: Scaling is Critical
# K-Means calculates distance. If one variable is in millions (Income)
# and another in single digits (Family Size), Income will dominate.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

# 3. Initialize and Fit K-Means
# Let's assume we want to find 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(X_scaled)

# 4. Get the Cluster Labels
# This assigns a number (0, 1, 2, or 3) to every customer
df['Cluster_Label'] = kmeans.labels_
print(df.head())

The output will look like a standard dataframe, but with a new column: Cluster_Label. This label is the pattern the algorithm discovered. You can now query your data: "Show me the average income of Cluster 1 vs Cluster 2."

The Million Dollar Question: How do we choose K?

In the example above, I cheated. I told the computer to look for 4 clusters because I generated the data with 4 centers. In the real world, you won't know the answer. Should you segment your customers into 3 groups? 5 groups? 10?

To solve this, we use a technique called the Elbow Method. We run the K-Means algorithm multiple times, increasing $K$ from 1 to 10. For each run, we calculate the Inertia (also known as the Within-Cluster Sum of Squares). Inertia measures how tightly the data points are packed around their centroids.
- Lower Inertia is better (it means clusters are tight).
- However, if $K$ equals the number of data points, inertia is 0 (perfect), but the clusters are meaningless.

We look for the "Elbow": the point where adding more clusters stops giving us significant gains in compactness. The formula below makes "inertia" precise.
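Formally, inertia is the within-cluster sum of squared distances: for each point we take the squared distance to its own centroid and add everything up (a standard definition, written in the same $K$ notation used above):

$$\text{Inertia} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $C_j$ is the set of points assigned to cluster $j$ and $\mu_j$ is that cluster's centroid. Scikit-learn exposes this value as the inertia_ attribute after fitting, which is exactly what the loop below collects.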
inertia_list = []
# Test K from 1 to 10
for k in range(1, 11):
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X_scaled)
inertia_list.append(kmeans.inertia_)
# Plotting the Elbow
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), inertia_list, marker='o')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Inertia')
plt.show()
A line graph representing the 'Elbow Method'. The X-axis represents 'Number of Clusters (k)' and the Y-axis represents 'Inertia'. The line drops steeply from k=1 to k=2, then bends significantly at k=4, and flattens out afterwards. The point at k=4 is highlighted as the 'Elbow'.

In the graph above, you would see a sharp decline in inertia that flattens out after $K=4$. That "elbow" point suggests that 4 is the optimal balance between simplicity and accuracy.

Business Application: Why use K-Means?

For a career transitioner, understanding the application is just as important as the code. Here is where K-Means shines in industry:
1. Customer Segmentation: As discussed, grouping customers by behavior to send targeted marketing campaigns (e.g., "Budget Shoppers" vs. "Big Spenders").
2. Inventory Management: Clustering products based on sales velocity and seasonality to optimize warehouse placement.
3. Bot Detection: Clustering web traffic. Normal users usually fall into one large cluster; bots often engage in repetitive behaviors that form distinct, smaller clusters or outliers.
4. Document Classification: Grouping thousands of unlabelled support tickets into categories like "Login Issues," "Billing," or "Feature Requests" based on word usage.

Limitations and Pitfalls

While K-Means is a workhorse of unsupervised learning, it is not magic. Keep these limitations in mind:
- Sensitivity to Outliers: One massive outlier can pull a centroid away from the main group, distorting the cluster. It is often wise to remove outliers before clustering.
- Spherical Assumption: K-Means assumes clusters are round balls. If your data forms complex shapes (like a crescent moon or a ring), K-Means will fail to separate them correctly.
- Scaling is Mandatory: As noted in the code, if you do not scale your data (using StandardScaler or MinMaxScaler), the feature with the largest numeric range will dictate the clusters.

A related technique worth knowing is Hierarchical Clustering, an alternative method that creates a "family tree" of data points, allowing us to visualize relationships without pre-selecting the number of clusters. For now, though, we turn to a different unsupervised problem: simplifying datasets with far too many columns.

Dimensionality Reduction: Simplifying Complex Datasets

In the previous section on K-Means Clustering, we successfully grouped data points without labels. We took a dataset and asked the computer to "find the structure." This works beautifully when you are dealing with two or three variables, for example, clustering customers based on Age and Spending Score. You can easily visualize this on a 2D plot (X and Y axis) and see the clusters separate.

But real-world business data is rarely that simple. Imagine you are analyzing customer behavior for a massive e-commerce platform. You aren't just looking at Age and Spending. You have data on:

- Time on site
- Number of clicks
- Average cart value
- Geographic latitude/longitude
- Frequency of returns
- Days since last login
- ...and 40 other columns.

You now have a 50-dimensional dataset. Not only is this impossible for the human brain to visualize, but it also creates a computational problem known as the Curse of Dimensionality. As you add more features (dimensions), the data becomes "sparse," meaning the data points are so far apart in that high-dimensional space that distance-based algorithms (like K-Means) struggle to determine what is close and what is far.

To solve this, we need a way to reduce the number of columns without losing the critical information contained within them. We need Dimensionality Reduction.

The Concept: Compression and Summarization

Think of Dimensionality Reduction as an "Executive Summary" for your dataset. If you submit a 50-page report to a CEO, they might ask for a one-page summary. That summary doesn't just delete pages 2 through 50; it synthesizes the most important points from all 50 pages into a condensed format.

In Data Science, we do this to:

1. Visualize data: We can squeeze 50 columns down to 2 or 3 so we can plot them on a chart.
2. Improve performance: Fewer columns mean faster processing for machine learning models.
3. Remove noise: It helps filter out irrelevant details, focusing the model on the signal.

The most popular technique for this is Principal Component Analysis (PCA).

Principal Component Analysis (PCA)

PCA is a statistical procedure that converts a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called Principal Components. That sounds complex, so let's use a visual analogy.

Imagine you are holding a teapot. You want to take one photograph that best describes the shape of the teapot.

- Angle A (Top-down): You only see a circle (the lid). You lose the information about the handle and the spout.
- Angle B (Side view): You see the height, the spout, and the handle. This captures the most "variance," or information, about the object.

PCA effectively rotates the object (your data) in high-dimensional space to find the "best angle": the angle that captures the most variance (information) and projects the data onto that angle.

A diagram comparing two 2D projections of a 3D object (a teapot). The left projection is top-down, resulting in a simple circle (Low Information/Low Variance).
The right projection is from the side, showing the spout and handle (High Information/High Variance). Arrows indicate that PCA selects the view with the highest variance.
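For readers who want the formal version of "picking the best angle," the first principal component can be written as the unit-length direction $w$ that maximizes the variance of the data projected onto it (a standard formulation, stated here without derivation):

$$w_1 = \underset{\lVert w \rVert = 1}{\arg\max}\; \operatorname{Var}(Xw) = \underset{\lVert w \rVert = 1}{\arg\max}\; w^{\top} \Sigma\, w$$

where $X$ is the centered (standardized) data matrix and $\Sigma$ is its covariance matrix. Each later component solves the same problem while staying perpendicular to the earlier ones. If that feels abstract, the plain-English walkthrough below covers everything you need.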
How PCA Works (The Non-Math Version)
1. Standardization: First, PCA requires that all data be on the same scale. If one column is "Salary" (ranging from 30,000 to 150,000) and another is "Age" (ranging from 18 to 65), the Salary column will dominate simply because the numbers are bigger. We scale them so they compete fairly.
2. Finding the Axis of Variance: The algorithm looks for a line through the data where the data points are most spread out. This line becomes Principal Component 1 (PC1). It represents the strongest pattern in the dataset.
3. Finding the Second Axis: It then looks for a second line, perpendicular (orthogonal) to the first, that captures the next most spread-out direction. This is Principal Component 2 (PC2).
4. Repeat: This continues until we have as many components as we started with dimensions. However, the magic is that usually the first few components (PC1 and PC2) capture 80-90% of the information. We can keep those two and discard the rest.

Implementing PCA in Python

Let's apply this to a dataset. We will use the famous "Wine" dataset available in scikit-learn. It contains the chemical analysis of wines grown in Italy, with 13 different features (Alcohol, Malic acid, Ash, Magnesium, etc.). Our goal: squash these 13 dimensions down to 2 so we can plot the wines on a scatter plot.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# 1. Load the Data
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add the target (the type of wine: 0, 1, or 2) for coloring the plot later
df['Wine_Type'] = data.target
print(f"Original Dataset Shape: {df.shape}")
# Output will be (178, 14) - 178 rows, 13 features + 1 target
# 2. Standardize the Data (Crucial Step!)
# We separate features (X) from the target (y)
features = data.feature_names
x = df.loc[:, features].values
y = df.loc[:, ['Wine_Type']].values
# Scale features to have mean=0 and variance=1
x_scaled = StandardScaler().fit_transform(x)
# 3. Apply PCA
# We tell PCA we want to reduce down to 2 components
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x_scaled)
# Create a new DataFrame with the two components
pca_df = pd.DataFrame(data=principalComponents, columns=['PC1', 'PC2'])
# Concatenate the target variable back for visualization
final_df = pd.concat([pca_df, df[['Wine_Type']]], axis=1)
print(f"New Dataset Shape: {final_df.shape}")
# Output will be (178, 3) - 178 rows, 2 components + 1 target

We have successfully transformed a spreadsheet with 13 columns of chemical data into a dataframe with just two abstract columns: PC1 and PC2.

Interpreting the Results

You might ask, "What does the PC1 column represent? Is it Alcohol?" The answer is no. PC1 is a mathematical mixture of all the original 13 features combined. It might be 30% Alcohol, 20% Magnesium, and -10% Malic Acid. It is an abstract feature that represents the dominant variance in the data.
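If you want to see the recipe behind each component, scikit-learn stores the weights (often called loadings) in the fitted model's components_ attribute; row 0 is PC1's mix of the 13 original features, row 1 is PC2's. A small sketch, continuing from the code above (the exact numbers depend on your fitted model):

# Inspect how much each original feature contributes to PC1 and PC2
loadings = pd.DataFrame(
    pca.components_,
    columns=data.feature_names,
    index=['PC1', 'PC2']
)
print(loadings.round(2))

Features with large positive or negative weights are the ones driving that component. Now, let's visualize the result. Remember, we couldn't visualize 13 dimensions, but we can easily visualize 2.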
# 4. Visualize the 2D Projection
plt.figure(figsize=(10, 8))
sns.scatterplot(
x='PC1',
y='PC2',
hue='Wine_Type',
palette='viridis',
data=final_df,
s=100
)
plt.title('PCA of Wine Dataset: 13 Dimensions reduced to 2', fontsize=15)
plt.xlabel('Principal Component 1', fontsize=12)
plt.ylabel('Principal Component 2', fontsize=12)
plt.show()
A scatter plot with "Principal Component 1" on the X-axis and "Principal Component 2" on the Y-axis. There are three distinct clusters of dots colored Purple, Green, and Yellow. The clusters are relatively well-separated, showing that the data reduction preserved the differences between the wine types.

Even though we threw away 11 dimensions of data, we can see distinct clusters. This tells us that the different types of wine are chemically distinct, and PCA was able to capture those differences in just two dimensions.

The "Explained Variance" Ratio

How much information did we lose? PCA gives us a metric called the Explained Variance Ratio. This tells us what percentage of the original dataset's information is held within our new components.
print(f"Explained Variance Ratio: {pca.explained_variance_ratio_}")
print(f"Total Information Retained: {sum(pca.explained_variance_ratio_) * 100:.2f}%")Typical Output: Explained Variance Ratio: [0.3619, 0.1920] Total Information Retained: 55.41% In this example, PC1 holds 36% of the information, and PC2 holds 19%. Together, they hold about 55% of the original variance. While we lost 45% of the details, we kept enough signal to clearly distinguish the wine types in the plot above. When to Use Dimensionality Reduction As a data scientist, you will reach for PCA in these scenarios:
When to Use Dimensionality Reduction

As a data scientist, you will reach for PCA in these scenarios:

1. Exploratory Data Analysis (EDA): When you get a new dataset with 100 columns and want to see if there are obvious groups or outliers, PCA allows you to plot the "shape" of the data immediately.
2. Addressing Overfitting: If you have too many features (columns) and not enough rows of data, supervised learning models (like Logistic Regression) can get confused. Reducing dimensions helps the model focus on the general patterns rather than memorizing noise.
3. Image Processing: Images are made of pixels. A small 28x28 pixel image has 784 dimensions (columns). PCA is excellent at compressing images by finding the "principal components" of the visual shapes (like curves and loops) rather than analyzing every single pixel.

Summary

Dimensionality Reduction is the art of simplification. By applying PCA, we traded specific details (like the exact magnesium level of a specific wine) for a broader understanding of the dataset's structure. We have now explored how to find patterns without labels using Clustering and how to simplify complex data using Dimensionality Reduction. In the next chapter, we will shift gears completely and discuss how to handle data that isn't numbers in a spreadsheet at all: we are moving into the world of Text Analysis and Natural Language Processing.

Case Study: Customer Segmentation for Targeted Marketing

We have now arrived at the intersection of the technical and the strategic. In the previous two sections, we laid the groundwork: K-Means gave us the algorithmic engine to group similar data points, and Dimensionality Reduction (PCA) gave us the ability to distill complex, multi-variable data into something manageable and visual. Now, we apply these unsupervised learning techniques to one of the most universal business challenges: Marketing Strategy.

Unlike the Employee Attrition case study, where we had historical data telling us exactly who left the company (Supervised Learning), we are now entering the unknown. We don't have labels like "Good Customer" or "Bad Customer." We simply have raw transaction data. Our goal is to let the algorithms discover the natural groupings within our customer base so we can move away from "spray and pray" marketing and toward data-driven personalization.

The Business Problem: The "One-Size-Fits-All" Trap

Imagine you are the Data Scientist for ShopRight, a mid-sized e-commerce retailer. The marketing director approaches you with a problem: "We are sending the same 15% off coupon to everyone. We're losing money giving discounts to loyal customers who would have bought anyway, and we aren't offering enough incentives to bring back customers who haven't shopped in months."

To solve this, we will perform Customer Segmentation. Specifically, we will implement a technique known as RFM Analysis, turbo-charged by K-Means clustering.

The Data Strategy: RFM Analysis

RFM is a classic marketing framework that quantifies customer behavior using three dimensions:

1. Recency (R): How many days has it been since the customer's last purchase? (Lower is better.)
2. Frequency (F): How many times has the customer purchased? (Higher is better.)
3. Monetary Value (M): What is the total amount the customer has spent? (Higher is better.)

By clustering customers based on these three features, we can identify distinct personas automatically.

Step 1: Data Preparation and Feature Engineering

In a real-world scenario, your data would live in a SQL database full of transaction logs; the sketch below shows how RFM features might be rolled up from such a table.
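Before switching to synthetic data, here is a hedged sketch of how RFM features might be derived from a raw transactions table with pandas. The table and its column names (customer_id, order_date, amount) are hypothetical stand-ins for whatever your warehouse actually contains.

import pandas as pd

# Hypothetical transaction log: one row per purchase
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'order_date': pd.to_datetime(['2024-01-05', '2024-03-20', '2023-11-02',
                                  '2024-02-14', '2024-03-01', '2024-03-28']),
    'amount': [120.0, 80.0, 40.0, 300.0, 150.0, 95.0],
})

# "Today" for the Recency calculation: one day after the last recorded order
snapshot_date = transactions['order_date'].max() + pd.Timedelta(days=1)

# Roll transactions up to one row per customer
rfm = transactions.groupby('customer_id').agg(
    Recency=('order_date', lambda d: (snapshot_date - d.max()).days),
    Frequency=('order_date', 'count'),
    Monetary=('amount', 'sum'),
)
print(rfm)

From this point on, the rfm table would play exactly the role that the synthetic DataFrame built below plays in the rest of the case study.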
For this case study, let's generate a synthetic dataset that mimics a retail environment.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
# Setting a seed for reproducibility
np.random.seed(42)
# Generating synthetic customer data (Recency, Frequency, Monetary)
# We create 4 distinct "centers" of customer behavior to simulate reality
data, true_labels = make_blobs(n_samples=500, centers=4, cluster_std=1.5, n_features=3)
# Creating a DataFrame
df = pd.DataFrame(data, columns=['Recency', 'Frequency', 'Monetary'])
# Adjusting the data to look like real RFM values
# Recency: Days since last purchase (e.g., 1 to 365)
df['Recency'] = np.abs(df['Recency'] * 20) + 5
# Frequency: Number of purchases (e.g., 1 to 50)
df['Frequency'] = np.abs(df['Frequency'] * 2) + 1
# Monetary: Total spend (e.g., $50 to $5000)
df['Monetary'] = np.abs(df['Monetary'] * 100) + 50
print(df.head())

Why can't we just feed this into K-Means immediately? Look at the scale of the numbers. Monetary values might be in the thousands (e.g., $2,500), while Frequency might be single digits (e.g., 5 purchases). K-Means uses Euclidean distance (the ruler method) to calculate similarity. If we don't fix this, the algorithm will think a difference of $100 is vastly more important than a difference of 5 purchases, simply because the number is bigger. We must standardize the data so each feature contributes equally to the distance calculation.
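Concretely, two standard formulas are at work here: the Euclidean distance K-Means uses to compare two customers $a$ and $b$ across the $m$ features, and the z-score transformation StandardScaler applies to each feature before that comparison (shown for reference):

$$d(a, b) = \sqrt{\sum_{j=1}^{m} (a_j - b_j)^2} \qquad\qquad z_j = \frac{x_j - \mu_j}{\sigma_j}$$

Here $\mu_j$ and $\sigma_j$ are the mean and standard deviation of feature $j$. After scaling, a one-standard-deviation gap in Frequency counts exactly as much as a one-standard-deviation gap in Monetary value.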
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
# Convert back to DataFrame for easier handling later
df_scaled = pd.DataFrame(df_scaled, columns=['Recency', 'Frequency', 'Monetary'])

Step 2: Finding the Optimal Number of Segments

How many customer segments do we have? 3? 5? 10? Because this is unsupervised learning, we don't know the "right" answer. We use the Elbow Method (introduced in the K-Means section) to find the sweet spot where we minimize the variance within clusters without over-complicating the model.
from sklearn.cluster import KMeans
inertia = []
range_val = range(1, 11)  # Test K from 1 to 10, matching the plot below
for i in range_val:
kmeans = KMeans(n_clusters=i, random_state=42)
kmeans.fit(df_scaled)
inertia.append(kmeans.inertia_)
plt.figure(figsize=(8, 4))
plt.plot(range_val, inertia, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
A line graph plotting "Values of K" on the X-axis (1 through 10) against "Inertia" on the Y-axis. The line drops steeply from K=1 to K=3 and then flattens out significantly after K=4, creating a distinct "elbow" shape at K=4.

Looking at the plot above, the "elbow" occurs around K=4. This suggests that dividing our customers into four groups gives us the best balance of cohesion and simplicity.

Step 3: Building the Model and Visualizing Results

Now we run the K-Means algorithm with 4 clusters.
# Apply K-Means with 4 clusters
kmeans = KMeans(n_clusters=4, random_state=42)
kmeans.fit(df_scaled)
# Assign the cluster labels back to our ORIGINAL (non-scaled) dataframe
df['Cluster'] = kmeans.labels_

We now have a "Cluster" column attached to every customer. But how do we visualize this? Our data is 3-dimensional (Recency, Frequency, Monetary), but our computer screens are 2-dimensional. This is where we apply the concept from the previous section: Dimensionality Reduction (PCA). We will squash the 3 dimensions down to 2 just for the sake of visualization.
from sklearn.decomposition import PCA
import seaborn as sns
# Reduce dimensions to 2 for plotting
pca = PCA(n_components=2)
pca_components = pca.fit_transform(df_scaled)
# Create a temporary dataframe for the plot
df_pca = pd.DataFrame(data=pca_components, columns=['PCA1', 'PCA2'])
df_pca['Cluster'] = df['Cluster']
# Plotting
plt.figure(figsize=(10, 6))
sns.scatterplot(x='PCA1', y='PCA2', hue='Cluster', data=df_pca, palette='viridis', s=100)
plt.title('Customer Segments Visualized (via PCA)')
plt.show()
A scatter plot showing four distinct groups of colored dots (clusters). The clusters are well-separated, indicating that the K-Means algorithm successfully found different patterns in the data. The axes are labeled PCA1 and PCA2.

Step 4: The "So What?" – Interpreting the Segments

This is the most critical step for a Data Scientist. The algorithm outputs numbers (Cluster 0, 1, 2, 3). It is your job to translate those numbers into business logic. We do this by grouping our original data by the cluster ID and looking at the average values for Recency, Frequency, and Monetary.
# Group by cluster and calculate the mean for each feature
cluster_summary = df.groupby('Cluster').agg({
'Recency': 'mean',
'Frequency': 'mean',
'Monetary': 'mean',
'Cluster': 'count' # To see how many customers are in each group
}).rename(columns={'Cluster': 'Count'})
print(cluster_summary.round(2))

Note: The cluster numbers (0, 1, 2, 3) are assigned randomly, so your specific numbers might vary, but the patterns will remain. Let's assume the output looks like the table below:

| Cluster | Recency (Days) | Frequency (Count) | Monetary ($) | Count |
| :--- | :--- | :--- | :--- | :--- |
| 0 | 245.50 | 2.10 | 150.00 | 120 |
| 1 | 15.20 | 18.50 | 2800.00 | 105 |
| 2 | 45.30 | 6.20 | 650.00 | 150 |
| 3 | 180.10 | 15.00 | 2100.00 | 125 |

Step 5: From Data to Strategy

Now we wear our marketing hats. Let's profile these customers and define a strategy for each.

Cluster 1: The "Champions" (Low Recency, High Frequency, High Monetary)
Profile: These customers shopped 15 days ago, buy often, and spend the most.
Strategy: Retention. Do not send them discount coupons; they are already willing to pay full price. Instead, offer them exclusive access to new products, loyalty rewards, or "VIP" status. Make them feel special.

Cluster 0: The "Lost Causes" (High Recency, Low Frequency, Low Monetary)
Profile: They haven't shopped in about eight months (245 days), rarely bought when they did, and spent very little.
Strategy: Deprioritize. Don't waste marketing budget here. Perhaps send a generic automated "We miss you" email, but focus your efforts elsewhere.

Cluster 3: The "At-Risk" Whales (High Recency, High Frequency, High Monetary)
Profile: This is a critical group! They used to buy frequently and spend a lot ($2,100), but they haven't visited in six months (180 days). Something happened: they churned or went to a competitor.
Strategy: Win-Back Campaign. This is where you spend your budget. Send aggressive discounts, "Come back" offers, or surveys to find out what went wrong. Winning them back is high-value.

Cluster 2: The "Promising" Newbies (Medium Recency, Medium Frequency)
Profile: They shop reasonably often and spend a decent amount.
Strategy: Upsell/Cross-sell. Recommend related products to increase their average basket size. Try to nudge them into the "Champion" category.

Summary

In this case study, we moved beyond simple prediction. We didn't ask the computer "Will this customer buy?" (Supervised). Instead, we asked "What kinds of customers do I have?" (Unsupervised). By using K-Means Clustering, we transformed a wall of transaction numbers into four distinct human narratives. This allows the business to move from generic marketing to targeted, high-ROI strategies.

This concludes our exploration of Unsupervised Learning. In the next chapter, we will tackle a completely different beast: dealing with text data and Natural Language Processing (NLP).
