Chapter 8: Classification Algorithms for Categorical Outcomes
Logistic Regression: Predicting Binary Outcomes In the previous chapters, we built a foundation for Regression: predicting a continuous number. We answered questions like "What will the house price be?" or "How many units will we sell next quarter?" But in the business world, not every question is about "How much?" Often, the most critical questions are about "Which one?" or "Yes or No?"
- Will this customer churn? (Yes/No)
- Is this transaction fraudulent? (Yes/No)
- Will this lead convert into a sale? (Yes/No) This brings us to the second pillar of Supervised Learning: Classification. Specifically, we will look at the algorithm used to predict binary outcomes: Logistic Regression. Despite its confusing name (it includes the word "Regression"), Logistic Regression is used strictly for Classification. It is the industry standard for calculating the probability of an event occurring. The Problem with Linear Regression for Classification To understand why we need a new algorithm, let’s imagine we are trying to predict if a customer will buy a new product based on their age. In our dataset, the outcome ($y$) is binary: $0$ = Did not buy $1$ = Bought If we attempted to use the Simple Linear Regression model we mastered in previous sections, we would try to draw a straight line through this data. A scatter plot showing 'Age' on the X-axis and 'Purchased' on the Y-axis. The data points are clustered strictly at Y=0 and Y=1. A straight blue regression line attempts to cut through the data diagonally, extending below 0 and above 1.
A scatter plot showing 'Age' on the X-axis and 'Purchased' on the Y-axis. The data points are clustered strictly at Y=0 and Y=1. A straight blue regression line attempts to cut through the data diagonally, extending below 0 and above 1. There are two major problems with applying a straight line here: 1. The bounds are broken: A straight line extends to infinity in both directions. For a very old customer, the model might predict a value of $1.5$. For a very young customer, it might predict $-0.4$. In the context of probability, what does "150% chance of buying" or "-40% chance of buying" mean? It is mathematically impossible. 2. The relationship isn't linear: In binary decisions, the change often happens quickly around a threshold. A small increase in age might not matter much until a tipping point is reached, at which point the likelihood of purchase spikes. We need a model that bounds our output between 0 and 1 and handles that "tipping point." The Solution: The Sigmoid Function To solve this, Logistic Regression takes the straight line equation ($y = mx + b$) and wraps it inside a transformation function called the Sigmoid Function (also known as the Logistic Function). Without getting bogged down in the calculus, the Sigmoid function acts like a squashing machine. It takes any number—no matter how large or small—and squashes it into a value between 0 and 1. Visually, this turns our straight line into an S-curve. A graph showing the Sigmoid S-curve. The X-axis represents the input (e.g., Age), and the Y-axis represents Probability ranging from 0.0 to 1.0. The curve starts flat near 0, rises steeply in the middle, and flattens out near 1.
A graph showing the Sigmoid S-curve. The X-axis represents the input (e.g., Age), and the Y-axis represents Probability ranging from 0.0 to 1.0. The curve starts flat near 0, rises steeply in the middle, and flattens out near 1. This S-curve is perfect for probabilities. If the input is very low, the curve flattens at 0 (0% probability). If the input is very high, the curve flattens at 1 (100% probability). * The steep slope in the middle represents the transition zone where the outcome is uncertain. From Probability to Prediction When you run a Logistic Regression model, the raw output is a Probability Score. Customer A: 0.92 (92% likely to buy) Customer B: 0.15 (15% likely to buy) * Customer C: 0.51 (51% likely to buy) However, a computer needs to make a binary decision. To convert this probability into a class label ($0$ or $1$), we apply a Threshold (also called a decision boundary). The default threshold in Scikit-Learn is 0.5. If Probability $> 0.5$: Predict Class 1 (Yes) If Probability $\leq 0.5$: Predict Class 0 (No) Note: In advanced business applications, you might tune this threshold. For example, in cancer detection, you might lower the threshold to 0.1 because you want to flag anything that is even remotely suspicious. But for now, we will stick to the default. Implementation in Scikit-Learn Let's apply this to a dataset. We will generate a synthetic dataset representing customers, their age, and whether they purchased a specific insurance policy. The workflow remains the same as it was for Linear Regression: Instantiate, Fit, Predict.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# 1. Create sample data (Age vs. Purchased)
# Synthetic data: Older people are more likely to buy
data = {
'Age': [22, 25, 47, 52, 46, 56, 55, 60, 62, 61, 18, 28, 27, 29, 49],
'Purchased': [0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
}
df = pd.DataFrame(data)
# 2. Split features (X) and target (y)
X = df[['Age']]
y = df['Purchased']
# 3. Train/Test SplitX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
4. Instantiate the model
Note: We use LogisticRegression, not LinearRegression
log_reg = LogisticRegression()
5. Fit the model
log_reg.fit(X_train, y_train)
6. Make Predictions
predictions = log_reg.predict(X_test)
print(f"Test Set Ages: \n{X_test.values.flatten()}") print(f"Predictions (0=No, 1=Yes): {predictions}") Output:
Test Set Ages:
[28 22 25]Predictions (0=No, 1=Yes): [0 0 0] In this small test set, the model predicted '0' (No purchase) for all three young customers. This aligns with the trend in our data. Peeking Under the Hood: predict_proba As a Data Scientist, the binary prediction is useful, but the probability is often more valuable for business strategy. Knowing a customer is 51% likely to leave is very different from knowing they are 99% likely to leave, even though both result in a prediction of "Churn." Scikit-Learn allows us to see these probabilities using the method .predict_proba().
# Predict probabilities for the test set
probabilities = log_reg.predict_proba(X_test)
# Display nicely
results = pd.DataFrame(probabilities, columns=['Prob_No', 'Prob_Yes'])
results['Age'] = X_test.values
results['Predicted_Class'] = predictions
print(results)Output:
```textProb_No Prob_Yes Age Predicted_Class 0 0.8124 0.1876 28 0 1 0.9045 0.0955 22 0 2 0.8650 0.1350 25 0 Here, we can see the nuance. The 22-year-old had a 9.5% probability of buying (Prob_Yes), while the 28-year-old had an 18.7% probability. Neither crossed the 50% threshold, so both were classified as 0. A visualization of the output dataframe above. Arrows point from the 'Prob_Yes' column to the 'Predicted_Class' column, illustrating that since Prob_Yes < 0.5, the Class is 0.
A visualization of the output dataframe above. Arrows point from the 'Prob_Yes' column to the 'Predicted_Class' column, illustrating that since Prob_Yes < 0.5, the Class is 0. Interpreting Coefficients in Logistic Regression In Linear Regression, we learned that the coefficient told us exactly how much $y$ increased for every unit of $x$. In Logistic Regression, interpretation is slightly more complex because of the Sigmoid transformation. The coefficients represent the log-odds, which is not intuitive for most business stakeholders. However, the sign (positive or negative) of the coefficient is immediately useful:
- Positive Coefficient: As the feature increases, the probability of the event (1) increases. 2. Negative Coefficient: As the feature increases, the probability of the event (1) decreases.
print(f"Coefficient for Age: {log_reg.coef_[0][0]}")If this prints a positive number (e.g., 0.15), it confirms that as Age goes up, the likelihood of Purchasing goes up. Summary We have now moved from predicting values (Linear Regression) to predicting probabilities and classes (Logistic Regression).
- Linear Regression is for continuous outcomes (Price, Temperature, Sales).
- Logistic Regression is for binary categories (Yes/No, True/False).
- The Sigmoid function transforms the output into a probability between 0 and 1.
- A Threshold (usually 0.5) turns that probability into a hard decision. However, making a prediction is only half the battle. How do we know if our classification model is actually good? In Regression, we looked at the error distance (RMSE). In Classification, "distance" doesn't make sense—you are either right or wrong. In the next section, we will explore the Confusion Matrix and Accuracy Scores to evaluate the performance of our classification models. Decision Trees: Mapping Logic to Predictions In the previous section, we explored Logistic Regression, a powerful method for predicting binary outcomes (Yes/No). We learned that despite its name, it is a classification algorithm that draws an "S-curve" (the Sigmoid function) to separate classes based on probability. While Logistic Regression is fantastic for understanding relationships (e.g., "How does increasing price affect the probability of a sale?"), it has a distinct limitation: it assumes a mathematical, linear relationship between the features and the log-odds of the outcome. But human decision-making rarely follows a smooth mathematical curve. When you decide whether to wear a coat, you don't calculate a probability coefficient. You follow a set of logic rules: 1. Is it raining? If Yes -> Wear a coat. 2. If No, is it below 60 degrees? If Yes -> Wear a coat. 3. If No -> Don't wear a coat. This logic—a series of sequential questions leading to a conclusion—is the foundation of the Decision Tree. The Intuition: The "Flowchart" Model If you have ever followed a Standard Operating Procedure (SOP) or a troubleshooting guide at work, you have manually executed a Decision Tree. In Data Science, a Decision Tree is a supervised learning algorithm that splits your data into smaller and smaller subsets based on specific criteria. It "grows" an upside-down tree structure:
- The Root Node: The starting point containing the entire dataset. 2. Decision Nodes: Points where the data is split based on a specific variable (e.g., "Income > $50k"). 3. Leaf Nodes: The endpoints where a final prediction is made. A diagram of a decision tree structure. The top box is labeled 'Root Node (All Data)'. Arrows branch out to 'Decision Nodes' containing questions like 'Credit Score > 700?'. The bottom boxes are labeled 'Leaf Nodes' containing the final classifications 'Approve Loan' and 'Deny Loan'.
A diagram of a decision tree structure. The top box is labeled 'Root Node (All Data)'. Arrows branch out to 'Decision Nodes' containing questions like 'Credit Score > 700?'. The bottom boxes are labeled 'Leaf Nodes' containing the final classifications 'Approve Loan' and 'Deny Loan'. Unlike the "Black Box" nature of some advanced algorithms (where the math is so complex it is hard to explain why a prediction was made), Decision Trees are White Box models. They are completely transparent. If your boss asks, "Why did the model reject this loan application?", you can trace the exact path down the tree: "Because the applicant's income was low AND their debt-to-income ratio was high." How the Algorithm "Grows" As a human, you might use intuition to decide which question to ask first. The computer, however, needs a metric. When the algorithm looks at your training data, it attempts to find the feature that best separates the target classes. Imagine a bucket containing 10 blue balls and 10 red balls. The bucket is "impure" (a 50/50 mix). The goal of the algorithm is to find a way to pour these balls into two new buckets such that the new buckets are as "pure" as possible (e.g., one bucket has mostly red, the other mostly blue). To do this, Scikit-Learn uses a metric called Gini Impurity (or sometimes Entropy).
- The model looks at every single feature (e.g., Age, Income, Debt). 2. It tests every possible split (e.g., Age > 20, Age > 21, Age > 22...). 3. It calculates which split results in the highest "purity" (the most homogenous groups) in the resulting child nodes. 4. It repeats this process recursively for every child node until the leaves are pure or a stopping condition is met. A visual representation of splitting a 2D scatter plot. On the left, a plot with mixed red circles and blue squares. On the right, vertical and horizontal lines divide the plot into rectangular regions, isolating the red circles from the blue squares, illustrating how decision boundaries are created.
A visual representation of splitting a 2D scatter plot. On the left, a plot with mixed red circles and blue squares. On the right, vertical and horizontal lines divide the plot into rectangular regions, isolating the red circles from the blue squares, illustrating how decision boundaries are created. Implementation in Scikit-Learn Let's apply this to a relatable scenario: Employee Retention. We want to predict if an employee will leave the company (Attrition = 1) or stay (Attrition = 0) based on their Satisfaction_Level (0 to 1) and Years_at_Company. We adhere to our modeling workflow: Instantiate, Fit, Predict.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# 1. Setup Dummy Data
data = {
'Satisfaction_Level': [0.9, 0.1, 0.8, 0.2, 0.85, 0.15, 0.4, 0.6],
'Years_at_Company': [5, 2, 6, 2, 10, 3, 2, 4],
'Attrition': [0, 1, 0, 1, 0, 1, 1, 0] # 0 = Stay, 1 = Leave
}
df = pd.DataFrame(data)
# 2. Define Features (X) and Target (y)X = df[['Satisfaction_Level', 'Years_at_Company']] y = df['Attrition']
3. Instantiate the model
We set max_depth to prevent the tree from becoming too complex (more on this later)
tree_model = DecisionTreeClassifier(random_state=42, max_depth=3)
4. Fit the model
tree_model.fit(X, y)
5. Predict
Let's predict for a new employee with Low Satisfaction (0.15) and 3 Years experience
new_employee = [[0.15, 3]] prediction = tree_model.predict(new_employee)
print(f"Prediction (0=Stay, 1=Leave): {prediction[0]}") Visualizing the Logic One of the massive advantages of Decision Trees is that we can visualize the model logic directly without needing complex statistical interpretations. Scikit-Learn provides a tool to draw the tree we just built.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plot_tree(tree_model,
feature_names=['Satisfaction', 'Years'],
class_names=['Stay', 'Leave'],
filled=True)
plt.show()
An output of the plot_tree function. The root node at the top shows a split condition 'Satisfaction <= 0.5'. The nodes are colored, with shades of orange representing the 'Leave' class and shades of blue representing the 'Stay' class. The color intensity indicates the purity of the node.An output of the plot_tree function. The root node at the top shows a split condition 'Satisfaction <= 0.5'. The nodes are colored, with shades of orange representing the 'Leave' class and shades of blue representing the 'Stay' class. The color intensity indicates the purity of the node. When you run this code, you will see exactly how the machine is thinking. It likely noticed that Satisfaction_Level was the most important predictor and placed it at the top (the root). If satisfaction is low, it predicts attrition; if high, it predicts retention. The Danger Zone: Overfitting You might be thinking, "If I let the tree grow forever, won't it eventually classify every single training point correctly?" Yes, and that is a problem. If you don't limit the growth of the tree, the algorithm will create specific rules for outliers. It essentially memorizes the training data rather than learning the general patterns. This is called Overfitting. An overfitted tree might look at a specific employee and create a rule: "If Satisfaction is 0.612 AND Years is 4 AND Last Name starts with Z, then Leave." This works for the history books (training data), but it will fail miserably on new data (testing data) because that rule is just noise, not a trend. To prevent this, we use Hyperparameter Tuning. The most common controls are: max_depth: Limits how deep the tree can grow (e.g., only ask 3 questions). min_samples_split: Requires a certain amount of data in a node before allowed to split again (e.g., don't create a rule for just 2 people). Summary Decision Trees offer a refreshing change from the algebraic equations of Regression. They map logic in a way that mirrors human thought, making them exceptionally easy to explain to stakeholders. Pros: Interpretability: Easy to explain to non-technical audiences. Non-Linearity: Can capture complex, non-linear patterns (like "If income is high OR income is low, but not medium..."). Minimal Prep:* Requires less data cleaning (e.g., no need to scale/normalize features) compared to Regression. Cons: Overfitting: Without constraints (max_depth), they memorize noise. Instability: A small change in the data can result in a completely different tree structure. Because single Decision Trees are prone to overfitting and instability, data scientists rarely rely on just one tree for critical production models. Instead, they grow hundreds of trees and average their predictions. This leads us to the concept of Ensemble Modeling and the famous Random Forest, which we will discuss in the next chapter. Evaluating Model Performance: Confusion Matrix, Precision, and Recall In the previous sections, we added powerful tools to your arsenal: Logistic Regression and Decision Trees. You now possess the ability to predict binary outcomes—whether a customer will churn, whether a loan will default, or whether a transaction is fraudulent. However, simply building a model is not enough. In a business setting, you must answer the inevitable stakeholder question: "How good is this model, really?" Your instinct might be to answer with Accuracy—the percentage of correct predictions. While intuitive, accuracy can be the most dangerous metric in Data Science. To understand why, let’s imagine you are building a fraud detection system for a bank. The Accuracy Paradox Suppose you analyze a dataset of 1,000 credit card transactions. In reality, 990 are legitimate, and only 10 are fraudulent. If you wrote a "dumb" model that simply predicted "Legitimate" for every single transaction—ignoring the data entirely—your model would be correct 990 times out of 1,000. Your model would have 99% Accuracy. On paper, this looks spectacular. In practice, the model is useless. It failed to catch a single instance of fraud. This highlights the limitation of Accuracy: in datasets with imbalanced classes (where one outcome is much rarer than the other), accuracy hides the model's failures. To evaluate a classification model effectively, we need to look under the hood. We need the Confusion Matrix. The Confusion Matrix The Confusion Matrix is not a complex mathematical formula; it is a simple 2x2 tally sheet. It breaks down your model’s predictions into four distinct categories based on two questions: 1. What did the model predict? 2. What actually happened? A 2x2 grid representing a Confusion Matrix. The columns are labeled 'Predicted: No' and 'Predicted: Yes'. The rows are labeled 'Actual: No' and 'Actual: Yes'. The four quadrants are labeled: Top-Left 'True Negative (TN)', Top-Right 'False Positive (FP)', Bottom-Left 'False Negative (FN)', and Bottom-Right 'True Positive (TP)'.
A 2x2 grid representing a Confusion Matrix. The columns are labeled 'Predicted: No' and 'Predicted: Yes'. The rows are labeled 'Actual: No' and 'Actual: Yes'. The four quadrants are labeled: Top-Left 'True Negative (TN)', Top-Right 'False Positive (FP)', Bottom-Left 'False Negative (FN)', and Bottom-Right 'True Positive (TP)'. Let’s break down these four quadrants using a Customer Churn context (predicting if a customer will cancel their subscription).
- True Positive (TP): The model predicted the customer would churn, and they did. (A "Hit"). 2. True Negative (TN): The model predicted the customer would stay, and they did. (A correct non-event). 3. False Positive (FP): The model predicted the customer would churn, but they stayed. This is often called a Type I Error or a "False Alarm." 4. False Negative (FN): The model predicted the customer would stay, but they churned. This is a Type II Error or a "Miss." In a business context, not all errors are created equal. A False Positive might mean you send a discount coupon to a happy customer (a small cost). A False Negative might mean you lose a high-value client because you didn't know they were unhappy (a high cost). Precision and Recall Once we have the counts from the Confusion Matrix, we can calculate two specific metrics that tell us much more than simple accuracy: Precision and Recall. Precision: The "Boy Who Cried Wolf" Metric Precision answers the question: Of all the times the model predicted 'Yes', how often was it right? If your model predicts that 100 customers will churn, but only 20 actually do, your model has low precision. You are "crying wolf" too often. Low precision in a spam filter means legitimate emails are getting thrown into the junk folder (False Positives). $$ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $$ Recall: The "Fishing Net" Metric Recall (also known as Sensitivity) answers the question: Of all the actual 'Yes' cases in the data, how many did the model manage to find? If 100 customers actually churned, and your model successfully identified 90 of them, you have high recall. You cast a wide net. High recall is critical in medical diagnostics; if a patient has a disease, we cannot afford to miss it (False Negatives). $$ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $$ A diagram illustrating the difference between Precision and Recall. On the left, a 'Precision' focus shows a small circle selecting only a few high-confidence red dots (positives) among blue dots (negatives), minimizing false positives. On the right, a 'Recall' focus shows a large circle capturing all red dots but accidentally including several blue dots (false positives).
A diagram illustrating the difference between Precision and Recall. On the left, a 'Precision' focus shows a small circle selecting only a few high-confidence red dots (positives) among blue dots (negatives), minimizing false positives. On the right, a 'Recall' focus shows a large circle capturing all red dots but accidentally including several blue dots (false positives). The Tug-of-War There is almost always a trade-off. If you tune your model to catch every fraudster (High Recall), you will inevitably flag some innocent customers (Lower Precision). If you tune your model to never falsely accuse an innocent customer (High Precision), you will inevitably miss some sophisticated fraudsters (Lower Recall). As a Data Scientist, your job is to ask the business stakeholder: Which error is more expensive? Is it worse to miss a sale (Low Recall)? Or is it worse to annoy a customer with irrelevant ads (Low Precision)? Implementing in Python Let's see how Scikit-Learn handles these metrics. We will assume we have already trained a Logistic Regression model named log_reg and have split our data into X_test and y_test.
from sklearn.metrics import confusion_matrix, classification_report
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# 1. Generate Predictions
# The model predicts 0 (No Churn) or 1 (Churn)
y_pred = log_reg.predict(X_test)
# 2. Create the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
# 3. Visualize the Matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.title('Confusion Matrix for Customer Churn')
plt.show()The code above generates a heatmap allowing you to visually inspect the True Positives versus the errors. However, calculating the math manually is tedious. Scikit-Learn provides a summary tool called the Classification Report.
# 4. Generate a full performance report
print(classification_report(y_test, y_pred))
Output Example:precision recall f1-score support
0 0.85 0.90 0.87 150
1 0.75 0.60 0.67 50
accuracy 0.82 200
macro avg 0.80 0.75 0.77 200
weighted avg 0.82 0.82 0.82 200Interpreting the Report: Class 1 (Churn): Precision (0.75): When the model predicts a customer will churn, it is correct 75% of the time. Recall (0.60): The model only caught 60% of the customers who actually churned. It missed 40% of them. F1-Score: This is the "Harmonic Mean" of Precision and Recall. It provides a single score that balances both metrics. If you need a balance between precision and recall, the F1-score is your go-to metric. Summary In this section, we moved beyond the "Accuracy Trap." You learned that in real-world business problems—especially those involving rare events like fraud or churn—accuracy is often misleading. By using the Confusion Matrix, we can dissect exactly how our model is making mistakes. Use Precision when the cost of a False Positive is high (e.g., spam filters, stock market buy signals). Use Recall when the cost of a False Negative is high (e.g., disease screening, safety defects). Now that we can accurately evaluate our models, we are ready to explore how to improve them. In the next chapter, we will look at Ensemble Methods, where we combine multiple models (like Random Forests) to achieve performance that a single Decision Tree could never match. Case Study: Predicting Employee Attrition We have reached the synthesis of our classification journey. In the previous sections, we mastered the algorithms (Logistic Regression and Decision Trees) and learned the language of critique (Precision, Recall, and the Confusion Matrix). Now, we leave the classroom and enter the boardroom. We are going to apply these techniques to a domain that was historically dominated by "gut feeling" but is rapidly becoming one of the most data-intensive functions in business: Human Resources (HR) or "People Analytics." In this case study, we will simulate a real-world project. You have been tasked by the Chief Human Resources Officer (CHRO) to solve a critical problem: Employee Attrition. The Business Problem Hiring new employees is expensive. Research suggests that the cost of replacing an employee ranges from 50% to 200% of their annual salary. Beyond the financial cost, high turnover lowers morale and results in lost institutional knowledge. The CHRO poses a challenge: "We know people leave, but we don't know who, and we don't know why. Can you build a model to predict which employees are at risk of leaving so we can intervene before they resign?" This is a classic Binary Classification problem. Input (X): Employee demographics, job role, satisfaction scores, overtime history, etc. Output (y): Attrition (Yes/No). Step 1: Data Preparation and Encoding For this case study, we will use a dataset commonly referenced in the industry (based on IBM HR Analytics data). It contains numerical data (like Age) and categorical data (like Department). As we discussed in the Feature Engineering chapter, machine learning models (mostly) speak math, not English. They cannot understand the string "Sales" or "Research." We must translate these categories into numbers. We will use a technique called One-Hot Encoding (or dummy variables). This process creates a new binary column for every unique category. A conceptual diagram of One-Hot Encoding. On the left, a single column named "Department" contains values "Sales", "R&D", and "HR". An arrow points to the right showing three new columns: "Department_Sales", "Department_RnD", and "Department_HR". The rows contain 1s and 0s indicating membership to those categories.
A conceptual diagram of One-Hot Encoding. On the left, a single column named "Department" contains values "Sales", "R&D", and "HR". An arrow points to the right showing three new columns: "Department_Sales", "Department_RnD", and "Department_HR". The rows contain 1s and 0s indicating membership to those categories. Let's look at the Python implementation using pandas:
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the dataset (hypothetical path)
df = pd.read_csv('hr_employee_attrition.csv')
# 1. Select relevant features
# We drop 'EmployeeCount' and 'StandardHours' as they are the same for everyone
features = ['Age', 'Department', 'DistanceFromHome', 'EnvironmentSatisfaction',
'OverTime', 'MonthlyIncome', 'JobRole']
target = 'Attrition' # Values are 'Yes' or 'No'
X = df[features]
y = df[target].apply(lambda x: 1 if x == 'Yes' else 0) # Convert Target to 1/0
# 2. One-Hot Encoding for categorical variables (Department, OverTime, JobRole)
# drop_first=True avoids multicollinearity (redundancy)X_encoded = pd.get_dummies(X, drop_first=True)
print("Original shape:", X.shape) print("Encoded shape:", X_encoded.shape) When you run this, you will notice the number of columns increases. The column OverTime (containing "Yes"/"No") becomes OverTime_Yes (containing 1/0). This prepares our data for the algorithm. Step 2: Training the Model For this problem, we will choose a Decision Tree Classifier. Why? Because in HR, explainability is paramount. If our model flags an employee as "High Risk," the HR manager will immediately ask, "Why?" A Decision Tree provides clear logic (e.g., "Because they work Overtime AND live far away") that a Neural Network or a complex ensemble might hide.
from sklearn.tree import DecisionTreeClassifier
# Split data: 80% for training, 20% for testingX_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
Initialize the Decision Tree
We limit depth to avoid overfitting (making the tree too complex)
clf = DecisionTreeClassifier(max_depth=4, random_state=42)
Train the model
clf.fit(X_train, y_train)
print("Model training complete.") Step 3: Evaluation and Business Logic Now comes the critical step: answering the stakeholder's question, "How good is the model?" In the previous section, we learned about the Confusion Matrix. Let's apply it here. In the context of Employee Attrition:
- True Positive (TP): We predicted they would leave, and they did. (Success: We might have saved them).
- True Negative (TN): We predicted they would stay, and they stayed. (Success).
- False Positive (FP): We predicted they would leave, but they stayed. (The "Crying Wolf" error).
- False Negative (FN): We predicted they would stay, but they left. (The "Missed Opportunity" error).
from sklearn.metrics import confusion_matrix, classification_report
# Make predictions on the unseen test set
y_pred = clf.predict(X_test)
# Generate the metrics
print(classification_report(y_test, y_pred))Interpreting the Results for Stakeholders: Let's assume your model produces the following Recall score for the "Leavers" class (Class 1): 0.45 (45%). If you present this simply as a number, you might fail to convey the value. You must translate this into business terms: "Current status: We currently react to resignations after they happen. > > New Model status: This model effectively identifies 45% of the employees who are about to resign before they turn in their letter. While it misses some (False Negatives), it gives HR a targeted list of at-risk employees to engage with, rather than guessing blindly." Step 4: Visualizing the Decision Logic The true power of the Decision Tree is visual. We can plot the tree to understand the root causes of attrition.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(20,10))
plot_tree(clf, feature_names=X_encoded.columns, class_names=['Stay', 'Leave'], filled=True, fontsize=10)
plt.show()
A visualization of a Decision Tree for HR data. The Root Node at the top shows "OverTime_Yes <= 0.5". The branch for "True" (No Overtime) goes left to a blue node indicating most people stay. The branch for "False" (Yes Overtime) goes right to a node asking "MonthlyIncome <= 3000". The visualization demonstrates how the model splits employees into risk pools based on specific criteria.A visualization of a Decision Tree for HR data. The Root Node at the top shows "OverTime_Yes <= 0.5". The branch for "True" (No Overtime) goes left to a blue node indicating most people stay. The branch for "False" (Yes Overtime) goes right to a node asking "MonthlyIncome <= 3000". The visualization demonstrates how the model splits employees into risk pools based on specific criteria. By analyzing this tree, you might discover insights to feed back to management: 1. The "Overtime" Split: The very first split often separates those who work overtime from those who don't. This suggests burnout is a primary driver. 2. The "Income" Split: Among those working overtime, low income creates a high-risk "leaf node." Step 5: Feature Importance Finally, we can extract the "Feature Importance" scores. This tells us which variables had the heaviest weight in the decision-making process.
import pandas as pd
# Create a dataframe of feature importance
importance = pd.DataFrame({'Feature': X_encoded.columns, 'Importance': clf.feature_importances_})
print(importance.sort_values(by='Importance', ascending=False).head(5))Sample Output: 1. OverTime_Yes: 0.28 2. MonthlyIncome: 0.21 3. Age: 0.15 4. DistanceFromHome: 0.10 5. JobRole_SalesRepresentative: 0.08 Conclusion: From Prediction to Policy This case study illustrates the transition from "Data Science" to "Business Intelligence." You started with raw data and ended with a strategic recommendation: To reduce turnover, the company should review its Overtime policies, specifically for lower-income Sales Representatives who live far from the office. We did not just predict who would leave; we used the transparency of the Decision Tree to understand what needs to change in the organization. In the next chapter, we will move away from Supervised Learning (where we have the answers) and explore Unsupervised Learning, where we ask the machine to discover hidden patterns in data without any guidance at all.
