Cluster Analysis in R
Build a strong intuition for how hierarchical clustering and k-means clustering work and how to interpret their results.
- Overview
- Libraries
- Calculating distance between observations
- Hierarchical clustering
- K-means clustering
- Case Study: National Occupational Mean Wage
Overview
Cluster analysis is a powerful toolkit in the data science workbench. It is used to find groups of observations (clusters) that share similar characteristics. These similarities can inform all kinds of business decisions; for example, in marketing, it is used to identify distinct groups of customers for which advertisements can be tailored. We will explore two commonly used clustering methods - hierarchical clustering and k-means clustering. We'll build a strong intuition for how they work and how to interpret their results. We'll develop this intuition by exploring three different datasets: soccer player positions, wholesale customer spending data, and longitudinal occupational wage data.
library(dplyr)
library(ggplot2)
library(dummies)  # for dummy.data.frame(); note this package has been archived on CRAN
library(readr)
library(dendextend)
library(purrr)
library(cluster)
library(tibble)
library(tidyr)
Calculating distance between observations
Cluster analysis seeks to find groups of observations that are similar to one another while the identified groups are different from each other. This similarity/difference is captured by a metric called distance. We will calculate the distance between observations for both continuous and categorical features, and we will develop an intuition for how the scales of features can affect distance.
two_players = data.frame(X=c(0, 9), Y=c(0,12))
two_players %>%
dist(method="euclidean")
three_players = data.frame(X=c(0, 9, -2), Y=c(0,12, 19))
three_players %>%
dist()
players <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/94af7037c5834527cc8799a9723ebf3b5af73015/lineup.rds")))
head(players)
# Plot the positions of the players
ggplot(two_players, aes(x = X, y = Y)) +
geom_point() +
# Assuming a 40x60 field
lims(x = c(-30,30), y = c(-20, 20))
# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]
# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$X - player2$X)^2 + (player1$Y - player2$Y)^2 )
player_distance
The dist() function makes life easier when working with many dimensions and observations.
dist(three_players)
three_players
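Feature scale also matters: a feature measured in larger units will dominate the Euclidean distance. A minimal sketch (with hypothetical height and weight values) shows how standardizing the columns with scale() before calling dist() puts the features on equal footing:
# Hypothetical players: height in meters, weight in kilograms
players_hw <- data.frame(height = c(1.8, 1.9, 1.7),
                         weight = c(75, 90, 60))
# Unscaled: the weight column (larger units) dominates the distances
dist(players_hw)
# Standardize each column to mean 0 and sd 1, then recompute
dist(scale(players_hw))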
job_survey = read.csv("datasets/job_survey.csv")
job_survey
# Dummify the Survey Data
dummy_survey <- dummy.data.frame(job_survey)
# Calculate the Jaccard distance between respondents (method = "binary")
dist_survey <- dist(dummy_survey, method="binary")
# Print the Distance Matrix
dist_survey
This distance metric successfully captures that observations 1 and 2 are identical (a distance of 0).
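To see what method = "binary" (the Jaccard distance) computes, here is a quick hand check on two hypothetical binary vectors: the distance is one minus the ratio of shared 1s to positions where either vector has a 1.
a <- c(1, 0, 1, 0)
b <- c(1, 1, 0, 0)
both   <- sum(a == 1 & b == 1)        # positions where both are 1
either <- sum(a == 1 | b == 1)        # positions where at least one is 1
1 - both / either                     # manual Jaccard distance
dist(rbind(a, b), method = "binary")  # agrees with the manual calculation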
Hierarchical clustering
How do you find groups of similar observations (clusters) in data using the calculated distances? We will explore the fundamental principles of hierarchical clustering - the linkage criteria and the dendrogram plot - and how both are used to build clusters. We will also explore data from a wholesale distributor in order to perform market segmentation of clients using their spending habits.
head(players)
dist_players = dist(players, method = "euclidean")
hc_players = hclust(dist_players, method = "complete")
(cluster_assignments <- cutree(hc_players, k=2))
head(
players_clustered <- players %>%
mutate(cluster = cluster_assignments),
10)
players_clustered
The data frame contains the x & y positions of 12 players at the start of a 6v6 soccer game, to which we have added cluster assignments based on the following parameters:
- Distance: Euclidean
- Number of Clusters (k): 2
- Linkage Method: Complete
# Count the cluster assignments
count(players_clustered, cluster)
players_clustered %>%
ggplot(aes(x=x, y=y, color=factor(cluster)))+
geom_point()
plot(hc_players)
# Prepare the Distance Matrix
dist_players <- dist(players)
# Generate hclust for complete, single & average linkage methods:
# - complete: inter-cluster distance is the largest pairwise distance
# - single:   the smallest pairwise distance
# - average:  the mean of all pairwise distances
hc_complete <- hclust(dist_players, method="complete")
hc_single <- hclust(dist_players, method="single")
hc_average <- hclust(dist_players, method="average")
# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = 'Complete Linkage')
plot(hc_single, main = 'Single Linkage')
plot(hc_average, main = 'Average Linkage')
par(mfrow = c(1,1))  # reset the layout so later plots draw normally
dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, h=15)
plot(dend_colored)
dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, h=10)
plot(dend_colored)
dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, k=2)
plot(dend_colored)
cluster_assignments <- cutree(hc_players, h=15)
cluster_assignments
head(
players_clustered <-
players %>%
mutate(cluster = cluster_assignments),
10)
dist_players <- dist(players, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")
# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)
# Plot the dendrogram
plot(dend_players)
# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)
# Plot the dendrogram with clusters colored below height 20
plot(dend_20)
# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h=40)
# Plot the dendrogram with clusters colored below height 40
plot(dend_40)
The height of any branch is determined by the linkage and distance decisions (in this case complete linkage and Euclidean distance). The members of the clusters that form below a chosen height have a maximum linkage distance amongst themselves that is less than that height.
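We can sanity-check that property directly. A sketch using dist_players and hc_players from above: with complete linkage, every pair of players in a cluster formed by a cut at height h should be within distance h of each other.
h_cut <- 20
clusters_h <- cutree(hc_players, h = h_cut)
dist_mat <- as.matrix(dist_players)
max_within <- sapply(unique(clusters_h), function(cl) {
  members <- which(clusters_h == cl)
  max(dist_mat[members, members, drop = FALSE])
})
max_within               # maximum intra-cluster distance for each cluster
all(max_within < h_cut)  # should be TRUE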
head(
ws_customers <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/3558d2b5564714d85120cb77a904a2859bb3d03e/ws_customers.rds"))) ,
10
)
# Calculate Euclidean distance between customers
dist_customers <- dist(ws_customers, method="euclidean")
# Generate a complete linkage analysis
hc_customers <- hclust(dist_customers, method="complete")
# Plot the dendrogram
plot(hc_customers)
# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h=15000)
# Generate the segmented customers data frame
head(
segment_customers <- mutate(ws_customers, cluster = clust_customers),
10
)
Explore wholesale customer clusters
Since we are working with more than two dimensions, it would be challenging to visualize a scatter plot of the clusters; instead, we will rely on summary statistics to explore them. We will analyze the mean amount spent in each cluster for all three categories.
# Count the number of customers that fall into each cluster
count(segment_customers, cluster)
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h=15000)
# Plot the colored dendrogram
plot(dend_colored)
# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(list(mean))
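Note that summarise_all() is superseded in newer versions of dplyr; the same summary can be written with across() (a sketch that should produce identical output):
segment_customers %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))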
- Customers in cluster 1 spent more money on Milk than any other cluster.
- Customers in cluster 3 spent more money on Grocery than any other cluster.
- Customers in cluster 4 spent more money on Frozen goods than any other cluster.
- The majority of customers fell into cluster 2 and did not show any excessive spending in any category.
Whether these clusters are meaningful depends heavily on the business context of the clustering.
K-means clustering
K-means clustering partitions observations into a pre-specified number of clusters, k, by repeatedly assigning each observation to the nearest cluster center and recomputing the centers until the assignments stabilize. We will revisit the soccer lineup data to build our intuition.
head(
lineup <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/94af7037c5834527cc8799a9723ebf3b5af73015/lineup.rds"))) ,
10
)
model = kmeans(lineup, centers = 2)
head(model)    # a kmeans fit is a list: cluster, centers, totss, withinss, ...
model$cluster  # the cluster assignment vector
head(
lineup_clustered <- lineup %>%
mutate(cluster=model$cluster),
10
)
# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)
# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster
# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)
# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
# Build a kmeans model
model_km3 <- kmeans(lineup, centers=3)
# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster
# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster=clust_km3)
# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
geom_point()
To choose a value of k we can use the elbow method: run k-means for a range of k values, record the total within-cluster sum of squares for each, and look for the "elbow" where adding more clusters stops producing large improvements.
set.seed(42)  # k-means uses random starting centers; fix the seed for reproducibility
tot_withinss <- map_dbl(1:10, function(k){
  model_l <- kmeans(lineup, centers = k)
  model_l$tot.withinss
})
head(
elbow_df_l <- data.frame(
k=1:10,
tot_withinss = tot_withinss
), 10)
elbow_df_l %>%
ggplot(aes(x=k, y=tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
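tot.withinss is nothing mysterious: it is the sum of squared distances from each observation to its assigned cluster center. A quick sketch to confirm, reusing the model_km2 fit from above:
# Look up each player's assigned center, then sum the squared differences
centers_per_obs <- model_km2$centers[model_km2$cluster, ]
sum((as.matrix(lineup) - centers_per_obs)^2)
model_km2$tot.withinss  # should match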
Silhouette analysis: observation-level performance
Silhouette analysis allows you to calculate how similar each observation is to its assigned cluster relative to the other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows:
- Values close to 1 suggest that the observation is well matched to the assigned cluster
- Values close to 0 suggest that the observation is borderline matched between two clusters
- Values close to -1 suggest that the observation may be assigned to the wrong cluster
Calculating S(i)
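The standard definition, which is what cluster::silhouette() computes, uses two quantities for each observation i: a(i), the average distance from i to the other members of its own cluster, and b(i), the average distance from i to the members of the nearest neighboring cluster. The silhouette width is then:
S(i) = (b(i) - a(i)) / max(a(i), b(i))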
pam_k3 <- pam(lineup, k=3)
head(pam_k3$silinfo$widths, 10)
sil_plot <- silhouette(pam_k3)
plot(sil_plot)
# Generate a k-medoids model using the pam() function with k = 2
pam_k2 <- pam(lineup, k = 2)
# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))
Notice that for k = 2 no observation has a silhouette width close to 0, whereas for k = 3 observation 3 has a silhouette width that is close to 0 and negative. This suggests that k = 3 is not the right number of clusters.
# Average silhouette width across all observations
pam_k3$silinfo$avg.width
- 1: Observations are well matched to their clusters
- 0: Observations are on the border between clusters
- -1: Observations are poorly matched to their clusters
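The average width is simply the mean of the per-observation silhouette widths; a quick sketch to confirm, using the pam_k3 model from above:
# The sil_width column of silinfo$widths holds each observation's S(i)
mean(pam_k3$silinfo$widths[, "sil_width"])
pam_k3$silinfo$avg.width  # should match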
sil_width <- map_dbl(2:10, function(k){
model_sil <- pam(lineup, k=k)
model_sil$silinfo$avg.width
})
head(sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
), 10)
sil_df %>%
ggplot(aes(x=k, y=sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
head(ws_customers, 10)
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(x = ws_customers, k = k)
model$silinfo$avg.width
})
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
k = 2 has the highest average silhouette width and is the “best” value of k; we will move forward with it.
set.seed(42)
# Build a k-means model for the customers_spend with a k of 2
model_customers <- kmeans(ws_customers, centers=2)
# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster
# Build the segment_customers data frame
segment_customers <- mutate(ws_customers, cluster = clust_customers)
# Calculate the size of each cluster
count(segment_customers, cluster)
# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(list(mean))
It seems that in this case cluster 1 consists of customers who proportionally spend more on Frozen food, while cluster 2 customers spend more on Milk and Grocery. When we explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gave us 2. Both of these results are valid; deciding which one is appropriate would require more subject matter expertise. Generating clusters is a science, but interpreting them is an art.
Occupational wage data
We will use data from the Occupational Employment Statistics (OES) program, which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. We would like to use this data to identify clusters of occupations that maintained similar income trends.
oes <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/1e1ec9f146a25d7c71a6f6f0f46c3de7bcefd36c/oes.rds")))
head(oes)
- 22 Occupation Observations
- 15 Measurements of Average Income from 2001-2016
glimpse(oes)
summary(oes)
dim(oes)
There are no missing values, no categorical variables, and the features are on the same scale.
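Those claims are easy to verify in code; a quick sketch using the oes object loaded above:
sum(is.na(oes))                              # expect 0 missing values
all(sapply(as.data.frame(oes), is.numeric))  # expect TRUE: no categorical features
range(oes)                                   # all features are in dollars, on one scale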
# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")
# Generate an average linkage analysis
hc_oes <- hclust(dist_oes, method = "average")
# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)
# Plot the dendrogram
plot(dend_oes)
# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)
# Plot the colored dendrogram
plot(dend_colored)
Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another. Let's continue this exploration.
# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')
# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)
# Generate the segmented oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)
# Create a tidy data frame by gathering the year and values into two columns
head(gathered_oes <- gather(data = clust_oes,
key = year,
value = mean_salary,
-occupation, -cluster), 10)
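gather() still works but is superseded in newer versions of tidyr; the same reshape could be written with pivot_longer() (a sketch; gathered_oes2 is just an illustrative name):
gathered_oes2 <- pivot_longer(clust_oes,
                              cols = c(-occupation, -cluster),
                              names_to = "year",
                              values_to = "mean_salary")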
# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))
From this work it looks like the Management & Legal professions (cluster 1) experienced the most rapid growth over these 15 years. Let's see what we can learn by exploring this data using k-means.
# Use map_dbl to run many models with varying value of k (centers)
set.seed(42)  # fix the seed; k-means uses random starting centers
tot_withinss <- map_dbl(1:10, function(k){
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})
# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)
# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)
# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(oes, k = k)
model$silinfo$avg.width
})
# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)
# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)
It seems that this analysis results in another value of k; this time 7 is the top contender (although 2 comes very close).
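As a possible next step, one could fit the k = 7 medoids model and repeat the earlier profiling to judge whether seven occupation groups are interpretable; a sketch using the df_oes data frame built above:
pam_oes <- pam(oes, k = 7)
clust_oes_k7 <- mutate(df_oes, cluster = pam_oes$clustering)
count(clust_oes_k7, cluster)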