# Overview

Cluster analysis is a powerful toolkit in the data science workbench. It is used to find groups of observations (clusters) that share similar characteristics. These similarities can inform all kinds of business decisions; for example, in marketing, it is used to identify distinct groups of customers for which advertisements can be tailored. We will explore two commonly used clustering methods - hierarchical clustering and k-means clustering. We'll build a strong intuition for how they work and how to interpret their results. We'll develop this intuition by exploring three different datasets: soccer player positions, wholesale customer spending data, and longitudinal occupational wage data.

# Libraries

library(dplyr)
library(ggplot2)
library(dummies)
library(dendextend)
library(purrr)
library(cluster)
library(tibble)
library(tidyr)


# Calculating distance between observations

Cluster analysis seeks to find groups of observations that are similar to one another, but the identified groups are different from each other. This similarity/difference is captured by the metric called distance. We will calculate the distance between observations for both continuous and categorical features. We will also develop an intuition for how the scales of features can affect distance.

## What is cluster analysis?

A form of exploratory data analysis (EDA) in which observations are divided into meaningful groups that share common characteristics (features).

### When to cluster?

• Using consumer behavior data to identify distinct segments within a market.

## Distance between two observations

### Distance vs Similarity

$distance = 1 - similarity$

### dist() function

two_players = data.frame(X=c(0, 9), Y=c(0,12))
two_players %>%
dist(method="euclidean")

   1
2 15
three_players = data.frame(X=c(0, 9, -2), Y=c(0,12, 19))
three_players %>%
dist()

         1        2
2 15.00000
3 19.10497 13.03840
players <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/94af7037c5834527cc8799a9723ebf3b5af73015/lineup.rds")))

head(players)

  x   y
 -1   1
 -2  -3
  8   6
  7  -8
-12   8
-15   0
# Plot the positions of the players
ggplot(two_players, aes(x = X, y = Y)) +
geom_point() +
# Assuming a 40x60 field
lims(x = c(-30,30), y = c(-20, 20))

# Split the players data frame into two observations
player1 <- two_players[1, ]
player2 <- two_players[2, ]

# Calculate and print their distance using the Euclidean Distance formula
player_distance <- sqrt( (player1$X - player2$X)^2 + (player1$Y - player2$Y)^2 )
player_distance

15

The dist() function makes life easier when working with many dimensions and observations.

dist(three_players)

         1        2
2 15.00000
3 19.10497 13.03840
three_players

X Y
0 0
9 12
-2 19

## The importance of scale

When a variable is on a larger scale than the other variables in the data, it may disproportionately influence the distance calculated between observations.

The scale() function by default centers and scales the column features.
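A minimal sketch of why this matters, using a made-up data frame with two features on very different scales: the raw distance is dominated by the large-scale column, while scaling puts both features on equal footing.

```r
# Two features on very different scales (illustrative values)
trees <- data.frame(height_m = c(10, 12), girth_cm = c(300, 150))

# Unscaled: the girth_cm column dominates the distance
dist(trees)

# scale() centers each column to mean 0 and rescales it to sd 1
scaled_trees <- scale(trees)
dist(scaled_trees)
```

After scaling, a two-unit difference in height contributes just as much to the distance as a large difference in girth.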

## Measuring distance for categorical data

### Dummification in R

dummy.data.frame()

job_survey = read.csv("datasets/job_survey.csv")
job_survey

job_satisfaction is_happy
Low No
Low No
Hi Yes
Low No
Mid No
# Dummify the Survey Data
dummy_survey <- dummy.data.frame(job_survey)

# Calculate the Distance
dist_survey <- dist(dummy_survey, method="binary")
# Print the Distance Matrix
dist_survey

          1         2         3         4
2 0.0000000
3 1.0000000 1.0000000
4 0.0000000 0.0000000 1.0000000
5 0.6666667 0.6666667 1.0000000 0.6666667

This distance metric successfully captured that observations 1 and 2 are identical (distance of 0).
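With method = "binary", dist() computes the Jaccard distance: 1 minus the ratio of positions where both observations have a 1 to positions where either has a 1. A hand computation on two made-up dummy-coded vectors confirms the idea:

```r
# Two dummy-coded observations (illustrative vectors)
a <- c(1, 0, 0, 1)
b <- c(0, 1, 0, 1)

# Jaccard distance: 1 - (positions where both are 1) / (positions where either is 1)
both   <- sum(a == 1 & b == 1)
either <- sum(a == 1 | b == 1)
1 - both / either  # 1 - 1/3 = 0.6666667

# Matches dist() with the binary method
dist(rbind(a, b), method = "binary")
```

Note that shared 0s (features neither observation has) do not make two observations more similar under this metric, which is exactly what you want for dummified categorical data.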

# Hierarchical clustering

How do you find groups of similar observations (clusters) in data using the calculated distances? We will explore the fundamental principles of hierarchical clustering - the linkage criteria and the dendrogram plot - and how both are used to build clusters. We will also explore data from a wholesale distributor in order to perform market segmentation of clients using their spending habits.

## Comparing more than two observations

### Hierarchical clustering

• Complete Linkage: maximum distance between two sets
• Single Linkage: minimum distance between two sets
• Average Linkage: average distance between two sets
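To see how the linkage choice plays out, a small sketch reusing the three_players coordinates from earlier compares complete and single linkage directly: with three points the first merge is identical, but the height of the final merge differs.

```r
three_players <- data.frame(X = c(0, 9, -2), Y = c(0, 12, 19))
d <- dist(three_players)

# Complete linkage merges clusters by their maximum pairwise distance,
# single linkage by their minimum; compare the merge heights
hclust(d, method = "complete")$height  # final merge at ~19.10 (max of 15 and 19.10)
hclust(d, method = "single")$height    # final merge at 15 (min of 15 and 19.10)
```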

## Capturing K clusters

### Hierarchical clustering in R

• hclust() function to calculate the iterative linkage steps
• cutree() function to extract the cluster assignments for the desired number (k) of clusters.

positions of 12 players at the start of a 6v6 soccer match.

head(players)

x y
-1 1
-2 -3
8 6
7 -8
-12 8
-15 0
dist_players = dist(players, method = "euclidean")
hc_players = hclust(dist_players, method = "complete")


### Extracting K clusters

(cluster_assignments <- cutree(hc_players, k=2))

1 1 2 2 1 1 1 2 2 2 1 2
head(
players_clustered <- players %>%
mutate(cluster = cluster_assignments),
10)

x y cluster
-1 1 1
-2 -3 1
8 6 2
7 -8 2
-12 8 1
-15 0 1
-13 -10 1
15 16 2
21 2 2
12 -15 2

The players_clustered data frame contains the x & y positions of 12 players at the start of a 6v6 soccer game, to which we have added cluster assignments based on the following parameters:

• Distance: Euclidean
• Number of Clusters (k): 2

### Exploring the clusters

# Count the cluster assignments
count(players_clustered, cluster)

cluster n
1 6
2 6

### Visualizing K Clusters

Because clustering analysis is always in part qualitative, it is incredibly important to have the necessary tools to explore the results of the clustering.

players_clustered %>%
ggplot(aes(x=x, y=y, color=factor(cluster)))+
geom_point()


## Visualizing the dendrogram

### Plotting the dendrogram

plot(hc_players)

# Prepare the Distance Matrix
dist_players <- dist(players)

# Generate hclust for complete, single & average linkage methods
hc_complete <- hclust(dist_players, method="complete")
hc_single <- hclust(dist_players, method="single")
hc_average <- hclust(dist_players, method="average")

# Plot & Label the 3 Dendrograms Side-by-Side
# Hint: To see these Side-by-Side run the 4 lines together as one command
par(mfrow = c(1,3))
plot(hc_complete, main = "Complete Linkage")
plot(hc_single, main = "Single Linkage")
plot(hc_average, main = "Average Linkage")


### Height of the tree

An advantage of working with a clustering method like hierarchical clustering is that you can describe the relationships between your observations based on both the distance metric and the linkage metric selected (the combination of which defines the height of the tree).

## Cutting the tree

### Coloring the dendrogram - height

dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, h=15)
plot(dend_colored)

dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, h=10)
plot(dend_colored)


### Coloring the dendrogram - K

dend_players = as.dendrogram(hc_players)
dend_colored = color_branches(dend_players, k=2)
plot(dend_colored)


### cutree() using height

cluster_assignments <- cutree(hc_players, h=15)
cluster_assignments

1 1 2 3 4 4 5 2 6 3 4 6
head(
players_clustered <-
players %>%
mutate(cluster = cluster_assignments),
10)

x y cluster
-1 1 1
-2 -3 1
8 6 2
7 -8 3
-12 8 4
-15 0 4
-13 -10 5
15 16 2
21 2 6
12 -15 3
dist_players <- dist(players, method = 'euclidean')
hc_players <- hclust(dist_players, method = "complete")

# Create a dendrogram object from the hclust variable
dend_players <- as.dendrogram(hc_players)

# Plot the dendrogram
plot(dend_players)

# Color branches by cluster formed from the cut at a height of 20 & plot
dend_20 <- color_branches(dend_players, h = 20)

# Plot the dendrogram with clusters colored below height 20
plot(dend_20)

# Color branches by cluster formed from the cut at a height of 40 & plot
dend_40 <- color_branches(dend_players, h=40)

# Plot the dendrogram with clusters colored below height 40
plot(dend_40)


The height of any branch is determined by the linkage and distance decisions (in this case complete linkage and Euclidean distance). The members of the clusters that form below a given height have a maximum linkage distance amongst themselves that is less than that height.

## Making sense of the clusters

### Wholesale dataset

• 45 observations
• 3 features:
• Milk Spending
• Grocery Spending
• Frozen Food Spending
head(ws_customers, 10)

Milk Grocery Frozen
11103 12469 902
2013 6550 909
1897 5234 417
1304 3643 3045
3199 6986 1455
4560 9965 934
879 2060 264
6243 6360 824
13316 20399 1809
5302 9785 364

### Exploring more than 2 dimensions

• Plot 2 dimensions at a time
• Visualize using PCA
• Summary statistics by feature
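For the PCA option, a hedged sketch (assuming ws_customers has been loaded as in the exercises) projects the observations onto the first two principal components so the high-dimensional clusters can be eyeballed in 2-D:

```r
library(ggplot2)

# Project the spending data onto 2 principal components (sketch;
# assumes a numeric data frame like ws_customers is available)
pca <- prcomp(ws_customers, scale. = TRUE)
pca_df <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2])

ggplot(pca_df, aes(x = PC1, y = PC2)) +
  geom_point()
```

Once cluster assignments exist, mapping them to color in this plot gives a quick visual sanity check of the segmentation.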

### Segment wholesale customers

# Calculate Euclidean distance between customers
dist_customers <- dist(ws_customers, method="euclidean")

# Generate a complete linkage analysis
hc_customers <- hclust(dist_customers, method="complete")

# Plot the dendrogram
plot(hc_customers)

# Create a cluster assignment vector at h = 15000
clust_customers <- cutree(hc_customers, h=15000)

# Generate the segmented customers data frame
segment_customers <- mutate(ws_customers, cluster = clust_customers)

head(segment_customers, 10)

Milk Grocery Frozen cluster
11103 12469 902 1
2013 6550 909 2
1897 5234 417 2
1304 3643 3045 2
3199 6986 1455 2
4560 9965 934 2
879 2060 264 2
6243 6360 824 2
13316 20399 1809 3
5302 9785 364 2

### Explore wholesale customer clusters

Since we are working with more than 2 dimensions it would be challenging to visualize a scatter plot of the clusters, instead we will rely on summary statistics to explore these clusters. We will analyze the mean amount spent in each cluster for all three categories.

# Count the number of customers that fall into each cluster
count(segment_customers, cluster)

cluster n
1 5
2 29
3 5
4 6
# Color the dendrogram based on the height cutoff
dend_customers <- as.dendrogram(hc_customers)
dend_colored <- color_branches(dend_customers, h=15000)

# Plot the colored dendrogram
plot(dend_colored)

# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(list(mean))

cluster Milk Grocery Frozen
1 16950.000 12891.400 991.200
2 2512.828 5228.931 1795.517
3 10452.200 22550.600 1354.800
4 1249.500 3916.833 10888.667
• Customers in cluster 1 spent more money on Milk than any other cluster.
• Customers in cluster 3 spent more money on Grocery than any other cluster.
• Customers in cluster 4 spent more money on Frozen goods than any other cluster.
• The majority of customers fell into cluster 2 and did not show any excessive spending in any category.

Whether these clusters are meaningful depends heavily on the business context of the clustering.

# K-means clustering

Build an understanding of the principles behind the k-means algorithm, explore how to select the right k when it isn't previously known, and revisit the wholesale data from a different perspective.
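The algorithm alternates two steps until assignments stop changing: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal sketch of one such iteration, with made-up 1-D data:

```r
# One assign/update iteration of the k-means idea (illustrative data)
x <- c(1, 2, 3, 10, 11, 12)
centers <- c(2, 11)                      # current centroids

# Step 1: assign each point to its nearest centroid
assignments <- apply(abs(outer(x, centers, "-")), 1, which.min)

# Step 2: update each centroid to the mean of its assigned points
new_centers <- tapply(x, assignments, mean)
new_centers  # 2 and 11: already converged for this toy data
```

kmeans() runs exactly this loop (in multiple dimensions, with random starting centroids) until convergence.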

### Introduction to K-means

head(lineup, 10)

x y
-1 1
-2 -3
8 6
7 -8
-12 8
-15 0
-13 -10
15 16
21 2
12 -15

### kmeans()

model = kmeans(lineup, centers = 2)


### Assigning clusters

head(model)

$cluster
2 2 1 1 2 2 2 1 1 1 2 1

$centers
          x          y
1  14.83333  0.1666667
2 -11.33333 -0.5000000

$totss
3489.917

$withinss
863.6667 570.8333

$tot.withinss
1434.5

$betweenss
2055.417

model$cluster

2 2 1 1 2 2 2 1 1 1 2 1

head(
lineup_clustered <- lineup %>%
mutate(cluster = model$cluster),
10
)

x y cluster
-1 1 2
-2 -3 2
8 6 1
7 -8 1
-12 8 2
-15 0 2
-13 -10 2
15 16 1
21 2 1
12 -15 1

### K-means on a soccer field

# Build a kmeans model
model_km2 <- kmeans(lineup, centers = 2)

# Extract the cluster assignment vector from the kmeans model
clust_km2 <- model_km2$cluster

# Create a new data frame appending the cluster assignment
lineup_km2 <- mutate(lineup, cluster = clust_km2)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km2, aes(x = x, y = y, color = factor(cluster))) +
geom_point()


### K-means on a soccer field (part 2)

# Build a kmeans model
model_km3 <- kmeans(lineup, centers = 3)

# Extract the cluster assignment vector from the kmeans model
clust_km3 <- model_km3$cluster

# Create a new data frame appending the cluster assignment
lineup_km3 <- mutate(lineup, cluster=clust_km3)

# Plot the positions of the players and color them using their cluster
ggplot(lineup_km3, aes(x = x, y = y, color = factor(cluster))) +
geom_point()


## Evaluating different values of K by eye

### Generating the elbow plot

tot_withinss <- map_dbl(1:10, function(k){
model_l <- kmeans(lineup, centers = k)
model_l$tot.withinss
})

head(
elbow_df_l <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
), 10)

 k tot_withinss
 1    3489.9167
 2    1434.5000
 3     881.2500
 4     622.5000
 5     481.6667
 6     265.1667
 7     350.5000
 8      96.5000
 9      82.0000
10      73.5000

### Generating the elbow plot

elbow_df_l %>%
ggplot(aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)


## Silhouette analysis: observation-level performance

### Silhouette analysis

Silhouette analysis allows you to calculate how similar each observation is to its assigned cluster relative to other clusters. This metric (silhouette width) ranges from -1 to 1 for each observation in your data and can be interpreted as follows:

• Values close to 1 suggest that the observation is well matched to the assigned cluster
• Values close to 0 suggest that the observation is borderline matched between two clusters
• Values close to -1 suggest that the observation may be assigned to the wrong cluster

### Calculating S(i)

pam_k3 <- pam(lineup, k = 3)
head(pam_k3$silinfo$widths, 10)

   cluster neighbor    sil_width
4        1        2  0.465320054
2        1        3  0.321729341
10       1        2  0.311385893
1        1        3  0.271890169
9        2        1  0.443606497
8        2        1  0.398547473
12       2        1  0.393982685
3        2        1 -0.009151755
11       3        1  0.546797052
6        3        1  0.529967901

### Silhouette plot

sil_plot <- silhouette(pam_k3)
plot(sil_plot)

# Generate a k-means model using the pam() function with a k = 2
pam_k2 <- pam(lineup, k = 2)

# Plot the silhouette visual for the pam_k2 model
plot(silhouette(pam_k2))


For k = 2, no observation has a silhouette width close to 0. For k = 3, observation 3 is close to 0 and negative, which suggests that k = 3 is not the right number of clusters.
### Average silhouette width

pam_k3$silinfo$avg.width

0.353414012920685

• 1: Well matched to each cluster
• 0: On the border between clusters
• -1: Poorly matched to each cluster

### Highest average silhouette width

sil_width <- map_dbl(2:10, function(k){
model_sil <- pam(lineup, k = k)
model_sil$silinfo$avg.width
})

head(sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
), 10)

 k sil_width
 2 0.4164141
 3 0.3534140
 4 0.3535534
 5 0.3724115
 6 0.3436130
 7 0.3236397
 8 0.3275222
 9 0.2547311
10 0.2099424

### Choosing K using average silhouette width

sil_df %>%
ggplot(aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)


## Making sense of the K-means clusters

### Segmenting with K-means

• Estimate the "best" k using average silhouette width
• Run k-means with the suggested k
• Characterize the spending habits of the resulting clusters of customers

### Revisiting wholesale data: "Best" k

head(ws_customers, 10)

 Milk Grocery Frozen
11103   12469    902
 2013    6550    909
 1897    5234    417
 1304    3643   3045
 3199    6986   1455
 4560    9965    934
  879    2060    264
 6243    6360    824
13316   20399   1809
 5302    9785    364

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(x = ws_customers, k = k)
model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)

k = 2 has the highest average silhouette width, so it is the "best" value of k and the one we will move forward with.

### Revisiting wholesale data: Exploration

From the previous analysis we found that k = 2 has the highest average silhouette width. We will continue to analyze the wholesale customer data by building and exploring a k-means model with 2 clusters.
set.seed(42)

# Build a k-means model for ws_customers with a k of 2
model_customers <- kmeans(ws_customers, centers = 2)

# Extract the vector of cluster assignments from the model
clust_customers <- model_customers$cluster

# Build the segment_customers data frame
segment_customers <- mutate(ws_customers, cluster = clust_customers)

# Calculate the size of each cluster
count(segment_customers, cluster)

cluster n
1 35
2 10
# Calculate the mean for each category
segment_customers %>%
group_by(cluster) %>%
summarise_all(list(mean))

cluster Milk Grocery Frozen
1 2296.257 5004 3354.343
2 13701.100 17721 1173.000

It seems that in this case cluster 1 consists of customers who proportionally spend more on Frozen food, while cluster 2 customers spend more on Milk and Grocery. When we explored this data using hierarchical clustering, the method resulted in 4 clusters, while k-means gave us 2. Both of these results are valid; deciding which one is appropriate requires subject-matter expertise. Generating clusters is a science, but interpreting them is an art.

# Case Study: National Occupational mean wage

Explore how the average salary amongst professions has changed over time.

## Occupational wage data

data from the Occupational Employment Statistics (OES) program which produces employment and wage estimates annually. This data contains the yearly average income from 2001 to 2016 for 22 occupation groups. We would like to use this data to identify clusters of occupations that maintained similar income trends.

oes <- readRDS(gzcon(url("https://assets.datacamp.com/production/repositories/1219/datasets/1e1ec9f146a25d7c71a6f6f0f46c3de7bcefd36c/oes.rds")))

2001 2002 2003 2004 2005 2006 2007 2008 2010 2011 2012 2013 2014 2015 2016
Management 70800 78870 83400 87090 88450 91930 96150 100310 105440 107410 108570 110550 112490 115020 118020
Business Operations 50580 53350 56000 57120 57930 60000 62410 64720 67690 68740 69550 71020 72410 73800 75070
Computer Science 60350 61630 64150 66370 67100 69240 72190 74500 77230 78730 80180 82010 83970 86170 87880
Architecture/Engineering 56330 58020 60390 63060 63910 66190 68880 71430 75550 77120 79000 80100 81520 82980 84300
Life/Physical/Social Sci. 49710 52380 54930 57550 58030 59660 62020 64280 66390 67470 68360 69400 70070 71220 72930
Community Services 34190 34630 35800 37050 37530 39000 40540 41790 43180 43830 44240 44710 45310 46160 47200
• 22 Occupation Observations
• 15 Measurements of Average Income from 2001-2016
glimpse(oes)

 num [1:22, 1:15] 70800 50580 60350 56330 49710 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:22] "Management" "Business Operations" "Computer Science" "Architecture/Engineering" ...
  ..$ : chr [1:15] "2001" "2002" "2003" "2004" ...

summary(oes)

      2001            2002            2003            2004
Min.   :16720   Min.   :17180   Min.   :17400   Min.   :17620
1st Qu.:26728   1st Qu.:27393   1st Qu.:27858   1st Qu.:28535
Median :34575   Median :35205   Median :36180   Median :37335
Mean   :37850   Mean   :39701   Mean   :41018   Mean   :42275
3rd Qu.:49875   3rd Qu.:53108   3rd Qu.:55733   3rd Qu.:57443
Max.   :70800   Max.   :78870   Max.   :83400   Max.   :87090
2005            2006            2007            2008
Min.   :17840   Min.   :18430   Min.   :19440   Min.   : 20220
1st Qu.:29043   1st Qu.:29688   1st Qu.:30810   1st Qu.: 31643
Median :37790   Median :39030   Median :40235   Median : 41510
Mean   :42775   Mean   :44329   Mean   :46074   Mean   : 47763
3rd Qu.:58005   3rd Qu.:59915   3rd Qu.:62313   3rd Qu.: 64610
Max.   :88450   Max.   :91930   Max.   :96150   Max.   :100310
2010             2011             2012             2013
Min.   : 21240   Min.   : 21430   Min.   : 21380   Min.   : 21580
1st Qu.: 32863   1st Qu.: 33430   1st Qu.: 33795   1st Qu.: 34120
Median : 42995   Median : 43610   Median : 44055   Median : 44565
Mean   : 49758   Mean   : 50555   Mean   : 51077   Mean   : 51800
3rd Qu.: 67365   3rd Qu.: 68423   3rd Qu.: 69253   3rd Qu.: 70615
Max.   :105440   Max.   :107410   Max.   :108570   Max.   :110550
2014             2015             2016
Min.   : 21980   Min.   : 22850   Min.   : 23850
1st Qu.: 34718   1st Qu.: 35425   1st Qu.: 36350
Median : 45265   Median : 46075   Median : 46945
Mean   : 52643   Mean   : 53785   Mean   : 55117
3rd Qu.: 71825   3rd Qu.: 73155   3rd Qu.: 74535
Max.   :112490   Max.   :115020   Max.   :118020  
dim(oes)

22 15

There are no missing values, no categorical variables, and the features are on the same scale.
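These claims can be verified directly. A short sketch of the checks (assuming oes has been loaded as above):

```r
# Verify the preprocessing assumptions for the oes matrix
sum(is.na(oes))   # no missing values: expect 0
is.numeric(oes)   # no categorical features: expect TRUE
range(oes)        # one scale: every value is an annual salary in dollars
```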

### Next steps: hierarchical clustering

• Evaluate whether pre-processing is necessary
• Create a distance matrix
• Build a dendrogram
• Extract clusters from dendrogram
• Explore resulting clusters

### Hierarchical clustering: Occupation trees

The oes data is ready for hierarchical clustering without any preprocessing steps necessary. We will take the necessary steps to build a dendrogram of occupations based on their yearly average salaries and propose clusters using a height of 100,000.

# Calculate Euclidean distance between the occupations
dist_oes <- dist(oes, method = "euclidean")

# Generate an average linkage analysis
hc_oes <- hclust(dist_oes, method = "average")

# Create a dendrogram object from the hclust variable
dend_oes <- as.dendrogram(hc_oes)

# Plot the dendrogram
plot(dend_oes)

# Color branches by cluster formed from the cut at a height of 100000
dend_colored <- color_branches(dend_oes, h = 100000)

# Plot the colored dendrogram
plot(dend_colored)


Based on the dendrogram it may be reasonable to start with the three clusters formed at a height of 100,000. The members of these clusters appear to be tightly grouped but different from one another. Let's continue this exploration.

### Hierarchical clustering: Preparing for exploration

We have now created a potential clustering for the oes data, before we can explore these clusters with ggplot2 we will need to process the oes data matrix into a tidy data frame with each occupation assigned its cluster.

# Use rownames_to_column to move the rownames into a column of the data frame
df_oes <- rownames_to_column(as.data.frame(oes), var = 'occupation')

# Create a cluster assignment vector at h = 100,000
cut_oes <- cutree(hc_oes, h = 100000)

# Generate the segmented oes data frame
clust_oes <- mutate(df_oes, cluster = cut_oes)

# Create a tidy data frame by gathering the year and values into two columns
head(
gathered_oes <- gather(data = clust_oes,
key = year,
value = mean_salary,
-occupation, -cluster), 10)

occupation cluster year mean_salary
Management 1 2001 70800
Computer Science 2 2001 60350
Architecture/Engineering 2 2001 56330
Life/Physical/Social Sci. 2 2001 49710
Community Services 3 2001 34190
Legal 1 2001 69030
Education/Training/Library 3 2001 39130
Arts/Design/Entertainment 3 2001 39770
Healthcare Practitioners 2 2001 49930

### Hierarchical clustering: Plotting occupational clusters

We have successfully created all the parts necessary to explore the results of this hierarchical clustering work. We will leverage the named assignment vector cut_oes and the tidy data frame gathered_oes to analyze the resulting clusters.

# View the clustering assignments by sorting the cluster assignment vector
sort(cut_oes)

Management                    1
Legal                         1
                              2
Computer Science              2
Architecture/Engineering      2
Life/Physical/Social Sci.     2
Healthcare Practitioners      2
Community Services            3
Education/Training/Library    3
Arts/Design/Entertainment     3
Healthcare Support            3
Protective Service            3
Food Preparation              3
Grounds Cleaning & Maint.     3
Personal Care                 3
Sales                         3
                              3
Farming/Fishing/Forestry      3
Construction                  3
Installation/Repair/Maint.    3
Production                    3
Transportation/Moving         3
# Plot the relationship between mean_salary and year and color the lines by the assigned cluster
ggplot(gathered_oes, aes(x = year, y = mean_salary, color = factor(cluster))) +
geom_line(aes(group = occupation))


From this work it looks like both Management & Legal professions (cluster 1) experienced the most rapid growth in these 15 years. Let's see what we can get by exploring this data using k-means.

## Reviewing the HC results

### Next steps: k-means clustering

• Evaluate whether pre-processing is necessary
• Estimate the "best" k using the elbow plot
• Estimate the "best" k using the maximum average silhouette width
• Explore resulting clusters

### K-means: Elbow analysis

leverage the k-means elbow plot to propose the "best" number of clusters.

# Use map_dbl to run many models with varying value of k (centers)
tot_withinss <- map_dbl(1:10,  function(k){
model <- kmeans(x = oes, centers = k)
model$tot.withinss
})

# Generate a data frame containing both k and tot_withinss
elbow_df <- data.frame(
k = 1:10,
tot_withinss = tot_withinss
)

# Plot the elbow plot
ggplot(elbow_df, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:10)


### K-means: Average Silhouette Widths

So hierarchical clustering resulted in 3 clusters and the elbow method suggests 2. We will use average silhouette widths to explore what the "best" value of k should be.

# Use map_dbl to run many models with varying value of k
sil_width <- map_dbl(2:10, function(k){
model <- pam(oes, k = k)
model$silinfo$avg.width
})

# Generate a data frame containing both k and sil_width
sil_df <- data.frame(
k = 2:10,
sil_width = sil_width
)

# Plot the relationship between k and sil_width
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:10)


It seems that this analysis results in yet another value of k; this time 7 is the top contender (although 2 comes very close).

### Comparing the two clustering methods

|                          | Hierarchical Clustering       | k-means           |
|--------------------------|-------------------------------|-------------------|
| Distance used            | virtually any                 | Euclidean only    |
| Results stable           | Yes                           | No                |
| Evaluating # of clusters | dendrogram, silhouette, elbow | silhouette, elbow |
| Computational complexity | relatively higher             | relatively lower  |