Cluster Analysis in Python
Exploring unsupervised learning through clustering using the SciPy library in Python
 Overview
 Libraries
 Introduction to Clustering
 Hierarchical Clustering
 K-Means Clustering
 Clustering in the Real World
 Dominant colors in images
 Tools to find dominant colors
 Data frame with RGB values
 Create an elbow plot
 Extract RGB values from image
 Display dominant colors
 Document clustering
 Concepts
 Clean and tokenize data
 TF-IDF (Term Frequency - Inverse Document Frequency)
 Top terms per cluster
 More considerations
 TF-IDF of movie plots
 Top terms in movie clusters
 Clustering with multiple features
Overview
You have probably come across Google News, which automatically groups similar news articles under a topic. Have you ever wondered what process runs in the background to arrive at these groups? We will explore unsupervised learning through clustering using the SciPy library in Python. We will cover preprocessing of data and the application of hierarchical and k-means clustering. We will explore player statistics from a popular football video game, FIFA 18. We will be able to quickly apply various clustering algorithms to data, visualize the clusters formed and analyze the results.
import re
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import (linkage,
                                     fcluster,
                                     dendrogram)
from scipy.cluster.vq import (kmeans,
                              vq,
                              whiten)
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
import matplotlib.image as img
import seaborn as sns
%matplotlib inline
plt.style.use("ggplot")
Introduction to Clustering
Before we are ready to classify news articles, we need to be introduced to the basics of clustering. We will familiarize ourselves with a class of machine learning algorithms called unsupervised learning, and with clustering, one of the popular unsupervised learning techniques. We will explore two popular clustering techniques: hierarchical clustering and k-means clustering. We will conclude with the basic preprocessing steps to carry out before clustering data.
Unsupervised learning: basics
Everyday example: Google News. How does Google News classify articles?
 Unsupervised learning algorithm: clustering
 Match frequent terms in articles to find similarity
What is unsupervised learning?
 A group of machine learning algorithms that find patterns in data
 Data for the algorithms has not been labeled, classified or characterized
 The objective of the algorithm is to interpret any structure in the data
 Common unsupervised learning algorithms: clustering, neural networks, anomaly detection
What is clustering?
 The process of grouping items with similar characteristics
 Items in a group are more similar to each other than to items in other groups
 Example: distance between points on a 2D plane
Plotting data for clustering: Pokemon sightings
x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]
_ = sns.scatterplot(x=x_coordinates, y=y_coordinates)
plt.show()
Visualizing helps in determining how many clusters are in the data.
x_p = [9, 6, 2, 3, 1, 7, 1, 6, 1, 7, 23, 26, 25, 23, 21, 23, 23, 20, 30, 23]
y_p = [8, 4, 10, 6, 0, 4, 10, 10, 6, 1, 29, 25, 30, 29, 29, 30, 25, 27, 26, 30]
_ = sns.scatterplot(x=x_p, y=y_p)
plt.show()
Notice the areas where the sightings are dense. This indicates that there is not one, but two legendary Pokémon out there!
Basics of cluster analysis
What is a cluster?
 A group of items with similar characteristics
 Google News: articles where similar words and word associations appear together
 Customer segments
Clustering algorithms
 Hierarchical clustering
 K-means clustering
 Other clustering algorithms: DBSCAN, Gaussian methods
Hierarchical clustering in SciPy
x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4, 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4, 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
df_c = pd.DataFrame({'x_cood':x_coordinates, 'y_cood':y_coordinates})
df_c.head()
Z_c = linkage(df_c, method="ward")
df_c['cluster_labels'] = fcluster(Z_c, 3, criterion="maxclust")
_ = sns.scatterplot(data=df_c, x="x_cood", y="y_cood", hue="cluster_labels", palette="RdGy")
plt.show()
df_c = pd.DataFrame({'x_cood':x_coordinates, 'y_cood':y_coordinates})
centroids_c, _ = kmeans(df_c, 3)
df_c["cluster_labels"], _ = vq(df_c, centroids_c)
_ = sns.scatterplot(data=df_c, x="x_cood", y="y_cood", hue="cluster_labels", palette="RdGy")
plt.show()
Pokémon sightings: hierarchical clustering
We are going to continue the investigation into the sightings of legendary Pokémon. In the scatter plot we identified two areas where Pokémon sightings were dense. This means that the points seem to separate into two clusters. We will form two clusters of the sightings using hierarchical clustering.
df_p = pd.DataFrame({'x':x_p, 'y':y_p})
df_p.head()
'x' and 'y' are columns of X and Y coordinates of the locations of sightings, stored in a pandas data frame.
# Use the linkage() function to compute distance
Z_p = linkage(df_p, 'ward')
# Generate cluster labels for each data point with two clusters
df_p['cluster_labels'] = fcluster(Z_p, 2, criterion='maxclust')
# Plot the points with seaborn
sns.scatterplot(x="x", y="y", hue="cluster_labels", data=df_p)
plt.show()
The resulting plot has an extra cluster labelled 0 in the legend.
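One way to avoid the spurious 0 entry (a sketch, assuming a recent seaborn version) is to cast the integer labels to strings so that seaborn treats the hue as categorical:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative data frame with integer cluster labels, as produced by fcluster
df_demo = pd.DataFrame({'x': [9, 6, 23, 26], 'y': [8, 4, 29, 25],
                        'cluster_labels': [1, 1, 2, 2]})

# Casting the labels to strings makes the hue categorical, so the legend
# shows only the labels that actually occur
df_demo['cluster_labels'] = df_demo['cluster_labels'].astype(str)
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df_demo)
plt.show()
```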
Pokémon sightings: kmeans clustering
We are going to continue the investigation into the sightings of legendary Pokémon, this time forming clusters of the sightings using k-means clustering.
'x' and 'y' are columns of X and Y coordinates of the locations of sightings, stored in a pandas data frame.
df_p.dtypes
# kmeans() requires the observations to be of type float
df_p = df_p.apply(lambda x: x.astype("float"))
# Compute cluster centers
centroids_p, _ = kmeans(df_p, 2)
# Assign cluster labels to each data point
df_p['cluster_labels'], _ = vq(df_p, centroids_p)
# Plot the points with seaborn
sns.scatterplot(x="x", y="y", hue="cluster_labels", data=df_p)
plt.show()
Data preparation for cluster analysis
Why do we need to prepare data for clustering?
 Variables have incomparable units (product dimensions in cm, price in $)
 Variables with the same units have vastly different scales and variances (expenditures on cereals, travel)
 Data in raw form may lead to bias in clustering
 Clusters may be heavily dependent on one variable
 Solution: normalization of individual variables
Normalization of data
 Normalization: the process of rescaling data to a standard deviation of 1
x_new = x / std_dev(x)
data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]
scaled_data = whiten(data)
scaled_data
_ = sns.lineplot(x=range(len(data)), y=data, label="original")
_ = sns.lineplot(x=range(len(data)), y=scaled_data, label='scaled')
plt.show()
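The rescaling that whiten() performs can be checked by hand; a minimal sketch using the sample data above:

```python
import numpy as np
from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

# whiten() divides each feature by its standard deviation,
# so the rescaled data has a standard deviation of 1
manual = np.array(data) / np.std(data)
assert np.allclose(whiten(data), manual)
print(np.std(whiten(data)))  # → 1.0 (up to floating point)
```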
goals_for = [4,3,2,3,1,1,2,0,1,4]
# Use the whiten() function to standardize the data
scaled_goals_for = whiten(goals_for)
scaled_goals_for
Notice that the scaled values have less variation in them.
_ = sns.lineplot(x=range(len(goals_for)), y=goals_for, label="original")
_ = sns.lineplot(x=range(len(goals_for)), y=scaled_goals_for, label="scaled")
plt.show()
The plot shows the reduced variation in the scaled values.
# Prepare data
rate_cuts = [0.0025, 0.001, 0.0005, 0.001, 0.0005, 0.0025, 0.001, 0.0015, 0.001, 0.0005]
# Use the whiten() function to standardize the data
scaled_rate_cuts = whiten(rate_cuts)
# Plot original data
plt.plot(rate_cuts, label='original')
# Plot scaled data
plt.plot(scaled_rate_cuts, label='scaled')
plt.legend()
plt.show()
The original values are so small that they appear negligible compared to the scaled data.
fifa = pd.read_csv("datasets/fifa.csv")
fifa.head()
We will work with two columns: eur_wage, the wage of a player in Euros, and eur_value, their current transfer market value.
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])
# Plot the two columns in a scatter plot
fifa.plot(x="scaled_wage", y="scaled_value", kind='scatter')
plt.show()
# Check mean and standard deviation of scaled values
fifa[['scaled_wage', 'scaled_value']].describe()
The scaled values have a standard deviation of 1.
Hierarchical Clustering
We will focus on a popular clustering algorithm, hierarchical clustering, and its implementation in SciPy. In addition to the procedure for performing hierarchical clustering, this chapter attempts to answer an important question: how many clusters are present in your data? We will conclude with a discussion of the limitations of hierarchical clustering and considerations to keep in mind while using it.
Basics of hierarchical clustering
Creating a distance matrix using linkage
scipy.cluster.hierarchy.linkage(observations, method='single', metric='euclidean', optimal_ordering=False)
 method: how to calculate the proximity of clusters
 metric: distance metric
 optimal_ordering: order data points
Which method should we use?
 single: based on the two closest objects
 complete: based on the two farthest objects
 average: based on the arithmetic mean of all objects
 centroid: based on the geometric mean of all objects
 median: based on the median of all objects
 ward: based on the sum of squares
Create cluster labels with fcluster
scipy.cluster.hierarchy.fcluster(distance_matrix, num_clusters, criterion)
 distance_matrix: output of the linkage() method
 num_clusters: number of clusters
 criterion: how to decide thresholds to form clusters
Final thoughts on selecting a method
 No one right method for all problems
 Need to carefully understand the distribution of the data
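To see how the choice of method plays out, here is a small sketch on made-up, well-separated data (illustrative values only):

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data with two well-separated groups (illustrative values)
df = pd.DataFrame({'x': [1.0, 1.2, 0.8, 8.0, 8.3, 7.9],
                   'y': [1.1, 0.9, 1.0, 8.1, 7.8, 8.2]})

# The same data clustered with two different proximity methods
for method in ['single', 'ward']:
    Z = linkage(df, method=method)
    labels = fcluster(Z, 2, criterion='maxclust')
    print(method, labels)
```

On clearly separated data both methods agree; they tend to diverge on noisy or elongated clusters, where 'single' is prone to chaining points together.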
Hierarchical clustering: ward method
It is time for Comic Con! Comic Con is an annual comic-based convention held in major cities of the world. We have the data of last year's footfall, the number of people at the convention ground at a given time. We would like to decide the location of a stall to maximize sales. Using the ward method, we'll apply hierarchical clustering to find the two points of attraction in the area.
comic_con = pd.read_csv("datasets/comic_con.csv")
comic_con.head()
# Use the linkage() function
distance_matrix_cc = linkage(comic_con[['x_scaled', 'y_scaled']], method = "ward", metric = 'euclidean')
# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix_cc, 2, criterion='maxclust')
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)
plt.show()
The two clusters correspond to the points of attraction in the figure: towards the bottom (a stage) and the top right (an interesting stall).
# Use the linkage() function
distance_matrix_cc = linkage(comic_con[['x_scaled', 'y_scaled']], method = "single", metric = "euclidean")
# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix_cc, 2, criterion="maxclust")
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)
plt.show()
The clusters formed are no different from the ones created using the ward method.
# Use the linkage() function
distance_matrix_cc = linkage(comic_con[['x_scaled', 'y_scaled']], method="complete")
# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix_cc, 2, criterion="maxclust")
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)
plt.show()
Coincidentally, the clusters formed are no different from those created with the ward or single methods.
Visualize clusters
Why visualize clusters?
 Try to make sense of the clusters formed
 An additional step in validation of clusters
 Spot trends in data
An introduction to seaborn
seaborn: a Python data visualization library based on matplotlib
 Has better, easily modifiable aesthetics than matplotlib!
 Contains functions that make data visualization tasks easy in the context of data analytics
 Use case for clustering: the hue parameter for plots
# Plot a scatter plot
comic_con.plot.scatter(x='x_scaled',
                       y='y_scaled',
                       c='cluster_labels')
plt.show()
# Plot a scatter plot using seaborn
sns.scatterplot(x='x_scaled',
                y='y_scaled',
                hue='cluster_labels',
                data=comic_con)
plt.show()
# Create a dendrogram
dn_cc = dendrogram(distance_matrix_cc)
# Display the dendrogram
plt.show()
The top two clusters are farthest away from each other.
points_s = 100
df_s = pd.DataFrame({'x': np.random.sample(points_s),
                     'y': np.random.sample(points_s)})
%timeit linkage(df_s[['x', 'y']], method='ward', metric='euclidean')
%timeit linkage(comic_con[['x_scaled', 'y_scaled']], method="complete", metric='euclidean')
FIFA 18: exploring defenders
In the FIFA 18 dataset, various attributes of players are present. Two such attributes are:
 sliding tackle: a number between 0-99 which signifies how accurately a player is able to perform sliding tackles
 aggression: a number between 0-99 which signifies the commitment and will of a player
These attributes are typically high in defense-minded players. We will perform clustering based on these attributes in the data.
fifa[['sliding_tackle', 'aggression']].head()
fifa['scaled_sliding_tackle'] = whiten(fifa.sliding_tackle)
fifa['scaled_aggression'] = whiten(fifa.aggression)
# Fit the data into a hierarchical clustering algorithm
distance_matrix_f = linkage(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 'ward')
# Assign cluster labels to each row of data
fifa['cluster_labels'] = fcluster(distance_matrix_f, 3, criterion='maxclust')
# Display cluster centers of each cluster
fifa[['scaled_sliding_tackle', 'scaled_aggression', 'cluster_labels']].groupby('cluster_labels').mean()
# Create a scatter plot through seaborn
sns.scatterplot(x="scaled_sliding_tackle", y="scaled_aggression", hue="cluster_labels", data=fifa)
plt.show()
K-Means Clustering
We now explore a different clustering algorithm, k-means clustering, and its implementation in SciPy. K-means clustering overcomes the biggest drawback of hierarchical clustering. As dendrograms are specific to hierarchical clustering, we will discuss one method to find the number of clusters before running k-means clustering. We will conclude with a discussion of the limitations of k-means clustering and considerations while using this algorithm.
Basics of k-means clustering
Why k-means clustering?
 A critical drawback of hierarchical clustering: runtime
 K-means runs significantly faster on large datasets
Step 1: Generate cluster centers
kmeans(obs, k_or_guess, iter, thresh, check_finite)
 obs: standardized observations
 k_or_guess: number of clusters
 iter: number of iterations (default: 20)
 thresh: threshold (default: 1e-05)
 check_finite: whether to check if observations contain only finite numbers (default: True)
 Returns two objects: cluster centers and distortion
Step 2: Generate cluster labels
vq(obs, code_book, check_finite=True)
 obs: standardized observations
 code_book: cluster centers
 check_finite: whether to check if observations contain only finite numbers (default: True)
 Returns two objects: a list of cluster labels and a list of distortions
A note on distortions
 kmeans() returns a single value of distortion
 vq() returns a list of distortions
Running k-means
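The note on distortions can be verified directly; a minimal sketch on random data:

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

np.random.seed(0)
obs = np.random.rand(100, 2)

# kmeans() reports a single distortion: the mean Euclidean distance
# between the observations and their nearest cluster centers
centers, distortion = kmeans(obs, 3)

# vq() returns one distance per observation
labels, distances = vq(obs, centers)

# The single kmeans() distortion equals the mean of the per-point distances
print(np.isclose(distortion, distances.mean()))
```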
K-means clustering
Let us use the Comic Con dataset and check how k-means clustering works on it.
The two steps of k-means clustering:
 Define cluster centers through the kmeans() function. It has two required arguments: observations and number of clusters.
 Assign cluster labels through the vq() function. It has two required arguments: observations and cluster centers.
# Generate cluster centers
cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
# Assign cluster labels
comic_con['cluster_labels'], distortion_list_cc = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers_cc)
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)
plt.show()
%timeit kmeans(fifa[['scaled_sliding_tackle', 'scaled_aggression']], 3)
It took about 5 seconds to run hierarchical clustering on this data, but only around 50 milliseconds to run k-means clustering.
How many clusters?
How to find the right k?
 No absolute method to find the right number of clusters (k) in k-means clustering
 Elbow method
Distortions revisited
 Distortion: sum of squared distances of points from cluster centers
 Decreases with an increasing number of clusters
 Becomes zero when the number of clusters equals the number of points
Elbow method
 Elbow plot: a line plot of the number of clusters against distortion
 The elbow plot helps indicate the number of clusters present in the data
Final thoughts on using the elbow method
 Only gives an indication of the optimal k (number of clusters)
 Does not always pinpoint the optimal k
 Other methods: average silhouette and gap statistic
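As an example of the average silhouette approach mentioned above, a sketch using scikit-learn's silhouette_score on made-up data (the blob placement is an assumption for illustration):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq
from sklearn.metrics import silhouette_score

np.random.seed(0)
# Two well-separated blobs of illustrative points
obs = np.vstack([np.random.rand(50, 2), np.random.rand(50, 2) + 5])

# Average silhouette ranges from -1 to 1; higher is better
scores = {}
for k in range(2, 5):
    centers, _ = kmeans(obs, k)
    labels, _ = vq(obs, centers)
    scores[k] = silhouette_score(obs, labels)
    print(k, round(scores[k], 2))
```

For data like this, k=2 typically scores highest, agreeing with the visual grouping.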
distortions_cc = []
num_clusters_cc = range(1, 7)
# Create a list of distortions from the kmeans function
for i in num_clusters_cc:
    cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], i)
    distortions_cc.append(distortion_cc)
# Create a data frame with two lists: num_clusters and distortions
elbow_plot_cc = pd.DataFrame({'num_clusters': num_clusters_cc, 'distortions': distortions_cc})
# Create a line plot of num_clusters and distortions
sns.lineplot(x="num_clusters", y="distortions", data=elbow_plot_cc)
plt.xticks(num_clusters_cc)
plt.show()
uniform_data = pd.read_csv("datasets/uniform_data.csv")
distortions_u = []
num_clusters_u = range(2, 7)
# Create a list of distortions from the kmeans function
for i in num_clusters_u:
    cluster_centers_u, distortion_u = kmeans(uniform_data[['x_scaled', 'y_scaled']], i)
    distortions_u.append(distortion_u)
# Create a data frame with two lists: number of clusters and distortions
elbow_plot_u = pd.DataFrame({'num_clusters': num_clusters_u, 'distortions': distortions_u})
# Create a line plot of num_clusters and distortions
sns.lineplot(x="num_clusters", y="distortions", data=elbow_plot_u)
plt.xticks(num_clusters_u)
plt.show()
There is no well-defined elbow in this plot!
# Initialize seed
np.random.seed(0)
# Run kmeans clustering
cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
comic_con['cluster_labels'], distortion_list_cc = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers_cc)
# Plot the scatterplot
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)
plt.show()
# Initialize seed
np.random.seed([1,2,1000])
# Run kmeans clustering
cluster_centers_cc, distortion_cc = kmeans(comic_con[['x_scaled', 'y_scaled']], 2)
comic_con['cluster_labels'], distortion_list_cc = vq(comic_con[['x_scaled', 'y_scaled']], cluster_centers_cc)
# Plot the scatterplot
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)
plt.show()
Uniform clustering patterns
Let us look at the bias in k-means clustering towards the formation of uniform clusters, using a mouse-like dataset. A mouse-like dataset is a group of points that resembles the head of a mouse: it has three clusters of points arranged in circles, one for the face and two for the ears.
mouse = pd.read_csv("datasets/mouse.csv")
# Generate cluster centers
cluster_centers_m, distortion_m = kmeans(mouse[['x_scaled', 'y_scaled']], 3)
# Assign cluster labels
mouse['cluster_labels'], distortion_list_m = vq(mouse[['x_scaled', 'y_scaled']], cluster_centers_m)
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=mouse)
plt.show()
K-means is unable to capture the three visible clusters clearly, and the two clusters towards the top have taken in some points along the boundary. This happens due to the underlying assumption of the k-means algorithm, which minimizes distortion and thus leads to clusters that are similar in terms of area.
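As a contrast, a density-based algorithm such as DBSCAN (mentioned earlier among other clustering algorithms) does not share this bias. A sketch on synthetic mouse-like data; the point layout and DBSCAN parameters are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def disk(cx, cy, r, step=0.2):
    # Grid of points inside a circle of radius r centered at (cx, cy)
    xs, ys = np.meshgrid(np.arange(-r, r + step, step), np.arange(-r, r + step, step))
    mask = xs ** 2 + ys ** 2 <= r ** 2
    return np.c_[xs[mask] + cx, ys[mask] + cy]

# A mouse-like arrangement: a large face and two smaller ears
points = np.vstack([disk(0, 0, 3), disk(-3, 3.5, 1), disk(3, 3.5, 1)])

# DBSCAN groups points by density, so cluster sizes can differ freely
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
print(len(set(labels) - {-1}))  # → 3 clusters (-1 would mark noise)
```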
FIFA 18: defenders revisited
In the FIFA 18 dataset, various attributes of players are present. Two such attributes are:
 defending: a number which signifies the defending attributes of a player
 physical: a number which signifies the physical attributes of a player
These attributes are typically high in defense-minded players. We will perform clustering based on these attributes in the data.
fifa = pd.read_csv("datasets/fifa2.csv")
# Set up a random seed in numpy
np.random.seed([1000, 2000])
# Fit the data into a kmeans algorithm
cluster_centers,_ = kmeans(fifa[['scaled_def', 'scaled_phy']], 3)
# Assign cluster labels
fifa['cluster_labels'],_ = vq(fifa[['scaled_def', 'scaled_phy']], cluster_centers)
# Display cluster centers
fifa[['scaled_def', 'scaled_phy', 'cluster_labels']].groupby('cluster_labels').mean()
# Create a scatter plot through seaborn
sns.scatterplot(x="scaled_def", y="scaled_phy", hue="cluster_labels", data=fifa)
plt.show()
The seed has an impact on clustering, as the data is uniformly distributed.
Clustering in the Real World
We now apply our clustering knowledge to real-world problems. We will explore the process of finding dominant colors in an image, before moving on to the problem of clustering news articles. We will conclude with a discussion of clustering with multiple variables, which makes it difficult to visualize all the data.
Dominant colors in images
 All images consist of pixels
 Each pixel has three values: Red, Green and Blue
 Pixel color: a combination of these RGB values
 Perform k-means on standardized RGB values to find cluster centers
 Uses: identifying features in satellite images
Tools to find dominant colors
 Convert image to pixels: matplotlib.image.imread
 Display colors of cluster centers: matplotlib.pyplot.imshow
image = img.imread("datasets/sea.jpg")
image.shape
image[0][:1]
plt.imshow(image)
r = []
g = []
b = []
for row in image:
    for pixel in row:
        temp_r, temp_g, temp_b = pixel
        r.append(temp_r)
        g.append(temp_g)
        b.append(temp_b)
pixels = pd.DataFrame({'red':r, 'green':g, 'blue':b})
pixels.head()
pixels[['scaled_red', 'scaled_blue', 'scaled_green']] = pixels[['red', 'blue', 'green']].apply(lambda x: x/np.std(x)*255)
distortions_i = []
num_clusters_i = range(1, 7)
for i in num_clusters_i:
    cluster_centers_i, distortion_i = kmeans(pixels[['scaled_red', 'scaled_blue', 'scaled_green']], i)
    distortions_i.append(distortion_i)
elbow_plot_i = pd.DataFrame({'num_clusters':num_clusters_i, 'distortions':distortions_i})
_ = sns.lineplot(data=elbow_plot_i, x="num_clusters", y='distortions')
plt.xticks(num_clusters_i)
plt.show()
batman_df = pd.read_csv("datasets/batman.csv")
batman_df.head()
distortions = []
num_clusters = range(1, 7)
# Create a list of distortions from the kmeans function
for i in num_clusters:
    cluster_centers, distortion = kmeans(batman_df[['scaled_red', 'scaled_blue', 'scaled_green']], i)
    distortions.append(distortion)
# Create a data frame with two lists, num_clusters and distortions
elbow_plot = pd.DataFrame({'num_clusters':num_clusters, 'distortions':distortions})
# Create a line plot of num_clusters and distortions
sns.lineplot(x="num_clusters", y="distortions", data = elbow_plot)
plt.xticks(num_clusters)
plt.show()
There are three distinct colors present in the image, which is supported by the elbow plot.
# Get standard deviations of each color
r_std, g_std, b_std = batman_df[['red', 'green', 'blue']].std()
colors = []
for cluster_center in cluster_centers:
    scaled_r, scaled_g, scaled_b = cluster_center
    # Convert each standardized value to a 0-1 scaled value
    colors.append((
        scaled_r * r_std / 255,
        scaled_g * g_std / 255,
        scaled_b * b_std / 255
    ))
# Display colors of cluster centers
plt.imshow([colors])
plt.show()
Document clustering
Concepts
 1. Clean data before processing
 2. Determine the importance of the terms in a document (in a TF-IDF matrix)
 3. Cluster the TF-IDF matrix
 4. Find the top terms and documents in each cluster
Clean and tokenize data
 Convert text into smaller parts called tokens, clean data for processing
TF-IDF (Term Frequency - Inverse Document Frequency)
 A weighted measure: evaluates how important a word is to a document in a collection
Clustering with a sparse matrix
 kmeans() in SciPy does not support sparse matrices
 Use .todense() to convert to a matrix:
   cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)
Top terms per cluster
 Cluster centers: lists with a size equal to the number of terms
 Each value in the cluster center is its importance
 Create a dictionary and print the top terms
More considerations
 Work with hyperlinks, emoticons etc.
 Normalize words (run, ran, running -> run)
 .todense() may not work with large datasets
def remove_noise(text, stop_words = []):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        token = re.sub('[^A-Za-z0-9]+', '', token)
        if len(token) > 1 and token.lower() not in stop_words:
            # Get lowercase
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
remove_noise("It is lovely weather we are having. I hope the weather continues.")
TFIDF of movie plots
Let us use the plots of randomly selected movies to perform document clustering. Before clustering documents, they need to be cleaned of any unwanted noise (such as special characters and stop words) and converted into a sparse matrix through TF-IDF of the documents.
stop_words_2 = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'youre', 'youve', 'youll', 'youd', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'shes', 'her', 'hers', 'herself', 'it', 'its', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'thatll', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'dont', 'should', 'shouldve', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'arent', 'couldn', 'couldnt', 'didn', 'didnt', 'doesn', 'doesnt', 'hadn', 'hadnt', 'hasn', 'hasnt', 'haven', 'havent', 'isn', 'isnt', 'ma', 'mightn', 'mightnt', 'mustn', 'mustnt', 'needn', 'neednt', 'shan', 'shant', 'shouldn', 'shouldnt', 'wasn', 'wasnt', 'weren', 'werent', 'won', 'wont', 'wouldn', 'wouldnt']
remove_noise("It is lovely weather we are having. I hope the weather continues.", stop_words=stop_words_2)
def remove_noise(text, stop_words = stop_words_2):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        token = re.sub('[^A-Za-z0-9]+', '', token)
        if len(token) > 1 and token.lower() not in stop_words:
            # Get lowercase
            cleaned_tokens.append(token.lower())
    return cleaned_tokens
plots = pd.read_csv("datasets/plots.csv")['0'].to_list()
plots[0][:10]
# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=.1, max_df=.75, max_features=50, tokenizer=remove_noise)
# Use the .fit_transform() method on the list plots
tfidf_matrix = tfidf_vectorizer.fit_transform(plots)
num_clusters = 2
# Generate cluster centers through the kmeans function
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)
# Generate terms from the tfidf_vectorizer object
terms = tfidf_vectorizer.get_feature_names_out()
for i in range(num_clusters):
    # Sort the terms and print the top 3 terms
    center_terms = dict(zip(terms, cluster_centers[i]))
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    print(sorted_terms[:3])
fifa = pd.read_csv("datasets/fifa3.csv")
# Print the size of the clusters
fifa.groupby("cluster_labels")['ID'].count()
# Print the mean value of wages in each cluster
fifa.groupby(["cluster_labels"])['eur_wage'].mean()
The cluster sizes are not very different, and there are no significant differences in the wages. Further analysis is required to validate these clusters.
features = ['pac', 'sho', 'pas', 'dri', 'def', 'phy']
scaled_features = ['scaled_pac', 'scaled_sho', 'scaled_pas',
                   'scaled_dri', 'scaled_def', 'scaled_phy']
fifa[scaled_features] = fifa[features].apply(lambda x: whiten(x))
# Create centroids with kmeans for 2 clusters
cluster_centers,_ = kmeans(fifa[scaled_features], 2)
# Assign cluster labels and print cluster centers
fifa['cluster_labels'], _ = vq(fifa[scaled_features], cluster_centers)
fifa.groupby("cluster_labels")[scaled_features].mean()
# Plot cluster centers to visualize clusters
fifa.groupby('cluster_labels')[scaled_features].mean().plot(legend=True, kind="bar")
plt.show()
# Get the name column of first 5 players in each cluster
for cluster in fifa['cluster_labels'].unique():
    print(cluster, fifa[fifa['cluster_labels'] == cluster]['name'].values[:5])
The top players in each cluster are representative of the overall characteristics of the cluster: one cluster primarily represents attackers, whereas the other represents defenders. Surprisingly, a top goalkeeper, Manuel Neuer, is seen in the attackers group; he is known for coming out of the box and participating in open play, which is reflected in his FIFA 18 attributes.