# Unsupervised Learning in Python

Unsupervised learning encompasses a variety of techniques in machine learning, from clustering to dimension reduction to matrix factorization. In this blog, we'll explore the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and scipy.

- Overview
- Libraries
- Clustering for dataset exploration
- Unsupervised Learning
- Evaluating a clustering
- Iris: clusters vs species
- k-means found 3 clusters amongst the iris samples
- Cross tabulation with pandas
- Crosstab of labels and species
- Measuring clustering quality
- Inertia measures clustering quality
- The number of clusters
- How many clusters to choose?
- How many clusters of grain?
- Evaluating the grain clustering

- Transforming features for better clusterings

- Visualization with hierarchical clustering and t-SNE
- Decorrelating the data and dimension reduction
- Visualizing the PCA transformation
- Intrinsic dimension
- Intrinsic dimension of a flight path
- Intrinsic dimension
- Versicolor dataset
- PCA identifies intrinsic dimension
- PCA of the versicolor samples
- PCA features are ordered by variance descending
- Variance and intrinsic dimension
- Plotting the variances of PCA features
- Intrinsic dimension can be ambiguous
- The first principal component
- Variance of the PCA features

- Dimension reduction with PCA
- Dimension reduction
- Dimension reduction with PCA
- Dimension reduction with PCA
- Dimension reduction of iris dataset
- Dimension reduction with PCA
- Word frequency arrays
- Sparse arrays and csr_matrix
- TruncatedSVD and csr_matrix
- Dimension reduction of the fish measurements
- A tf-idf word-frequency array
- Clustering Wikipedia part I
- Clustering Wikipedia part II

- Discovering interpretable features

## Overview

Say we have a collection of customers with a variety of characteristics such as age, location, and financial history, and we wish to discover patterns and sort them into clusters. Or perhaps we have a set of texts, such as Wikipedia pages, and we wish to segment them into categories based on their content. This is the world of unsupervised learning, called as such because we are not guiding, or supervising, the pattern discovery with some prediction task, but instead uncovering hidden structure from unlabeled data. Unsupervised learning encompasses a variety of techniques in machine learning, from clustering to dimension reduction to matrix factorization. We'll explore the fundamentals of unsupervised learning and implement the essential algorithms using scikit-learn and scipy. We will explore how to cluster, transform, visualize, and extract insights from unlabeled datasets, and end the session by building a recommender system to recommend popular musical artists.

```
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import (StandardScaler,
                                   Normalizer,
                                   normalize,
                                   MaxAbsScaler)
from sklearn.pipeline import make_pipeline
from sklearn.manifold import TSNE
from sklearn.decomposition import (PCA,
                                   TruncatedSVD,
                                   NMF)
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import (linkage,
                                     dendrogram,
                                     fcluster)
from scipy.stats import pearsonr
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use("ggplot")
import warnings
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")
warnings.filterwarnings("ignore", message="invalid value encountered in sign")
```

## Unsupervised Learning

- Unsupervised learning finds patterns in data
- E.g. clustering customers by their purchases
- Compressing the data using purchase patterns (dimension reduction)

## Supervised vs unsupervised learning

- Supervised learning finds patterns for a prediction task
- E.g. classify tumors as benign or cancerous (labels)
- Unsupervised learning finds patterns in data
- ... but without a specific prediction task in mind

## Iris dataset

- Measurements of many iris plants ^{1}
- 3 species of iris: setosa, versicolor, virginica
- Petal length, petal width, sepal length, sepal width (the features of the dataset)

### Iris data is 4-dimensional

- Iris samples are points in 4-dimensional space
- Dimension = number of features
- Dimension too high to visualize!
- ... but unsupervised learning gives insight

## k-means clustering

- Finds clusters of samples
- Number of clusters must be specified
- Implemented in sklearn ("scikit-learn")

## Cluster labels for new samples

- New samples can be assigned to existing clusters
- k-means remembers the mean of each cluster (the "centroids")
- Finds the nearest centroid to each new sample

## Scatter plots

- Scatter plot of sepal length vs petal length
- Each point represents an iris sample
- Color points by cluster labels
- Made with PyPlot (matplotlib.pyplot)

```
iris = datasets.load_iris()
iris.keys()
```

```
samples = iris.data
samples[:5]
```

```
model = KMeans(n_clusters=3)
model.fit(samples)
```

```
labels = model.predict([[5.8, 4. , 1.2, 0.2]])
labels
```

```
new_samples = [[5.7, 4.4, 1.5, 0.4], [6.5, 3.0, 5.5, 1.8], [5.8, 2.7, 5.1, 1.9]]
model.predict(new_samples)
```

```
labels_iris = model.predict(samples)
```

```
xs_iris = samples[:,0]
ys_iris = samples[:,2]
_ = sns.scatterplot(x=xs_iris, y=ys_iris, hue=labels_iris)
plt.show()
```

```
points = pd.read_csv("datasets/points.csv").values
points[:5]
```

```
xs_points = points[:,0]
ys_points = points[:,1]
_ = sns.scatterplot(xs_points, ys_points)
plt.show()
```

There are three clusters

```
new_points = pd.read_csv("datasets/new_points.csv").values
new_points[:5]
```

```
# Create a KMeans instance with 3 clusters: model
model_points = KMeans(n_clusters=3)
# Fit model to points
model_points.fit(points)
# Determine the cluster labels of new_points: labels
labels_points = model_points.predict(new_points)
# Print cluster labels of new_points
print(labels_points)
```

We've successfully performed k-Means clustering and predicted the labels of new points. But it is not easy to inspect the clustering by just looking at the printed labels. A visualization would be far more useful. We'll inspect the clustering with a scatter plot!

```
# Assign the columns of new_points: xs and ys
xs_np = new_points[:,0]
ys_np = new_points[:,1]
# Make a scatter plot of xs and ys, using labels to define the colors
_ = plt.scatter(xs_np, ys_np, c=labels_points, alpha=.5)
# Assign the cluster centers: centroids
centroids_p = model_points.cluster_centers_
# Assign the columns of centroids: centroids_x, centroids_y
centroids_x_p = centroids_p[:,0]
centroids_y_p = centroids_p[:,1]
# Make a scatter plot of centroids_x and centroids_y
_ = plt.scatter(centroids_x_p, centroids_y_p, marker="D", s=50)
plt.show()
```

The clustering looks great! But how can we be sure that 3 clusters is the correct choice? In other words, how can we evaluate the quality of a clustering?

## Evaluating a clustering

- Can check correspondence with e.g. iris species
- ... but what if there are no species to check against?
- Measure quality of a clustering
- Informs choice of how many clusters to look for

## Cross tabulation with pandas

- Clusters vs species is a "cross-tabulation"

```
iris_ct = pd.DataFrame({'labels':labels_iris, 'species':iris.target})
iris_ct.head()
```

```
np.unique(iris.target)
```

```
iris_ct.species.unique()
```

```
iris.target_names
```

```
iris_ct['species'] = iris_ct.species.map({0:'setosa', 1:'versicolor', 2:'virginica'})
iris_ct.head()
```

```
pd.crosstab(iris_ct.labels, iris_ct.species)
```

### Measuring clustering quality

- Using only samples and their cluster labels
- A good clustering has tight clusters
- ... and samples in each cluster bunched together

## Inertia measures clustering quality

- Measures how spread out the clusters are (lower is better)
- Distance from each sample to centroid of its cluster
- After `fit()`, available as the attribute `inertia_`
- k-means attempts to minimize the inertia when choosing clusters

```
model.inertia_
```

```
samples_grain = pd.read_csv("datasets/samples_grain.csv").values
samples_grain[:5]
```

An array `samples_grain` contains the measurements (such as area, perimeter, length, and several others) of samples of grain. What's a good number of clusters in this case?

```
ks_grain = range(1, 6)
inertias_grain = []
for k in ks_grain:
    # Create a KMeans instance with k clusters: model_grain
    model_grain = KMeans(n_clusters=k)
    # Fit model to samples
    model_grain.fit(samples_grain)
    # Append the inertia to the list of inertias
    inertias_grain.append(model_grain.inertia_)

# Plot ks vs inertias
plt.plot(ks_grain, inertias_grain, '-o')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks_grain)
plt.show()
```

The inertia decreases very slowly from 3 clusters to 4, so it looks like 3 clusters would be a good choice for this data.

```
varieties = pd.read_csv("datasets/varieties.csv")["0"].to_list()
varieties[:5]
```

A list `varieties` gives the grain variety for each sample.

```
# Create a KMeans model with 3 clusters: model
model_grain = KMeans(n_clusters=3)
# Use fit_predict to fit model and obtain cluster labels: labels
labels_grain = model_grain.fit_predict(samples_grain)
# Create a DataFrame with labels and varieties as columns: df
grain_df = pd.DataFrame({'labels': labels_grain, 'varieties': varieties})
# Create crosstab: ct
ct_grain = pd.crosstab(grain_df.labels, grain_df.varieties)
# Display ct
ct_grain
```

The cross-tabulation shows that the 3 varieties of grain separate really well into 3 clusters. But depending on the type of data you are working with, the clustering may not always be this good. Is there anything we can do in such situations to improve the clustering?

## Transforming features for better clusterings

## Piedmont wines dataset

^{2}

- 178 samples from 3 distinct varieties of red wine: Barolo, Grignolino and Barbera
- Features measure chemical composition e.g. alcohol content
- ... also visual properties like "color intensity"

### Feature variances

- The wine features have very different variances!
- Variance of a feature measures spread of its values

## StandardScaler

- In k-means: feature variance = feature influence
- StandardScaler transforms each feature to have mean 0 and variance 1
- Features are said to be "standardized"

### Similar methods

- `StandardScaler` and `KMeans` have similar methods
- Use `fit()` / `transform()` with `StandardScaler`
- Use `fit()` / `predict()` with `KMeans`

### `StandardScaler`, then `KMeans`

- Need to perform two steps: `StandardScaler`, then `KMeans`
- Use a sklearn pipeline to combine multiple steps
- Data flows from one step into the next

## sklearn preprocessing steps

- `StandardScaler` is a "preprocessing" step
- `MaxAbsScaler` and `Normalizer` are other examples
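As a quick sanity check, here is a minimal sketch (the numbers are made up) of what `StandardScaler` does to each feature:

```
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features with very different variances (made-up values)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

scaler_toy = StandardScaler()
X_toy_scaled = scaler_toy.fit_transform(X_toy)  # learn per-feature mean/std, then standardize

print(X_toy_scaled.mean(axis=0))  # each feature now has mean ~0
print(X_toy_scaled.std(axis=0))   # ...and standard deviation 1
```

After standardizing, both features contribute equally to the k-means distance computations.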

```
samples_fish = pd.read_csv("datasets/samples_fish.csv").values
samples_fish[:5]
```

An array `samples_fish` ^{3} gives measurements of fish. Each row represents an individual fish. The measurements, such as weight in grams, length in centimeters, and the percentage ratio of height to length, have very different scales. In order to cluster this data effectively, we'll need to standardize these features first. We'll build a pipeline to standardize and cluster the data.

```
# Create scaler: scaler_fish
scaler_fish = StandardScaler()
# Create KMeans instance: kmeans_fish
kmeans_fish = KMeans(n_clusters=4)
# Create pipeline: pipeline_fish
pipeline_fish = make_pipeline(scaler_fish, kmeans_fish)
```

Now that we've built the pipeline, we'll use it to cluster the fish by their measurements.

```
species_fish = pd.read_csv("datasets/species_fish.csv")["0"].to_list()
species_fish[:5]
```

We'll now use the standardization and clustering pipeline to cluster the fish by their measurements, and then create a cross-tabulation to compare the cluster labels with the fish species.

```
# Fit the pipeline to samples
pipeline_fish.fit(samples_fish)
# Calculate the cluster labels: labels_fish
labels_fish = pipeline_fish.predict(samples_fish)
# Create a DataFrame with labels and species as columns: df
fish_df = pd.DataFrame({'labels':labels_fish, 'species':species_fish})
# Create crosstab: ct
ct_fish = pd.crosstab(fish_df.labels, fish_df.species)
# Display ct
ct_fish
```

It looks like the fish data separates really well into 4 clusters!

```
movements = pd.read_csv("datasets/movements.csv").values
movements[:5]
```

A NumPy array `movements` of daily price movements from 2010 to 2015 (obtained from Yahoo! Finance), where each row corresponds to a company, and each column corresponds to a trading day.

Some stocks are more expensive than others. To account for this, we will include a `Normalizer` at the beginning of the pipeline. The `Normalizer` will separately transform each company's stock price to a relative scale before the clustering begins.

**Note:** `Normalizer()` is different from `StandardScaler()`. While `StandardScaler()` standardizes features (such as the features of the fish data) by removing the mean and scaling to unit variance, `Normalizer()` **rescales** each sample - here, each company's stock price - independently of the others.
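A minimal sketch of the difference (the values are made up): `Normalizer` works row by row, so two rows pointing in the same direction but at different scales map to the same unit vector:

```
import numpy as np
from sklearn.preprocessing import Normalizer

# Two "samples" (rows) on very different scales (made-up values)
X_rows = np.array([[3.0, 4.0],
                   [300.0, 400.0]])

# Normalizer rescales each ROW to unit norm, independently of the others
X_rows_norm = Normalizer().fit_transform(X_rows)
print(X_rows_norm)  # both rows become [0.6, 0.8]
```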

```
# Create a normalizer: normalizer_movements
normalizer_movements = Normalizer()
# Create a KMeans model with 10 clusters: kmeans_movements
kmeans_movements = KMeans(n_clusters=10)
# Make a pipeline chaining normalizer and kmeans: pipeline_movements
pipeline_movements = make_pipeline(normalizer_movements, kmeans_movements)
# Fit pipeline to the daily price movements
pipeline_movements.fit(movements)
```

Now that the pipeline has been set up, we can find out which stocks move together.

```
companies_movements=pd.read_csv("datasets/companies_movements.csv")
companies_movements.head()
```

```
companies_movements=companies_movements["0"].to_list()
companies_movements[:5]
```

A list `companies_movements` gives the company names.

```
# Predict the cluster labels: labels_movements
labels_movements = pipeline_movements.predict(movements)
# Create a DataFrame aligning labels and companies: df
movements_df = pd.DataFrame({'labels': labels_movements, 'companies': companies_movements})
# Display df sorted by cluster label
movements_df.sort_values("labels")
```

# Visualization with hierarchical clustering and t-SNE

We'll explore two unsupervised learning techniques for data visualization, hierarchical clustering and t-SNE. Hierarchical clustering merges the data samples into ever-coarser clusters, yielding a tree visualization of the resulting cluster hierarchy. t-SNE maps the data samples into 2D space so that the proximity of the samples to one another can be visualized.

## Visualizing hierarchies

## Visualisations communicate insight

- "t-SNE": creates a 2D map of a dataset (later)
- "Hierarchical clustering"

## A hierarchy of groups

- Groups of living things can form a hierarchy
- Clusters are contained in one another

## Eurovision scoring dataset

^{4}

- Countries gave scores to songs performed at the Eurovision 2016
- 2D array of scores
- Rows are countries, columns are songs

## Hierarchical clustering

- Every country begins in a separate cluster
- At each step, the two closest clusters are merged
- Continue until all countries are in a single cluster
- This is "agglomerative" hierarchical clustering

## The dendrogram of a hierarchical clustering

- Read from the bottom up
- Vertical lines represent clusters

With 5 data samples, there would be 4 merge operations, and with 6 data samples, there would be 5 merges, and so on.
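This merge count is easy to check: `linkage` returns one row per merge, so n samples always yield n - 1 rows (the toy points below are made up):

```
import numpy as np
from scipy.cluster.hierarchy import linkage

# 5 toy 2D samples (made-up points)
toy_points = np.array([[0.0, 0.0],
                       [0.1, 0.2],
                       [5.0, 5.0],
                       [5.1, 4.9],
                       [10.0, 0.0]])

mergings_toy = linkage(toy_points, method='complete')
print(mergings_toy.shape)  # (4, 4): one row per merge, so 5 samples -> 4 merges
```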

```
# Calculate the linkage: mergings_g
mergings_g = linkage(samples_grain, method='complete')
# Plot the dendrogram, using varieties as labels
plt.figure(figsize=(20,7))
dendrogram(mergings_g,
           labels=varieties,
           leaf_rotation=90,
           leaf_font_size=8,
           )
plt.show()
plt.show()
```

Dendrograms are a great way to illustrate the arrangement of the clusters produced by hierarchical clustering.

### Hierarchies of stocks

We used k-means clustering to cluster companies according to their stock price movements. Now, we'll perform hierarchical clustering of the companies. SciPy hierarchical clustering doesn't fit into a sklearn pipeline, so we'll need to use the `normalize()` function from `sklearn.preprocessing` instead of `Normalizer`.

```
# Normalize the movements: normalized_movements
normalized_movements = normalize(movements)
# Calculate the linkage: mergings
mergings_m = linkage(normalized_movements, method="complete")
# Plot the dendrogram
plt.figure(figsize=(20,10))
dendrogram(mergings_m, labels=companies_movements, leaf_font_size=12, leaf_rotation=90)
plt.show()
```

You can produce great visualizations such as this with hierarchical clustering, but it can be used for more than just visualizations.

## Cluster labels in hierarchical clustering

- Not only a visualisation tool!
- Cluster labels at any intermediate stage can be recovered
- For use in e.g. cross-tabulations

## Intermediate clusterings & height on dendrogram

- E.g. at height 15: Bulgaria, Cyprus, Greece are one cluster
- Russia and Moldova are another
- Armenia in a cluster on its own

### Dendrograms show cluster distances

- Height on dendrogram = distance between merging clusters
- E.g. clusters with only Cyprus and Greece had distance approx. 6
- This new cluster has distance approx. 12 from the cluster with only Bulgaria

## Intermediate clusterings & height on dendrogram

- Height on dendrogram specifies max. distance between merging clusters
- Don't merge clusters further apart than this (e.g. 15)

## Distance between clusters

- Defined by a "linkage method"
- Specified via the method parameter, e.g. `linkage(samples, method="complete")`
- In "complete" linkage: distance between clusters is the max. distance between their samples
- Different linkage method, different hierarchical clustering!

## Extracting cluster labels

- Use the `fcluster` function
- Returns a NumPy array of cluster labels

The linkage method defines how the distance between clusters is measured. In *complete* linkage, the distance between clusters is the distance between the furthest points of the clusters. In *single* linkage, the distance between clusters is the distance between the closest points of the clusters.
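To illustrate (with made-up 1-D points that form a chain), single linkage happily follows the chain while complete linkage breaks it:

```
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A chain of close points plus one distant point (made-up values)
chain_points = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])

# Cut both hierarchies at a maximum cluster distance of 2
labels_single_toy = fcluster(linkage(chain_points, method='single'), 2, criterion='distance')
labels_complete_toy = fcluster(linkage(chain_points, method='complete'), 2, criterion='distance')

print(len(set(labels_single_toy)))    # 2 clusters: the chain merges link by link
print(len(set(labels_complete_toy)))  # 3 clusters: the max pairwise distance grows too fast
```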

```
samples_eurovision = pd.read_csv("datasets/samples_eurovision.csv").values
samples_eurovision[:5]
```

```
country_names_eurovision = pd.read_csv("datasets/country_names_eurovision.csv")["0"].to_list()
country_names_eurovision[:5]
```

```
len(country_names_eurovision)
```

```
# Calculate the linkage: mergings
mergings_ev = linkage(samples_eurovision, method='single')
# Plot the dendrogram
plt.figure(figsize=(20,9))
dendrogram(mergings_ev, labels=country_names_eurovision, leaf_rotation=90, leaf_font_size=12)
plt.show()
```

```
# Use fcluster to extract labels: labels_g
labels_g = fcluster(mergings_g, 6, criterion='distance')
# Create a DataFrame with labels and varieties as columns: df
grain_df = pd.DataFrame({'labels': labels_g, 'varieties': varieties})
# Create crosstab: ct
grain_ct = pd.crosstab(grain_df.labels, grain_df.varieties)
# Display ct
print(grain_ct)
```

We've now mastered the fundamentals of k-Means and agglomerative hierarchical clustering. Next, we'll explore t-SNE, which is a powerful tool for visualizing high dimensional data.

## t-SNE for 2-dimensional maps

- t-SNE = "t-distributed stochastic neighbor embedding"
- Maps samples to 2D space (or 3D)
- Map approximately preserves nearness of samples
- Great for inspecting datasets

## t-SNE on the iris dataset

- Iris dataset has 4 measurements, so samples are 4-dimensional
- t-SNE maps samples to 2D space
- t-SNE didn't know that there were different species
- ... yet kept the species mostly separate

## Interpreting t-SNE scatter plots

- "versicolor" and "virginica" harder to distinguish from one another
- Consistent with k-means inertia plot: could argue for 2 clusters, or for 3

## t-SNE in sklearn

- 2D NumPy array `samples`
- List `species` giving the species of each sample as a number (0, 1, or 2)

```
samples[:5]
```

```
iris.target[:5]
```

```
model_i = TSNE(learning_rate=100)
transformed_i = model_i.fit_transform(samples)
xs_i = transformed_i[:,0]
ys_i = transformed_i[:,1]
plt.scatter(xs_i, ys_i, c=iris.target)
plt.show()
```

## t-SNE has only `fit_transform()`

- Has a `fit_transform()` method
- Simultaneously fits the model and transforms the data
- Has no separate `fit()` or `transform()` methods
- Can't extend the map to include new data samples
- Must start over each time!

## t-SNE learning rate

- Choose the learning rate for the dataset
- Wrong choice: points bunch together
- Try values between 50 and 200

### Different every time

- t-SNE features are different every time
- Piedmont wines, 3 runs, 3 different scatter plots!
- ... however: the wine varieties (= colors) have the same position relative to one another

```
variety_numbers_g = pd.read_csv("datasets/variety_numbers_grains.csv")["0"].to_list()
variety_numbers_g[:5]
```

```
samples_grain[:5]
```

```
# Create a TSNE instance: model
model_g = TSNE(learning_rate=200)
# Apply fit_transform to samples: tsne_features
tsne_features_g = model_g.fit_transform(samples_grain)
# Select the 0th feature: xs
xs_g = tsne_features_g[:,0]
# Select the 1st feature: ys
ys_g = tsne_features_g[:,1]
# Scatter plot, coloring by variety_numbers_g
plt.scatter(xs_g, ys_g, c=variety_numbers_g)
plt.show()
```

The t-SNE visualization manages to separate the 3 varieties of grain samples. But how will it perform on the stock data?

### t-SNE map of the stock market

t-SNE provides great visualizations when the individual samples can be labeled. We'll apply t-SNE to the company stock price data. A scatter plot of the resulting t-SNE features, labeled by the company names, gives a map of the stock market! The stock price movements for each company are available as the array `normalized_movements`. The list `companies_movements` gives the name of each company.

```
# Create a TSNE instance: model
model_m = TSNE(learning_rate=50)
# Apply fit_transform to normalized_movements: tsne_features
tsne_features_m = model_m.fit_transform(normalized_movements)
# Select the 0th feature: xs
xs_m = tsne_features_m[:,0]
# Select the 1st feature: ys
ys_m = tsne_features_m[:,1]
# Scatter plot
plt.figure(figsize=(20,14))
plt.scatter(xs_m, ys_m)
# Annotate the points
for x, y, company in zip(xs_m, ys_m, companies_movements):
    plt.annotate(company, (x, y), fontsize=12)
plt.show()
```

It's visualizations such as this that make t-SNE such a powerful tool for extracting quick insights from high dimensional data.

# Decorrelating the data and dimension reduction

Dimension reduction summarizes a dataset using its commonly occurring patterns. We'll explore the most fundamental of dimension reduction techniques, "Principal Component Analysis" ("PCA"). PCA is often used before supervised learning to improve model performance and generalization. It can also be useful for unsupervised learning. For example, we'll employ a variant of PCA that will allow us to cluster Wikipedia articles by their content!

## Visualizing the PCA transformation

## Dimension reduction

- More efficient storage and computation
- Remove less-informative "noise" features
- ... which cause problems for prediction tasks, e.g. classification, regression
## Principal Component Analysis

- PCA = "Principal Component Analysis"
- Fundamental dimension reduction technique
- First step "decorrelation" (considered below)
- Second step reduces dimension (considered later)
## PCA aligns data with axes

- Rotates data samples to be aligned with axes
- Shifts data samples so they have mean 0
- No information is lost

## PCA follows the fit/transform pattern

- `PCA` is a scikit-learn component like `KMeans` or `StandardScaler`
- `fit()` learns the transformation from given data
- `transform()` applies the learned transformation
- `transform()` can also be applied to new data

```
wine = pd.read_csv("datasets/wine.csv")
wine.head()
```

```
samples_wine = wine[['total_phenols', 'od280']].values
samples_wine[:5]
```

```
model_w = PCA()
model_w.fit(samples_wine)
```

```
transformed_w = model_w.transform(samples_wine)
```

```
transformed_w[:5]
```

## PCA features are not correlated

- Features of a dataset are often correlated, e.g. total_phenols and od280
- PCA aligns the data with axes
- Resulting PCA features are not linearly correlated ("decorrelation")

## Pearson correlation

- Measures linear correlation of features
- Value between -1 and 1
- Value of 0 means no linear correlation
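A quick illustration of `pearsonr` from `scipy.stats` on made-up, perfectly linear data:

```
from scipy.stats import pearsonr

# y is an exact linear function of x (made-up values)
x_toy = [1.0, 2.0, 3.0, 4.0]
y_toy = [2.0, 4.0, 6.0, 8.0]

correlation_toy, pvalue_toy = pearsonr(x_toy, y_toy)
print(correlation_toy)  # 1.0: perfect positive linear correlation
```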

## Principal components

- "Principal components" = directions of variance
- PCA aligns principal components with the axes
- Available as the `components_` attribute of the PCA object
- Each row defines displacement from mean

```
model_w.components_
```

```
# Assign the width column of the grain samples: width_g
width_g = samples_grain[:,4]
# Assign the length column of the grain samples: length_g
length_g = samples_grain[:,3]
# Scatter plot width vs length
plt.scatter(width_g, length_g)
plt.axis('equal')
plt.show()
# Calculate the Pearson correlation
correlation_g, pvalue_g = pearsonr(width_g, length_g)
# Display the correlation
correlation_g
```

As you would expect, the width and length of the grain samples are highly correlated.

```
grains = pd.read_csv("datasets/grains.csv").values
grains[:5]
```

```
# Create PCA instance: model_g
model_g = PCA()
# Apply the fit_transform method of model to grains: pca_features
pca_features_g = model_g.fit_transform(grains)
# Assign 0th column of pca_features: xs
xs_g = pca_features_g[:,0]
# Assign 1st column of pca_features: ys
ys_g = pca_features_g[:,1]
# Scatter plot xs vs ys
plt.scatter(xs_g, ys_g)
plt.axis('equal')
plt.show()
# Calculate the Pearson correlation of xs and ys
correlation_g, pvalue_g = pearsonr(xs_g, ys_g)
# Display the correlation
correlation_g
```

The principal components have to align with the axes of the point cloud.

## Intrinsic dimension

## Intrinsic dimension of a flight path

- 2 features: longitude and latitude at points along a flight path
- Dataset appears to be 2-dimensional
- But can approximate using one feature: displacement along flight path
- Is intrinsically 1-dimensional

## Intrinsic dimension

- Intrinsic dimension = number of features needed to approximate the dataset
- Essential idea behind dimension reduction
- What is the most compact representation of the samples?
- Can be detected with PCA
## Versicolor dataset

- "versicolor", one of the iris species
- Only 3 features:sepal length, sepal width, and petal width- Samples are points in 3D space ### Versicolor dataset has intrinsic dimension 2
- Samples lie close to a flat 2-dimensional sheet
- So can be approximated using 2 features

## PCA identifies intrinsic dimension

- Scatter plots work only if samples have 2 or 3 features
- PCA identifies intrinsic dimension when samples have any number of features
- Intrinsic dimension = number of PCA features with significant variance
## PCA of the versicolor samples

## PCA features are ordered by variance descending

## Variance and intrinsic dimension

- Intrinsic dimension is the number of PCA features with significant variance
- In our example: the first two PCA features
- So the intrinsic dimension is 2

```
iris.target_names
```

```
versicolor = pd.DataFrame(iris.data, columns=iris.feature_names)
versicolor['target'] = iris.target
versicolor = versicolor[versicolor.target==1]
versicolor.head()
```

```
samples_versicolor = versicolor[['sepal length (cm)', 'sepal width (cm)', 'petal width (cm)']].values
samples_versicolor[:5]
```

```
pca_versicolor = PCA()
pca_versicolor.fit(samples_versicolor)
```

```
features_versicolor = range(pca_versicolor.n_components_)
plt.bar(features_versicolor, pca_versicolor.explained_variance_)
plt.xticks(features_versicolor)
plt.xlabel("PCA feature")
plt.ylabel("Variance")
plt.show()
```

```
# Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])
# Fit model to points
model_g.fit(grains)
# Get the mean of the grain samples: mean
mean_g = model_g.mean_
# Get the first principal component: first_pc
first_pc_g = model_g.components_[0,:]
# Plot first_pc as an arrow, starting at mean
plt.arrow(mean_g[0], mean_g[1], first_pc_g[0], first_pc_g[1], color='blue', width=0.01)
# Keep axes on same scale
plt.axis('equal')
plt.show()
```

This is the direction in which the grain data varies the most.

```
# Create scaler: scaler
scaler_fish = StandardScaler()
# Create a PCA instance: pca
pca_fish = PCA()
# Create pipeline: pipeline
pipeline_fish = make_pipeline(scaler_fish, pca_fish)
# Fit the pipeline to 'samples'
pipeline_fish.fit(samples_fish)
# Plot the explained variances
features_fish = range(pca_fish.n_components_)
plt.bar(features_fish, pca_fish.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features_fish)
plt.show()
```

PCA features 0 and 1 have significant variance, so the intrinsic dimension of this dataset appears to be 2.

## Dimension reduction with PCA

## Dimension reduction

- Represents the same data, using fewer features
- Important part of machine-learning pipelines
- Can be performed using PCA
## Dimension reduction with PCA

- PCA features are in decreasing order of variance
- Assumes the low variance features are "noise"
- ... and high variance features are informative
## Dimension reduction with PCA

- Specify how many features to keep
- E.g. `PCA(n_components=2)` keeps the first 2 PCA features
- Intrinsic dimension is a good choice

## Dimension reduction of iris dataset

- `samples` = array of iris measurements (4 features)
- `species` = list of iris species numbers

## Dimension reduction with PCA

- Discards low variance PCA features
- Assumes the high variance features are informative
- Assumption typically holds in practice (e.g. for iris)
## Word frequency arrays

- Rows represent documents, columns represent words
- Entries measure presence of each word in each document
- ... measured using "tf-idf" (more later)

## Sparse arrays and csr_matrix

- Array is "sparse": most entries are zero
- Can use `scipy.sparse.csr_matrix` instead of a NumPy array
- `csr_matrix` remembers only the non-zero entries (saves space!)

## TruncatedSVD and csr_matrix

- scikit-learn PCA doesn't support `csr_matrix`
- Use scikit-learn `TruncatedSVD` instead
- Performs the same transformation
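A minimal sketch of the space saving (the counts are made up): `csr_matrix` stores only the non-zero entries of a mostly-zero array:

```
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero "word frequency" array (made-up counts)
dense_toy = np.array([[0, 0, 3, 0, 0],
                      [0, 1, 0, 0, 0],
                      [0, 0, 0, 0, 2]])

sparse_toy = csr_matrix(dense_toy)
print(sparse_toy.nnz)        # 3: only the non-zero values are stored, not all 15
print(sparse_toy.toarray())  # round-trips back to the dense array
```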

```
scaled_samples_fish = pd.read_csv("datasets/scaled_samples_fish.csv").values
scaled_samples_fish[:5]
```

```
# Create a PCA model with 2 components: pca
pca_fish = PCA(n_components=2)
# Fit the PCA instance to the scaled samples
pca_fish.fit(scaled_samples_fish)
# Transform the scaled samples: pca_features
pca_features_fish = pca_fish.transform(scaled_samples_fish)
# Print the shape of pca_features
pca_features_fish.shape
```

We've successfully reduced the dimensionality from 6 to 2.

### A tf-idf word-frequency array

We'll create a tf-idf word frequency array for a toy collection of documents. For this, we will use the `TfidfVectorizer` from sklearn. It transforms a list of documents into a word frequency array, which it outputs as a `csr_matrix`. It has `fit()` and `transform()` methods like other sklearn objects.

```
documents = ['cats say meow', 'dogs say woof', 'dogs chase cats']
```

```
# Create a TfidfVectorizer: tfidf
tfidf_d = TfidfVectorizer()
# Apply fit_transform to document: csr_mat
csr_mat_d = tfidf_d.fit_transform(documents)
# Print result of toarray() method
print(csr_mat_d.toarray())
# Get the words: words
words_d = tfidf_d.get_feature_names()
# Print words
words_d
```

### Clustering Wikipedia part I

`TruncatedSVD` is able to perform PCA on sparse arrays in `csr_matrix` format, such as word-frequency arrays. We will cluster some popular pages from Wikipedia ^{5}. We will build the pipeline and apply it to the word-frequency array of some Wikipedia articles.

The Pipeline object will be consisting of a `TruncatedSVD`

followed by `KMeans`

.

```
# Create a TruncatedSVD instance: svd
svd_wp = TruncatedSVD(n_components=50)
# Create a KMeans instance: kmeans
kmeans_wp = KMeans(n_clusters=6)
# Create a pipeline: pipeline
pipeline_wp = make_pipeline(svd_wp, kmeans_wp)
```

Now that we have set up the pipeline, we will use it to cluster the articles.

```
wv = pd.read_csv("datasets/Wikipedia_articles/wikipedia-vectors.csv", index_col=0)
articles = csr_matrix(wv.transpose())
articles_titles = list(wv.columns)
```

```
# Fit the pipeline to articles
pipeline_wp.fit(articles)
# Calculate the cluster labels: labels
labels_wp = pipeline_wp.predict(articles)
# Create a DataFrame aligning labels and titles: df
wp = pd.DataFrame({'label': labels_wp, 'article': articles_titles})
# Display df sorted by cluster label
wp.sort_values("label")
```

# Discovering interpretable features

We'll explore a dimension reduction technique called "Non-negative matrix factorization" ("NMF") that expresses samples as combinations of interpretable parts. For example, it expresses documents as combinations of topics, and images in terms of commonly occurring visual patterns. We'll also explore how to use NMF to build recommender systems that can find us similar articles to read, or musical artists that match your listening history!

## Non-negative matrix factorization (NMF)

- NMF = "non-negative matrix factorization"
- Dimension reduction technique
- NMF models are interpretable (unlike PCA)
- Easy to interpret means easy to explain!
- However, all sample features must be non-negative (>= 0)
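The non-negativity requirement is enforced by scikit-learn. A minimal sketch with hypothetical toy data, showing that `NMF` fits non-negative samples but rejects data containing a negative entry:

```python
import numpy as np
from sklearn.decomposition import NMF

# All entries non-negative: OK for NMF
samples = np.array([[1.0, 0.5],
                    [0.2, 0.8],
                    [0.9, 0.1]])

model = NMF(n_components=2, init='random', random_state=0, max_iter=500)
model.fit(samples)  # works: data is non-negative

# A negative entry makes NMF raise a ValueError
rejected = False
try:
    model.fit(np.array([[1.0, -0.5],
                        [0.2, 0.8]]))
except ValueError:
    rejected = True
print("negative data rejected:", rejected)
```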

## Interpretable parts

- NMF expresses documents as combinations of topics (or "themes")
- NMF expresses images as combinations of patterns
## Using scikit-learn NMF

- Follows the `fit()`/`transform()` pattern
- Must specify the number of components, e.g. `NMF(n_components=2)`
- Works with NumPy arrays and with `csr_matrix`

## Example word-frequency array

- Word frequency array, 4 words, many documents
- Measure presence of words in each document using "tf-idf"
- "tf" = frequency of word in document
- "idf" reduces influence of frequent words
## NMF components

- NMF has components
- ... just like PCA has principal components
- Dimension of components = dimension of samples
- Entries are non-negative
## NMF features

- NMF feature values are non-negative
- Can be used to reconstruct the samples
- ... combine feature values with components
## Sample reconstruction

- Multiply components by feature values, and add up
- Can also be expressed as a product of matrices
- This is the "Matrix Factorization" in "NMF"
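The factorization can be sketched with a hypothetical toy array: the NMF feature values (one row per sample) multiplied by the components (one row per component) approximately reconstruct the original samples:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative samples: 4 samples x 3 features, rank 2
samples = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 1.0],
                    [2.0, 0.0, 4.0],
                    [0.0, 2.0, 2.0]])

model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
features = model.fit_transform(samples)   # shape (4, 2): feature values
components = model.components_            # shape (2, 3): components

# The "matrix factorization" in NMF: samples ~= features @ components
reconstruction = features @ components
print(np.round(reconstruction, 2))  # close to the original samples
```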
## NMF fits to non-negative data, only

- Word frequencies in each document
- Images encoded as arrays
- Audio spectrograms
- Purchase histories on e-commerce sites
- ... and many more!

```
# Create an NMF instance: model
model_wp = NMF(n_components=6)
# Fit the model to articles
model_wp.fit(articles)
# Transform the articles: nmf_features
nmf_features_wp = model_wp.transform(articles)
# Print the NMF features
nmf_features_wp[:5]
```

### NMF features of the Wikipedia articles

**Note:** When investigating the features, notice that for both actors, NMF feature 3 has by far the highest value. This means that both articles are reconstructed mainly from the 3rd NMF component. We'll soon see why: NMF components represent topics (for instance, acting!).

```
# Create a pandas DataFrame: df
wp_df = pd.DataFrame(nmf_features_wp, index=articles_titles)
# Print the row for 'Anne Hathaway'
display(wp_df.loc[['Anne Hathaway']])
# Print the row for 'Denzel Washington'
display(wp_df.loc[['Denzel Washington']])
```

```
articles.shape
```

## NMF components are topics

- For documents:
  - NMF components represent topics
  - NMF features combine topics into documents
- For images, NMF components are parts of images
## Grayscale images

- "Grayscale" image = no colors, only shades of gray
- Measure pixel brightness
- Represent with value between 0 and 1 (0 is black)
- Convert to 2D array
## Encoding a collection of images

- Collection of images of the same size
- Encode as 2D array
- Each row corresponds to an image
- Each column corresponds to a pixel
- ... can apply NMF!

### NMF learns topics of documents

When NMF is applied to documents, the components correspond to topics, and the NMF features reconstruct the documents from those topics. We will use the NMF model that we built earlier on the Wikipedia articles. The 3rd NMF feature value was high for the articles about actors Anne Hathaway and Denzel Washington, so we will identify the topic of the corresponding NMF component.

```
words = pd.read_csv("datasets/Wikipedia_articles/words.csv")["0"].to_list()
words[:5]
```

```
# Create a DataFrame: components_df
components_df = pd.DataFrame(model_wp.components_, columns=words)
# Print the shape of the DataFrame
print(components_df.shape)
# Select row 3: component
component = components_df.iloc[3]
# Print result of nlargest
component.nlargest()
```

```
samples_images = pd.read_csv("datasets/samples_images.csv")
x=samples_images.isnull().sum()
x[x>0]
```

```
samples_images=samples_images.values
np.isinf(samples_images).any()
```

```
np.isnan(samples_images).any()
```

```
# Select the 0th row: digit
digit_i = samples_images[0,:]
# Print digit
print(digit_i)
# Reshape digit to a 13x8 array: bitmap
bitmap_i = digit_i.reshape(13,8)
# Print bitmap
print(bitmap_i)
# Use plt.imshow to display bitmap
plt.imshow(bitmap_i, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()
```

```
def show_as_image(sample):
    """Displays the image encoded by a 1D array."""
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap='gray', interpolation='nearest')
    plt.colorbar()
    plt.show()
```

```
show_as_image(samples_images[99, :])
```

```
# Create an NMF model: model
model_i = NMF(n_components=7)
# Apply fit_transform to samples: features
features_i = model_i.fit_transform(samples_images)
# Call show_as_image on each component
for component in model_i.components_:
    show_as_image(component)
# Assign the 0th row of features: digit_features
digit_features_i = features_i[0,:]
# Print digit_features
digit_features_i
```

```
# Create a PCA instance: model
model_i = PCA(n_components=7)
# Apply fit_transform to samples: features
features_i = model_i.fit_transform(samples_images)
# Call show_as_image on each component
for component in model_i.components_:
    show_as_image(component)
```

Notice that the components of PCA do not represent meaningful parts of the images of LED digits!

## Building recommender systems using NMF

## Finding similar articles

- You're an engineer at a large online newspaper
- Task: recommend articles similar to the article being read by a customer
- Similar articles should have similar topics
## Strategy

- Apply NMF to the word-frequency array
- NMF feature values describe the topics
- ... so similar documents have similar NMF feature values
- Compare NMF feature values?
## Versions of articles

- Different versions of the same document have same topic proportions
- ... exact feature values may be different!
- E.g. because one version uses many meaningless words
- But all versions lie on the same line through the origin
## Cosine similarity

- Uses the angle between the lines
- Higher values means more similar
- Maximum value is 1, when angle is 0
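A minimal numeric sketch, with hypothetical feature vectors: the cosine of the angle between two vectors is their dot product divided by the product of their norms, so scaled versions of the same vector (same line through the origin) have cosine similarity exactly 1:

```python
import numpy as np

# Two "versions" of the same document lie on one line through the origin
a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])   # a scaled copy of a
c = np.array([0.0, 1.0, 3.0])   # a different topic mix

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))  # 1.0: same direction, maximally similar
print(cosine(a, c))  # ~0.28: different direction, less similar
```

This is why the code below first applies `normalize()`: once every row has unit length, a plain dot product between rows equals the cosine similarity.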

```
# Normalize the NMF features: norm_features
norm_features_wp = normalize(nmf_features_wp)
# Create a DataFrame: df
wp_df = pd.DataFrame(norm_features_wp, index=articles_titles)
# Select the row corresponding to 'Cristiano Ronaldo': article
article = wp_df.loc['Cristiano Ronaldo']
# Compute the dot products: similarities
similarities = wp_df.dot(article)
# Display those with the largest cosine similarity
similarities.nlargest()
```

```
artists_df = pd.read_csv("datasets/Musical_artists/scrobbler-small-sample.csv")
artists = csr_matrix(artists_df)
```

```
# Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()
# Create an NMF model: nmf
nmf = NMF(n_components=20)
# Create a Normalizer: normalizer
normalizer = Normalizer()
# Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)
# Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)
```

```
artist_names = pd.read_csv("datasets/Musical_artists/artists.csv")["0"].to_list()
```

```
# Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)
# Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']
# Compute cosine similarities: similarities
similarities = df.dot(artist)
# Display those with highest cosine similarity
similarities.nlargest()
```
