---
jupytext:
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.16.2
kernelspec:
  display_name: python3
  language: python
  name: python3
---

# Digit recognition with SQFA

In this tutorial, we compare SQFA to standard dimensionality
reduction methods using the digit recognition dataset
[Street View House Numbers (SVHN)](http://ufldl.stanford.edu/housenumbers/).
We compare SQFA to different standard methods available
in the `sklearn` library: PCA, LDA, ICA and Factor Analysis.
To compare the methods, we test the performance of a
Quadratic Discriminant Analysis (QDA) classifier
trained on the features learned by each method.

We will show that SQFA features outperform those learned
by the other methods, while being learned in approximately
the same time as LDA filters.


## Street View House Numbers (SVHN) dataset

The SVHN dataset consists of images of house numbers taken from Google Street View,
and while it has a similar structure to the MNIST dataset, it is
significantly harder. Let's load the dataset and visualize some of the images.

```{code-cell} ipython3
:tags: [remove-output]

import torch
import matplotlib.pyplot as plt
import torchvision

torch.manual_seed(2)

# Download and load training and test datasets
trainset = torchvision.datasets.SVHN(root='./data', split='train', download=True)
testset = torchvision.datasets.SVHN(root='./data', split='test', download=True)

# Convert to PyTorch tensors, average channels and reshape
n_samples, n_channels, n_row, n_col = trainset.data.shape
x_train = torch.as_tensor(trainset.data).float()
x_train = x_train.mean(dim=1).reshape(-1, n_row * n_col)
y_train = torch.as_tensor(trainset.labels, dtype=torch.long)
x_test = torch.as_tensor(testset.data).float()
x_test = x_test.mean(dim=1).reshape(-1, n_row * n_col)
y_test = torch.as_tensor(testset.labels, dtype=torch.long)

# Scale data and subtract global mean
def scale_and_center(x_train, x_test):
    std = x_train.std()
    x_train = x_train / (std * n_row)
    x_test = x_test / (std * n_row)
    global_mean = x_train.mean(axis=0, keepdims=True)
    x_train = x_train - global_mean
    x_test = x_test - global_mean
    return x_train, x_test

x_train, x_test = scale_and_center(x_train, x_test)
```

```{code-cell} ipython3
# See how many dimensions, samples and classes we have
print(f"Number of dimensions: {x_train.shape[1]}")
print(f"Number of samples: {x_train.shape[0]}")
print(f"Number of classes: {len(torch.unique(y_train))}")
print(f"Number of test samples: {x_test.shape[0]}")

# Visualize some of the centered images
names = y_train.unique().tolist()
n_classes = len(y_train.unique())
fig, ax = plt.subplots(2, n_classes // 2, figsize=(8, 4))
for i in range(n_classes):
    row = i // 5
    col = i % 5
    ax[row, col].imshow(x_train[y_train == i][20].reshape(n_row, n_col), cmap='gray')
    ax[row, col].axis('off')
    ax[row, col].set_title(names[i], fontsize=10)
plt.tight_layout()
plt.show()
```

We see that we have 10 classes and that the training
data consists of 73257 samples of 1024 dimensions. We will now
learn 9 filters for this dataset using each of the different
dimensionality reduction methods.

:::{admonition} Maximum number of filters
A limitation of LDA is that it can learn a maximum of $c-1$ filters, where
$c$ is the number of classes. This is the reason why we learn 9 filters
in this tutorial. SQFA does not have this limitation.
:::

```{code-cell} ipython3
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.decomposition import PCA, FastICA, FactorAnalysis
from sklearn.cross_decomposition import CCA
import sqfa
import time

N_FILTERS = 9

# TRAIN THE DIFFERENT MODELS

# Train PCA
pca = PCA(n_components=N_FILTERS, svd_solver='covariance_eigh') # Fastest solver
start = time.time()
pca.fit(x_train)
pca_time = time.time() - start
pca_filters = pca.components_

# Train LDA
shrinkage = 0.8  # Set to optimize LDA performance and have smoother filters
lda = LinearDiscriminantAnalysis(solver='eigen', shrinkage=shrinkage)
start = time.time()
lda.fit(x_train, y_train)
lda_time = time.time() - start
lda_filters = lda.coef_[:N_FILTERS]

# Train ICA
ica = FastICA(n_components=N_FILTERS, random_state=0, max_iter=1000)
start = time.time()
ica.fit(x_train)
ica_time = time.time() - start
ica_filters = ica.components_

# Train Factor Analysis
fa = FactorAnalysis(n_components=N_FILTERS, random_state=0, max_iter=1000)
start = time.time()
fa.fit(x_train)
fa_time = time.time() - start
fa_filters = fa.components_

# Train SQFA
# Get noise hyperparameter from PCA variance
x_pca = torch.as_tensor(pca.transform(x_train))
pca_var = torch.var(x_pca, dim=0)
noise = pca_var[2] * 0.05

sqfa_model = sqfa.model.SQFA(
  n_dim=x_train.shape[1],
  n_filters=N_FILTERS,
  feature_noise=noise,
)

start = time.time()
sqfa_model.fit_pca(x_train) # Initialize filters with PCA
sqfa_model.fit(
  x_train,
  y_train,
  show_progress=False,
)
sqfa_time = time.time() - start
sqfa_filters = sqfa_model.filters.detach()
```

Let's visualize the filters learned by each method.

```{code-cell} ipython3
model_names = ["SQFA", "LDA", "PCA", "ICA", "FA"]
model_filters = [sqfa_filters, lda_filters, pca_filters,
                 ica_filters, fa_filters]

# Function to plot filters
def plot_filters(filters, title):
    fig, ax = plt.subplots(1, N_FILTERS, figsize=(10, 2))
    for i in range(N_FILTERS):
        ax[i].imshow(filters[i].reshape(n_row, n_col), cmap='gray')
        ax[i].axis('off')
        ax[i].set_title(f"Filter {i+1}")
    fig.suptitle(title, fontsize=16)
    plt.tight_layout()

for name, filters in zip(model_names, model_filters):
    plot_filters(filters, name)
    plt.show()
```

The features learned by the three models look different.
First, unsurprisingly, the filters learned by supervised
methods LDA and SQFA focus mostly on the digits, while
the filters learned by the unsupervised methods have
a considerable fraction of their weights in the background.
Second, SQFA filters have a more digit-like structure than
the rest of the methods. 

:::{admonition} Filter initialization
A good initialization of the filters can considerably speed up
the learning process. The method `fit_pca` of the `SQFA` class
sets the filters to the PCA components of the data.
:::

Lets evaluate how well the filters separate the classes quadratically,
by using a QDA classifier on each feature set.
QDA fits a Gaussian distribution (mean and covariance) to
each class and uses the Bayes rule to classify samples. Both the
class specific means and covariances are used
to classify samples.

```{code-cell} ipython3
def get_qda_accuracy(x_train, y_train, x_test, y_test, filters):
    """Fit QDA model to the training data and return the accuracy on the test data."""
    # Get the features
    filters = torch.as_tensor(filters, dtype=torch.float)
    z_train = torch.matmul(x_train, filters.T)
    z_test = torch.matmul(x_test, filters.T)
    # Fit QDA model
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(z_train, y_train)
    y_pred = qda.predict(z_test)
    accuracy = torch.mean(torch.as_tensor(y_pred == y_test.numpy(), dtype=torch.float))
    return accuracy

accuracies = []

for name, filters in zip(model_names, model_filters):
    accuracy = get_qda_accuracy(x_train, y_train, x_test, y_test, filters)
    accuracies.append(accuracy.item() * 100)

# Plot accuracies
fig, ax = plt.subplots(figsize=(6, 3))
plt.bar(range(len(accuracies)), accuracies)
plt.xticks(range(len(accuracies)), model_names, fontsize=12)
plt.yticks(fontsize=12)
plt.ylabel("QDA Accuracy (%)", fontsize=14)
plt.xlabel("Features", fontsize=14)
# Print the accuracies on top of the bars
for i, acc in enumerate(accuracies):
    plt.text(i, acc + 1, f"{acc:.1f}%", ha='center', fontsize=12)
plt.tight_layout()
ax.set_ylim([0, 100])
plt.show()
```

We see that SQFA outperforms all other methods by a large margin
in terms of classification accuracy. This is not surprising with
respect to the unsupervised methods, since the goal of these
methods is not to separate the classes. With respect to LDA,
it is also not surprising that taking into account the class-conditional
covariances leads to better performance (although the need to estimate
a covariance matrix for each class can make SQFA more prone to
overfitting in the absence of proper regularization).

SQFA learned the best filters for quadratic discrimination in
this task, but is its computational cost reasonable? Let's compare
the time it took to learn the filters for each method.

```{code-cell} ipython3
model_times = [sqfa_time, lda_time, pca_time, ica_time, fa_time]

fig, ax = plt.subplots(figsize=(6, 3))
plt.bar(range(len(model_times)), model_times)
plt.xticks(range(len(model_times)), model_names, fontsize=12)
plt.yticks(fontsize=12)
plt.ylabel("Training Time (s)", fontsize=14)
plt.xlabel("Model", fontsize=14)
# Make y axis logarithmic
plt.yscale('log')
# Print the times on top of the bars
for i, training_time in enumerate(model_times):
    plt.text(i, training_time * 1.5, f"{training_time:.2f}", ha='center', fontsize=12)
plt.tight_layout()
plt.ylim([min(model_times)*0.5, max(model_times) * 5])
plt.show()
```

We see that SQFA took approximately the same time as LDA to learn the
filters. The relative cost will depend on, but the fact that
SQFA is on par with LDA in terms of computational cost indicate that
it can be a good tool to use in practice.

In conclusion, SQFA can learn features that allow to discriminate
between classes in complex real-world datasets, and it can do so
at low computational cost.