Anomaly Detection with Auto-Encoders: How we used it for Cervical Cancer detection

Ally Salim
5 min read · Sep 18, 2018

Background:

Anomaly Detection, also known as Novelty Detection or One-Class Classification, is a technique applied when your dataset is extremely imbalanced or made up entirely of a single class.

This is common in fraud detection, or in any case where outliers rarely occur in the sample population. In our case, we are dealing with cervical cancer data from the Ocean Road Cancer Research Institute in Dar es Salaam, Tanzania.

In this post, we will look at how a classic Auto-encoder neural network can easily be trained to identify anomalies.

TL;DR

It is possible to use Auto-encoders for anomaly detection on extremely imbalanced datasets, and even on datasets where there is only one class.

This is done by first training the Auto-encoder to re-create the dominant class in your dataset. Once it has learned how to re-create the class it has seen before, we expect it to re-create any new, novel class poorly. By measuring how poorly the network reconstructs an unseen data point, we can then classify the point as novel or normal.

Problem Statement:

We set out to see whether we could apply machine learning techniques to identify women at risk of cervical cancer, and to build a tool that would then advise them on the next steps and where to go.

Ocean Road is the only cancer institute in the country, so 99.99% of the patients it sees have been referred by other hospitals and have already been diagnosed with cervical cancer. As you can imagine, our dataset is made up entirely of confirmed cases.

We ended up using a few algorithms in an ensemble-style setup, but here we will show only a simplified version, using a subset of the features and a generic Auto-encoder.

What is an Auto-encoder anyways?

According to Wikipedia:

An Auto-encoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an Auto-encoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction.

Put simply, an Auto-encoder is a neural network whose goal is to produce an output that is as similar to its input as possible. The network must meet two criteria:

  1. The number of output neurons must equal the number of input neurons
  2. The number of neurons in the hidden layer must be less than the number of input/output neurons
[Image: a classic Auto-encoder, with a narrow hidden layer between equally sized input and output layers. Credit: Curiously, What to do when data is missing? — Part II]

How it works

As we can see in the image above, the neural network is forced to learn only the most important features because of the restricted number of neurons in its hidden layer. The result is, more often than not, a pretty good copy and paste machine! Sounds useless, but this will make sense soon ;)

For the sake of simplicity, we are going to use only a subset of the dataset: the results of the Complete Blood Picture. These consist of the Red Blood Cell count (RBC), White Blood Cell count (WBC), Platelet count (PLT), Hemoglobin (Hg), Neutrophil count, Lymphocyte count, MCV, and EOS.

Don’t worry if none of those features make sense to you; they could be anything. For the iris dataset, for example, the features would be sepal_length, sepal_width, petal_length, and petal_width.
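Before training, those eight values need to be loaded and scaled into a tensor. Here is a minimal sketch of that preparation; the file name, column names, and the choice of scikit-learn's StandardScaler are all illustrative assumptions rather than the pipeline we actually used:

import pandas as pd
import torch
from sklearn.preprocessing import StandardScaler

# Hypothetical column names for the 8 blood-picture features
FEATURES = ["rbc", "wbc", "plt", "hg", "neutrophils", "lymphocytes", "mcv", "eos"]

df = pd.read_csv("blood_picture.csv")  # illustrative file name
scaler = StandardScaler()              # blood counts live on very different scales
X = torch.tensor(scaler.fit_transform(df[FEATURES].values), dtype=torch.float32)
# X has shape (num_patients, 8), ready to feed to the Auto-encoder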

Using the data subset described above, we have 8 features to train our model on. Let us now design the model. From what we know about Auto-encoders:

  1. Our model will have 8 input neurons
  2. Our model will have 8 output neurons (matching the input)
  3. Our model will have fewer than 8 neurons in its hidden layers (we will use only 4 neurons; experiment with this value.)

Below is the code we can write to describe this neural network in PyTorch:

First we import the necessary libraries:

from torch.nn import Module, Linear, MSELoss
from torch.optim import Adam

Then we build the neural network:

# Define our constants
input_neurons = 8
hidden_neurons = 4

# Create our network
class AutoEncoder(Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        # Encoder: compress the 8 input features down to 4
        self.fc1 = Linear(input_neurons, hidden_neurons)
        # A second hidden layer at the compressed size
        self.fc2 = Linear(hidden_neurons, hidden_neurons)
        # Decoder: expand back out to the original 8 features
        self.fc3 = Linear(hidden_neurons, input_neurons)

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x

# Instantiate the model and define the loss function & optimizer
model = AutoEncoder()
criterion = MSELoss()
optimizer = Adam(model.parameters())

That's it, well … mostly! The next step is to train your network until the reconstruction error cannot be improved any further.
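Here is a minimal sketch of such a training loop, assuming X is the tensor of training examples prepared earlier (one row of 8 scaled features per confirmed patient); the epoch count is an arbitrary placeholder to tune by experiment:

num_epochs = 200  # assumed value; tune for your data
for epoch in range(num_epochs):
    optimizer.zero_grad()                # clear gradients from the previous step
    reconstruction = model(X)            # forward pass: attempt to copy the input
    loss = criterion(reconstruction, X)  # reconstruction error vs. the original
    loss.backward()                      # backpropagate the error
    optimizer.step()                     # update the weights
    if epoch % 20 == 0:
        print(f"epoch {epoch}: loss {loss.item():.4f}")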

RECAP: We have created a network that learns to reproduce its input under the constraint of having fewer neurons in its hidden layers, forcing it to learn a very compressed representation of its input.

Making Predictions

You might be asking yourself: we have made a copy and paste machine, big deal! Now what?

This actually is a big deal. We have taught the computer to create very good copies of the examples it has seen, which means we should expect it to create poor copies of data it has never seen before.

We can tell how poorly the network re-creates its input by measuring the reconstruction error between the original input and the produced output.

In our case, with cervical cancer, we noted down how well the Auto-encoder was able to re-create each patient's details, and compared that performance against new, unseen patients.

Since our dataset was 100% cervical cancer patients, patients who are re-created well are assumed to belong to the cancer class, and patients who are re-created poorly are assumed to belong to the "anomaly class". Given the nature of our dataset, the anomalies are patients without cervical cancer, and the normal class is those with the cancer.
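A minimal sketch of that decision rule, assuming X_new is a tensor of unseen patients and a threshold calibrated on held-out data (the 0.5 below is purely illustrative):

def reconstruction_error(model, x):
    """Per-sample mean squared error between input and reconstruction."""
    model.eval()               # switch to evaluation mode
    with torch.no_grad():      # no gradients needed for scoring
        reconstruction = model(x)
    return ((x - reconstruction) ** 2).mean(dim=1)

threshold = 0.5  # illustrative; calibrate on held-out confirmed cases
errors = reconstruction_error(model, X_new)
is_anomaly = errors > threshold  # True: poorly re-created, so likely not in the cancer class

Raising the threshold flags fewer patients as anomalies but with more confidence; lowering it catches more anomalies at the cost of more false alarms.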

In our case, the AE on its own was able to achieve an accuracy of 88%! Not too bad!

Assumptions: I am assuming that you have split your data into training and testing sets, that you have already explored your data and selected the best features, and that you are using a dataset that is dominantly (or even entirely) one class.

Happy Classifications!

Follow me on Twitter, where I post about Machine Learning, Artificial Intelligence, Python, JavaScript, Software, and sometimes (rarely) life!


Ally Salim

I build technology for health equity with awesome people. Co-founder of Elsa Health | Working towards universal & scalable healthcare using technology.