Hand-written digit recognition (MNIST)

The challenge is to recognize handwritten digits from images. This is the famous MNIST dataset (available here), a good computer vision problem to start and practice with.

What is provided:
– A training data set: 42000 rows. Contains the “label” column.
– A test data set: 28000 rows. Does not contain the “label” column.
As indicated in the link above, each row represents a gray-scale image of a hand-drawn digit, from zero through nine. Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel value is an integer between 0 and 255, inclusive.

Libraries

import numpy as np
import pandas as pd
from PIL import Image
from sklearn import svm, model_selection, metrics
from sklearn.linear_model import LogisticRegression
from tensorflow import keras
import tensorflow as tf

Data preparation

Let’s load both training and test data

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

We can see the image behind each row in both these data sets. Here is an example for one index in the training set; change the value of index to display other images.

image_size = 28
index = 0
# keep only the 784 pixel columns of this row (without overwriting train)
pixels = train.drop(columns=["label"]).iloc[index]
first_image = pixels.to_numpy(dtype=np.uint8).reshape(image_size, image_size)
image = Image.fromarray(first_image)
image.show()

There isn’t much data preparation to do here, as the data is already clean. However, we separate the training set into y (the label) and x (the pixel columns, without the label).

y = train["label"]
x = train[train.columns.difference(["label"], sort=False)]

We also split the training set into a training set and a development (cross validation) set, and scale the pixel values. I could have split the data into a training, a cross validation and a test set, but since I won’t do extensive hyperparameter tuning, there is no need for three splits.

# split into training and cv sets
x_train, x_cv, y_train, y_cv = model_selection.train_test_split(x, y, test_size=0.1, random_state=42)
# scale pixel values from [0, 255] to [0, 1]
x_train, x_cv = x_train / 255.0, x_cv / 255.0

Modeling

Since this is a computer vision problem, deep learning is the go-to solution. However, we will first explore simpler models before moving to more complex ones. We’ll then be able to compare the accuracy (or any other chosen metric) of the different models.

SVM (with linear kernel)

SVM can be very slow when the training set is large, so we will use randomized search instead of grid search for hyperparameter tuning. Also, because of both the size of the training set and the number of features, we will use a linear kernel.

# candidate values for the regularization parameter C, from 2^-7 to 2^6
range_c = 2. ** np.arange(-7, 7, 1)
param_grid = {'C': range_c}
# randomized search over C with 5-fold cross validation
model = model_selection.RandomizedSearchCV(svm.SVC(random_state=42, kernel="linear"), param_grid, cv=5, scoring='accuracy')
model.fit(x_train, y_train)
pred_train = model.predict(x_train)
pred_cv = model.predict(x_cv)
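
To put the models on a common footing, the accuracy reported later can be computed with the metrics module imported above. A minimal sketch, which applies unchanged to the neural networks below:

# accuracy on the training and cross validation sets
print("Training accuracy:", metrics.accuracy_score(y_train, pred_train))
print("CV accuracy:", metrics.accuracy_score(y_cv, pred_cv))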

Shallow neural network

Let’s start with a neural network with a single hidden layer, i.e. a shallow neural network. I chose the number of neurons in the hidden layer to be roughly half the number of features, so approximately 400. It could be a hyperparameter to tune, but that may take some time depending on which device it runs on.

model = keras.Sequential([
        keras.layers.Dense(400, activation=tf.nn.leaky_relu),
        keras.layers.Dense(10, activation="softmax")])
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train.to_numpy(), y_train.to_numpy(), validation_data=(x_cv.to_numpy(), y_cv.to_numpy()), epochs=100, verbose=True)
pred_train = np.argmax(model.predict(x_train.to_numpy()), axis=1)
pred_cv = np.argmax(model.predict(x_cv.to_numpy()), axis=1)

Convolutional neural network

Since this is a problem linked to visual imagery, we will use a convolutional neural network. First, let’s reshape the data into (m, 28, 28, 1), where m is the number of examples, 28 is the height and width in pixels, and 1 is the single channel since the images are grayscale.

# reshape each 784-pixel row into a 28x28 single-channel image
x_train = x_train.values.reshape(x_train.shape[0], 28, 28, 1)
x_cv = x_cv.values.reshape(x_cv.shape[0], 28, 28, 1)

The model definition I chose is heavily inspired by the LeNet-5 model:
– A convolutional layer with 6 filters of size 5*5, a stride of 1 and valid padding.
– A max pooling layer that halves both the height and the width of its input.
– A convolutional layer with 16 filters of size 5*5, a stride of 1 and valid padding.
– A max pooling layer that halves both the height and the width of its input.
– A succession of dense layers, the last one with a softmax activation function to classify the input as a digit from 0 to 9.

model = keras.Sequential([
        keras.layers.Conv2D(filters=6, kernel_size=5, strides=1, padding='valid', activation=tf.nn.leaky_relu,
                            input_shape=(28,28,1)),
        keras.layers.MaxPool2D(pool_size=(2, 2), strides=None, padding='valid'),
        keras.layers.Conv2D(filters=16, kernel_size=5, strides=1, padding='valid', activation=tf.nn.leaky_relu),
        keras.layers.MaxPool2D(pool_size=(2, 2), strides=None, padding='valid'),
        keras.layers.Flatten(),
        keras.layers.Dense(120, activation=tf.nn.leaky_relu),
        keras.layers.Dense(84, activation=tf.nn.leaky_relu),
        keras.layers.Dense(10, activation="softmax"),
    ])
adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=adam, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, validation_data=(x_cv, y_cv), epochs=50, verbose=True)
pred_train = np.argmax(model.predict(x_train), axis=1)
pred_cv = np.argmax(model.predict(x_cv), axis=1)
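
The trained CNN can also be used to label the 28000 unlabeled rows of test.csv. Below is a minimal sketch; the ImageId/Label columns and the submission.csv file name are assumptions about the expected output format, not something defined in this post.

# prepare the held-out test set exactly like the training data: scale and reshape
x_test = (test / 255.0).values.reshape(test.shape[0], 28, 28, 1)
# predict a digit for each test image with the trained CNN
pred_test = np.argmax(model.predict(x_test), axis=1)
# assumed submission layout: ImageId starting at 1, predicted Label
submission = pd.DataFrame({"ImageId": np.arange(1, len(pred_test) + 1), "Label": pred_test})
submission.to_csv("submission.csv", index=False)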

Here is a synthesis of the accuracy of the 3 models. As expected, the CNN model gives the best results. If trained a little longer (more epochs), the results improve slightly on both the training and cross validation (test) sets.

                      SVM      Shallow NN   CNN
Training accuracy     0.954    0.997        0.998
Test accuracy         0.938    0.976        0.984

Let’s check some of the examples that the CNN misclassified. Most of them are hard to classify, even for a human.
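
Here is a minimal sketch of how those misclassified cross validation digits can be displayed with PIL, reusing pred_cv from the CNN above (the number of examples shown is arbitrary):

# positions of cross validation examples the CNN got wrong
wrong = np.where(pred_cv != y_cv.to_numpy())[0]
# display a few of them with their true and predicted labels
for i in wrong[:3]:
    img = (x_cv[i].reshape(28, 28) * 255).astype(np.uint8)
    print("true:", y_cv.iloc[i], "predicted:", pred_cv[i])
    Image.fromarray(img).show()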

I hope this was helpful; I know I had a lot of fun with my first computer vision problem! It was the ideal opportunity to dive into deep learning.

Thanks for reading this post and see you soon!
