Summary of Krizhevsky et al.’s ImageNet Classification with Deep Convolutional Neural Networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton of the University of Toronto built a deep neural network architecture based on convolutional layers, nicknamed ‘AlexNet’, and with it won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. Here we summarize the paper in which they describe the architecture and training of their deep convolutional network, work that shaped image classification as we know it today.
ImageNet is a dataset of over 15 million high-resolution labeled images belonging to roughly 22,000 classes. The dataset was built by collecting images from the web and labeling them with Amazon’s Mechanical Turk crowd-sourcing tool.
Since 2010, as part of the Pascal Visual Object Challenge, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held annually. ILSVRC uses a subset of ImageNet with roughly 1,000 images in each of 1,000 classes: in all, about 1.2 million training images, 50,000 validation images, and 150,000 testing images. Two error rates are reported, top-1 and top-5, where the top-5 error rate is the fraction of test images for which the correct label is not among the model’s five most probable classes.
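The top-1 and top-5 metrics can be sketched in a few lines of plain Python. The scores and labels below are made-up toy values, not data from the paper:

```python
# Sketch of how top-1 / top-5 error rates are computed (toy, hypothetical data).
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    errors = 0
    for class_scores, label in zip(scores, labels):
        # Indices of the k classes with the highest predicted score.
        top_k = sorted(range(len(class_scores)),
                       key=lambda i: class_scores[i], reverse=True)[:k]
        if label not in top_k:
            errors += 1
    return errors / len(labels)

# Toy example: 3 images, 4 classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],    # true class 1 -> top-1 correct
    [0.5, 0.1, 0.3, 0.1],    # true class 2 -> top-1 wrong, but in the top 2
    [0.7, 0.15, 0.1, 0.05],  # true class 3 -> missed even at k = 2
]
labels = [1, 2, 3]
print(top_k_error(scores, labels, 1))  # 2 of 3 images missed at k = 1
print(top_k_error(scores, labels, 2))  # 1 of 3 images missed at k = 2
```

With k = 5 and 1,000 classes this is exactly the top-5 metric the paper reports.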
They tested many different architectures, removing and adding layers, until they settled on eight learned layers running on two GPUs: five convolutional layers followed by three fully connected layers. Below is an image of their architecture depicting the two GPUs, where one runs the layer-parts at the top of the figure and the other runs those at the bottom, communicating only at certain layers.
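Using the kernel shapes reported in the paper, we can tally where the network’s roughly 60 million parameters live; the layer names (conv1, fc6, etc.) are conventional labels, not identifiers from the paper:

```python
# Rough parameter tally for AlexNet's 8 learned layers, from the paper's kernel shapes.
# Each conv entry: (number of kernels, kernel height * width * input channels).
conv_layers = [
    (96,  11 * 11 * 3),    # conv1
    (256, 5 * 5 * 48),     # conv2: input split across the two GPUs, hence 48 channels
    (384, 3 * 3 * 256),    # conv3: the GPUs communicate here, so all 256 channels
    (384, 3 * 3 * 192),    # conv4
    (256, 3 * 3 * 192),    # conv5
]
# Each fully connected entry: (output units, input units).
fc_layers = [
    (4096, 6 * 6 * 256),   # fc6: flattened, pooled conv5 output
    (4096, 4096),          # fc7
    (1000, 4096),          # fc8: one output per ILSVRC class
]

total = sum(n * size + n for n, size in conv_layers + fc_layers)  # weights + biases
print(f"{total:,} parameters")  # roughly 60 million, as the paper states
```

Notice that the vast majority of the parameters sit in the fully connected layers, which is why dropout (below) is applied there.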
Key features of their neural network architecture
ReLU Nonlinearity: Instead of the sigmoid or tanh activation functions that were more common at the time, they used the Rectified Linear Unit, f(x) = max(0, x). Because ReLUs do not saturate, networks using them train several times faster than equivalent tanh networks.
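The speed advantage comes from the gradients, which a small sketch makes concrete: tanh’s gradient shrinks toward zero for large inputs, while ReLU’s stays at 1 for any positive input:

```python
import math

def relu(x):
    # ReLU: f(x) = max(0, x)
    return max(0.0, x)

def relu_grad(x):
    # Gradient of ReLU: 1 for positive inputs, 0 otherwise (no saturation).
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    # Gradient of tanh: 1 - tanh(x)^2, which vanishes as |x| grows.
    return 1.0 - math.tanh(x) ** 2

for x in (-2.0, 0.5, 4.0):
    print(x, relu(x), relu_grad(x), round(tanh_grad(x), 4))
```

At x = 4.0 the tanh gradient is already around 0.001 while ReLU’s is still 1.0, so gradient descent keeps making full-sized updates; this non-saturation is what the paper credits for the faster training.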
Training on Multiple GPUs (two GTX 580 GPUs with 3 GB of memory each): Using cross-GPU parallelization, the GPUs read from and write to one another’s memory directly, splitting the kernels between them and communicating only at certain layers. This reduced training time as well as the error rate.
They also used a then-new regularization technique called dropout to further prevent overfitting: during training, each hidden neuron in the first two fully connected layers is zeroed out with probability 0.5, so the network cannot rely on any particular neuron being present.
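A minimal sketch of dropout on a layer’s activations follows. Note one assumption: this uses the “inverted” variant that rescales surviving units at training time, whereas the paper instead multiplies outputs by 0.5 at test time; the two are equivalent in expectation:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training:
        return list(activations)  # at test time the full network is used
    # Scale surviving units by 1/(1-p) so the expected activation is unchanged.
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

rng = random.Random(0)  # seeded for reproducibility
acts = [0.2, 1.5, -0.3, 0.8]
print(dropout(acts, p=0.5, rng=rng))   # about half the units come out zeroed
print(dropout(acts, training=False))   # unchanged at test time
```

Each training pass thus samples a different thinned sub-network, which is why the paper describes dropout as forcing neurons to learn more robust features.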
Their neural network took five to six days to train on the two GPUs.
They won the competition, and although their results may not look impressive by today’s standards, they significantly improved on everything that had come before, even claiming that “Our results can be improved simply by waiting for faster GPU’s and bigger datasets to become available”. Below are the results of the ILSVRC 2010 and 2012 competitions.
With the rapid advances in GPU speed and dataset size since then, not to mention overall software and hardware upgrades, their research has proven invaluable to the field.
Their use of ReLU, parallel processing, and dropout is considered standard today.
I find what they were able to accomplish exciting, and I am humbled by how far the technology has advanced in the last decade. Only time will tell what the next decade has in store, and I hope I can be part of its progress.