“The more relevant patterns at your disposal, the better your decisions will be.” - Herbert Simon
Vehicle Identification Image Recognition
Convolutional Neural Networks, abbreviated as ConvNets or CNNs, are feed-forward artificial neural networks with learnable weights and biases that are often used for image recognition in machine learning. The development of CNNs traces back to biology and brain research in the 1960s, particularly Hubel and Wiesel’s breakthrough understanding of the locally sensitive, orientation-specific neurons in a cat’s visual cortex.
Two key concepts underlying Convolutional Neural Networks are local receptive fields and weight sharing. With local receptive fields, each unit (neuron) in a layer receives input from a small, localized region of the input image or the previous layer. These neurons extract basic features such as corners, edges, and end points, which subsequent layers combine into higher-level features.
With weight sharing, the units that detect the same elementary feature at different image locations use the same weights. Applying these shared weights across the input image during the forward pass is the convolution that gives the model its name. Weight sharing reduces the number of learnable parameters, which makes it possible to train the network with fewer examples and keeps CNNs computationally efficient when scaling up to large data sets.
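The two ideas above can be illustrated with a minimal sketch in plain Python. The toy image, the hand-written vertical-edge kernel, and the function name `conv2d_valid` are assumptions for illustration only; a trained network learns its own kernels, and, like most CNN frameworks, the sketch computes a cross-correlation (the kernel is not flipped).

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over every local receptive field (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Each output unit sees only a kh x kw patch of the input
            # (its local receptive field) and reuses the same weights.
            acc = 0.0
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 5x5 image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hypothetical 3x3 vertical-edge kernel (hand-written, not learned).
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)
# Each of the 3 rows is [3.0, 3.0, 0.0]: strong response at the edge, none in the flat region.

# Parameter count for a 3x3 output: one shared kernel versus a separate
# weight from every input pixel to every output unit (fully connected).
shared_params = 3 * 3              # 9
dense_params = (5 * 5) * (3 * 3)   # 225
```

Even at this toy size the shared kernel needs 9 weights where a fully connected mapping would need 225, and the gap widens quickly as images grow.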
As a pattern recognition system, a CNN receives a raw pixel image as input and performs a series of transformations through hidden layers. The convolution layer slides its filters across the image grid to detect features and constructs a feature map that is passed to the next layer. The pooling layer downsamples the feature map, reducing its resolution and the sensitivity of the output to shifts and distortions. Finally, the fully connected (output) layer reduces the image to a single vector of class scores.
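As a rough illustration of the pooling step, the sketch below assumes non-overlapping 2 x 2 max pooling, which is one common choice; the text does not say which pooling operation was used. The function name `max_pool2d` and the toy feature maps are hypothetical.

```python
def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; assumes both dimensions are divisible by size."""
    pooled = []
    for i in range(0, len(feature_map), size):
        row = []
        for j in range(0, len(feature_map[0]), size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# A 4x4 feature map with two strong activations.
original = [[0, 9, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 0],
            [0, 0, 0, 5]]

# The same activations, each shifted by one pixel within its 2x2 window.
shifted = [[0, 0, 0, 0],
           [9, 0, 0, 0],
           [0, 0, 5, 0],
           [0, 0, 0, 0]]

pooled_original = max_pool2d(original)
pooled_shifted = max_pool2d(shifted)
# Both pool to [[9, 0], [0, 5]]: half the resolution, and the small shift disappears.
```

The output has a quarter as many values as the input, and a one-pixel shift that stays inside a pooling window leaves the result unchanged.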
The purpose of the experiment was to train a Convolutional Neural Network and evaluate its performance in classifying images as “car” or “non-car”. In addition to training on the original cars dataset, we trained on each preprocessed variation of the dataset to compare test results.
We used the Caffe deep learning framework, created by Yangqing Jia and developed by the Berkeley Vision and Learning Center (BVLC), through the NVIDIA DIGITS (Deep Learning GPU Training System) interface. Caffe is a ConvNet implementation that uses the Open Source Computer Vision (OpenCV) library and supports graphics processing unit (GPU) acceleration through the NVIDIA CUDA Deep Neural Network library (cuDNN) to reduce training time. The environment consisted of the Ubuntu 14.04 LTS operating system, an NVIDIA GTX 970 CUDA-enabled graphics card, an Intel Xeon X5680 processor (12M cache, 3.33 GHz, 6.40 GT/s Intel QPI), 24 GB of RAM, and a 1 TB HDD.
We used Stanford Cars as the training dataset, which contained 8,147 color images. The dataset was split during classifier training with a ratio of 85% training (7,061 images), 5% validation (415 images), and 10% testing (831 images). The image type was “COLOR” and the encoding was “jpg”.
We trained the AlexNet CNN model over 30 epochs. Each image was resized with the squash method, which ignores the original aspect ratio, to 256 x 256 pixels before being input into the model. The same parameters used for the original dataset were applied to the Invert, Otsu, and Patch preprocessed dataset variations.
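The squash resize can be sketched as below. This is a simplified illustration, not the DIGITS implementation: it assumes nearest-neighbor sampling, whereas a real pipeline would likely interpolate, and the function name `squash_resize` and the 480 x 640 stand-in image are hypothetical.

```python
def squash_resize(image, out_h=256, out_w=256):
    """Nearest-neighbor resize to out_h x out_w, ignoring the aspect ratio ("squash")."""
    in_h, in_w = len(image), len(image[0])
    return [
        # Rows and columns are scaled independently, so a wide image is
        # compressed horizontally rather than cropped or padded.
        [image[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Stand-in for a 480x640 image: each "pixel" records its own (row, column).
wide = [[(r, c) for c in range(640)] for r in range(480)]

resized = squash_resize(wide)
# resized is 256 x 256; its corners still come from the corners of the source,
# so the whole image is kept and only its proportions change.
```

Because nothing is cropped, every part of the car remains in the frame, at the cost of distorting its proportions.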
Following training, we tested each trained model with a set of 25 new images from the ImageNet database comprising a mix of car and non-car images. Among the car images were alternate perspectives (front, rear, or oblique views), grouped objects (cars in a parking lot), intra-class variations (double-decker bus, military truck, tank, train car, cable car), scale variations, occluded objects, and close-up perspectives. The non-car images included photos of clouds, trees, a sidewalk, buildings, and a horse-drawn carriage, which is discussed below.
Trained on the original dataset, the Convolutional Neural Network performed with high accuracy on tests of new images. The classifier detected features common to class members and applied its learning to identify cars in most true cases. Intra-class variations, such as trucks and buses, were successfully classified as cars. Interestingly, the horse-drawn carriage was classified as a car. Because a carriage shares features with cars, such as wheels and its use for conveying people, and is a historical predecessor to cars, this result is arguably correct. Other similar objects, such as the cable car, train car, and tank, were also assigned to the same class as the Cars training set. Additionally, the close-up of a wheel mounted to the back of a jeep was classified as a car.
The classifier overcame distortion and occlusion in class objects. The crashed race car was correctly identified even though its rear was destroyed in the image. The partial image of a car in the parking lot, occluded by a bush and a parking gate, was also correctly identified.
The ConvNet failed to identify cars under viewpoint variation. Images showing the rear of a car were not identified correctly, while images showing cars from the front were. This lapse in accuracy can likely be overcome by enlarging the training set to include class objects in these orientations. Similarly, scale variation challenged the classifier: several cars that appeared small in the image were not correctly identified. As with viewpoint variation, this may be overcome by increasing the size of the training set.
False positives occurred as well. Several photos of clouds were correctly identified as non-class members. One image of clouds, however, was falsely labeled as a car. Upon examination, the oblique edge and spatial orientation of the cloud resemble the perspective of many cars in the training set, which may have contributed to this false conclusion.
Models trained on the Invert, Otsu, and Patch dataset variations performed significantly worse than the model trained on the original images. In all three cases of pre-processing, the classifier labeled each of the 25 test images as a non-class member. CNNs perform their own pre-processing during the feature extraction step in the convolution layer; therefore, pre-processing performed before input is not necessary and did not aid classification in our tests.
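For reference, two of the pre-processing variations can be sketched as follows. The text does not spell out how the Invert and Otsu datasets were produced, so this assumes plain pixel inversion and Otsu's global thresholding on 8-bit grayscale values. The helper names and the toy image are hypothetical, and the Patch variation is omitted because its definition is not given here.

```python
def invert(gray):
    """Invert an 8-bit grayscale image (rows of ints in 0..255)."""
    return [[255 - p for p in row] for row in gray]


def otsu_threshold(gray):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(gray, t):
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [[255 if p > t else 0 for p in row] for row in gray]


# Toy image: a dark region on the left and a bright region on the right.
gray = [[10, 12, 200, 210],
        [11, 13, 205, 215]]

inverted = invert(gray)
t = otsu_threshold(gray)
binary = binarize(gray, t)
# binary is [[0, 0, 255, 255], [0, 0, 255, 255]]
```

Thresholding of this kind collapses each image to two intensity levels, discarding the color and texture detail that the convolution layers would otherwise draw on, which may help explain the poor results on these variations.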
* Alternative frameworks to Caffe include Torch7 and Theano.
** An alternative model, LeNet, similarly squashes images to 32 x 32 pixels for processing.
1. D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction, and functional architecture in the cat’s visual cortex,” Journal of Physiology (London), vol. 160, pp. 106-154, 1962.
2. Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. Koh, and A. Y. Ng, “Tiled convolutional neural networks,” in NIPS, 2010.
3. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, 1998.