Understanding GoogLeNet Model - CNN Architecture

Last Updated : 18 Nov, 2021

Google Net (or Inception V1) was proposed by research at Google (with the collaboration of various universities) in 2014 in the research paper titled "Going Deeper with Convolutions". This architecture was the winner at the ILSVRC 2014 image classification challenge. It has provided a significant decrease in error rate as compared to previous winners AlexNet (Winner of ILSVRC 2012) and ZF-Net (Winner of ILSVRC 2013) and significantly less error rate than VGG (2014 runner up). This architecture uses techniques such as 1x1 convolutions in the middle of the architecture and global average pooling. Features of GoogleNet: The GoogLeNet architecture is very different from previous state-of-the-art architectures such as AlexNet and ZF-Net. It uses many different kinds of methods such as 1x1 convolution and global average pooling that enables it to create deeper architecture. In the architecture, we will discuss some of these methods:

1x1 convolution : The inception architecture uses 1x1 convolution in its architecture. These convolutions used to decrease the number of parameters (weights and biases) of the architecture. By reducing the parameters we also increase the depth of the architecture. Let's look at an example of a 1x1 convolution below:
- For Example, If we want to perform 5x5 convolution having 48 filters without using 1x1 convolution as intermediate:

Total Number of operations : (14 x 14 x 48) x (5 x 5 x 480) = 112.9 M
- With 1x1 convolution :

(14 x 14 x 16) x (1 x 1 x 480) + (14 x 14 x 48) x (5 x 5 x 16) = 1.5M + 3.8M = 5.3M which is much smaller than 112.9M.

Global Average Pooling :
In the previous architecture such as AlexNet, the fully connected layers are used at the end of the network. These fully connected layers contain the majority of parameters of many architectures that causes an increase in computation cost.
In GoogLeNet architecture, there is a method called global average pooling is used at the end of the network. This layer takes a feature map of 7x7 and averages it to 1x1. This also decreases the number of trainable parameters to 0 and improves the top-1 accuracy by 0.6%
Inception Module:
The inception module is different from previous architectures such as AlexNet, ZF-Net. In this architecture, there is a fixed convolution size for each layer.
In the Inception module 1x1, 3x3, 5x5 convolution and 3x3 max pooling performed in a parallel way at the input and the output of these are stacked together to generated final output. The idea behind that convolution filters of different sizes will handle objects at multiple scale better.

Auxiliary Classifier for Training:
Inception architecture used some intermediate classifier branches in the middle of the architecture, these branches are used during training only. These branches consist of a 5x5 average pooling layer with a stride of 3, a 1x1 convolutions with 128 filters, two fully connected layers of 1024 outputs and 1000 outputs and a softmax classification layer. The generated loss of these layers added to total loss with a weight of 0.3. These layers help in combating gradient vanishing problem and also provide regularization.

Model Architecture: Below is Layer by Layer architectural details of GoogLeNet.

The overall architecture is 22 layers deep. The architecture was designed to keep computational efficiency in mind. The idea behind that the architecture can be run on individual devices even with low computational resources. The architecture also contains two auxiliary classifier layer connected to the output of Inception (4a) and Inception (4d) layers. The architectural details of auxiliary classifiers as follows:

An average pooling layer of filter size 5x5 and stride 3.
A 1x1 convolution with 128 filters for dimension reduction and ReLU activation.
A fully connected layer with 1025 outputs and ReLU activation
Dropout Regularization with dropout ratio = 0.7
A softmax classifier with 1000 classes output similar to the main softmax classifier.

This architecture takes image of size 224 x 224 with RGB color channels. All the convolutions inside this architecture uses Rectified Linear Units (ReLU) as their activation functions. Results: GoogLeNet was the winner at ILSRVRC 2014 taking 1^st place in both classification an detection task. It has top-5 error rate of 6.67% in classification task. An ensemble of 6 GoogLeNets gives 43.9 % mAP on ImageNet test set.