
Use of Convolutional Neural Networks

A convolutional neural network (CNN) is a type of neural network most often applied to image-processing problems, though it can also be used for other data-analysis and classification tasks. CNNs are central to modern deep learning and artificial intelligence, and they have achieved superhuman performance on some complex visual tasks. They power image search services, self-driving cars, automatic video classification systems, and more. Moreover, CNNs are not restricted to visual perception: they also succeed at many other tasks, such as voice recognition and natural language processing (NLP). A CNN is an artificial neural network specialized for detecting patterns and making sense of them, and this pattern detection is what makes CNNs so useful for image analysis.

Why CNN?

  • CNNs achieve very high accuracy on image-recognition problems.

  • They automatically detect the crucial features without human instruction. With an ANN, concrete data points must be provided; a CNN extracts spatial features directly from the image input. This makes CNNs ideal when thousands of features need to be extracted: instead of each feature being measured by hand, the network learns them on its own.

  • They avoid excessive computation. Regular neural networks work fine for small images (e.g., MNIST) but break down for larger images because of the huge number of parameters they require. For example, a 100 × 100 image has 10,000 pixels. If the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), that means a total of 10 million connections, and that is just the first layer. CNNs solve this problem using partially connected layers and weight sharing, which reduce memory requirements and computational cost.

  • CNNs are location invariant: an object's position in the image does not need to be fixed for it to be detected.

  • Regular neural networks require different training samples depending on the location of the object in order to perform well, whereas a CNN classifies the image irrespective of the object's location.
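The parameter comparison in the bullets above can be checked with a little arithmetic. This sketch contrasts the fully connected example (100 × 100 pixels, 1,000 neurons) with a small convolutional layer; the 32-filter figure is an arbitrary assumption for illustration.

```python
# Compare parameter counts: dense layer vs. convolutional layer.

def dense_connections(pixels, neurons):
    """Connections in a fully connected layer: every pixel to every neuron."""
    return pixels * neurons

def conv_parameters(kernel_size, in_channels, num_filters):
    """Trainable weights in a conv layer: one shared kernel per filter (+ bias)."""
    return num_filters * (kernel_size * kernel_size * in_channels + 1)

dense = dense_connections(100 * 100, 1_000)   # the 10-million-connection example
conv = conv_parameters(3, 1, 32)              # 32 shared 3x3 filters: 320 weights
print(dense, conv)
```

Weight sharing is what makes the difference: the same 3 × 3 kernel is reused at every image position, so the parameter count does not grow with image size.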

Working of CNN:

The term convolutional neural network refers to a neural network with a specific architecture. Each hidden layer typically consists of two stages: the first is a local convolution of the previous layer (the kernel has trainable weights), and the second is a max-pooling stage, in which the number of units is drastically reduced by keeping only the maximum response of a few branches from the first stage. After several hidden layers, the final layer is usually fully connected: each unit receives input from all branches of the preceding layer, and there is one unit for each class the network predicts.

Typical CNN architectures stack a few convolutional layers (each generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps).
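This shrinking-and-deepening pattern can be traced with a short sketch. It assumes a 64 × 64 RGB input, 'same'-padded convolutions (which preserve height and width), and 2 × 2 pools with stride 2 (which halve them); the filter counts are made up for illustration.

```python
# Trace feature-map shapes through a typical conv/pool stack.
# 'same'-padded conv: spatial size unchanged, depth becomes the filter count.
# 2x2 max pool, stride 2: height and width are halved, depth unchanged.

def trace(shape, layers):
    h, w, d = shape
    for kind, filters in layers:
        if kind == "conv":
            d = filters                 # depth grows with the number of filters
        elif kind == "pool":
            h, w = h // 2, w // 2       # spatial halving
        print(kind, (h, w, d))
    return (h, w, d)

final = trace((64, 64, 3), [("conv", 32), ("conv", 32), ("pool", None),
                            ("conv", 64), ("pool", None)])
```

Running the trace shows exactly the pattern described above: the spatial dimensions shrink from 64 × 64 to 16 × 16 while the depth grows from 3 channels to 64 feature maps.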

Convolutional Layer:

The most crucial building block of a CNN is the convolutional layer. It requires a few components: input data, a filter, and a feature map. The neurons in the first convolutional layer are not connected to every pixel in the input image but only to pixels in their receptive fields (see Figure below). In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on small low-level features in the first hidden layer, assemble them into larger higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

The kernel/filter is the component of this layer that performs the convolution operation: a weight matrix that moves horizontally and vertically across the image, according to the stride, until the complete image has been scanned. This process is known as a convolution. The kernel is smaller than the image in height and width, but it extends through the image's full depth.

The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive field. The filter is then applied to an image area, and a dot product is calculated between the input pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the entire image. The final output from the series of dot products from the input and the filter is known as a feature map, activation map, or convolved feature.
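The sliding dot product described above can be written out in a few lines. This is a minimal 'valid' convolution with stride 1; the image values are arbitrary, and the filter is a simple vertical-edge detector chosen for illustration.

```python
# Slide a 3x3 filter over the image, take the dot product between the filter
# and each receptive field, and collect the results into a feature map.

def convolve2d(image, kernel, stride=1):
    k = len(kernel)
    out = []
    for i in range(0, len(image) - k + 1, stride):
        row = []
        for j in range(0, len(image[0]) - k + 1, stride):
            # dot product between the receptive field and the shared weights
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

image = [[3, 0, 1, 2],
         [1, 5, 8, 9],
         [2, 7, 2, 5],
         [1, 3, 0, 4]]
edge = [[1, 0, -1],           # vertical-edge detector: left column minus right
        [1, 0, -1],
        [1, 0, -1]]
feature_map = convolve2d(image, edge)   # a 4x4 image yields a 2x2 feature map
```

Note that the same `edge` weights are reused at every position; that reuse is the parameter sharing discussed below.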

Each output value in the feature map does not have to connect to every pixel value in the input image; it only connects to the receptive field where the filter is applied. Since the output array does not need to map directly to each input value, convolutional (and pooling) layers are commonly referred to as "partially connected" layers. This characteristic can also be described as local connectivity. The weights in the feature detector remain fixed as it moves across the image, which is known as parameter sharing.


  • Connection sparsity reduces overfitting.

  • Conv + Pooling gives location-invariant feature detection.

  • Parameter sharing keeps the number of trainable weights small.

Rectified Linear Unit (ReLu):

After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity to the model.

The feature maps are passed into an activation function, just as they would be in a normal artificial neural network. More specifically, they are passed into the rectifier function, which returns 0 if the input value is less than 0 and returns the input value otherwise.

The rectifier function is typically used as the activation function in a convolutional neural network because it increases the nonlinearity of the model. You can think of this in terms of an image: by removing negative values from the neurons' input signals, the rectifier effectively removes black pixels from the image and replaces them with grey pixels. It also speeds up training, since the function and its gradient are very cheap to compute.
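A minimal sketch of the rectifier applied elementwise to a feature map; the input values are arbitrary.

```python
# ReLU: negative inputs become 0, non-negative inputs pass through unchanged.

def relu(x):
    return max(0.0, x)

def relu_map(feature_map):
    """Apply ReLU elementwise to a 2-D feature map."""
    return [[relu(v) for v in row] for row in feature_map]

activated = relu_map([[-5.0, 2.0],
                      [0.5, -1.5]])
```

After this step the feature map contains no negative values, which is exactly the nonlinearity the section describes.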

Pooling Layer:

The goal of the pooling layers is to subsample the input image to reduce the computational load, memory usage, and the number of parameters.

Like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just like before. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or mean.

Pooling can be divided into two types: max pooling and average pooling. Max pooling returns the maximum value from the area of the image covered by the kernel; average pooling returns the average of all the values in that area. For example, in the figure below, max pooling is performed with a 2 x 2 kernel, a stride of 2, and no padding. Only the maximum input value in each receptive field makes it to the next layer; the other inputs are dropped. Consider the lower-left receptive field: the input values are 3, 1, 1, 2, so only the maximum value 3 is propagated to the next layer. Because of the stride of 2, the output image has half the height and half the width of the input image.
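The max-pooling step described above can be sketched in a few lines. The input values are arbitrary, except for the lower-left receptive field, which matches the 3, 1, 1, 2 example.

```python
# Max pooling with a 2x2 kernel and stride 2: each receptive field keeps
# only its maximum value, halving the height and width of the feature map.

def max_pool(fmap, size=2, stride=2):
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            row.append(max(fmap[i + a][j + b]
                           for a in range(size) for b in range(size)))
        out.append(row)
    return out

fmap = [[5, 8, 4, 7],
        [2, 6, 1, 9],
        [3, 1, 6, 0],    # lower-left field is 3, 1, 1, 2 -> keeps 3
        [1, 2, 4, 8]]
pooled = max_pool(fmap)
```

The 4 x 4 input shrinks to a 2 x 2 output, and the pooling layer itself contributes no trainable weights.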

Benefits of Pooling:

  • Reduces dimensions and computation.

  • Reduces overfitting, as there are fewer parameters.

  • Makes the model tolerant of small variations and distortions.

Fully Connected Layer:

The fully connected (FC) layer works with a flattened input obtained by flattening the feature map from the pooling layer, which means that each input is connected to every neuron. The flattened vector is then passed through a few additional hidden layers, where the usual weighted sums and activations are applied. The classification procedure starts at this point. When FC layers are present, they are typically found near the end of CNN architectures.

This layer performs classification based on the features extracted through the previous layers and their different filters. While convolutional and pooling layers tend to use ReLU functions, FC layers usually leverage a softmax activation function to classify inputs, producing a probability between 0 and 1 for each class.
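A minimal sketch of softmax applied to the FC layer's raw class scores; the scores themselves are arbitrary example values.

```python
# Softmax: exponentiate the raw scores and normalise so the outputs are
# probabilities in (0, 1) that sum to 1.
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])          # raw scores for three classes
predicted = probs.index(max(probs))       # index of the most probable class
```

Subtracting the maximum score before exponentiating does not change the result but avoids overflow for large scores, which is why it is the standard way to implement softmax.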
