Object Detection Using CNN

  • July 12, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An alumnus of IIT and ISB with more than 18 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

Do you remember Waldo? A tall, thin man, last seen wearing a striped shirt and blue trousers. Finding him without going cross-eyed is difficult, because he vanishes into a sea of obstacles and other characters. But what if our lost friend could be found in a couple of milliseconds by a straightforward computer algorithm? That is the potential of object detection algorithms. This computer vision technique analyses the many features of a picture to determine what objects are present and where they are, so that the contents of an image can be understood accurately and meaningfully. Computer vision (CV) is a technology that gives machines the ability to "see" and comprehend the world. It is one of the most popular trends across business sectors, including medicine, industrial automation, surveillance, defence, and many more. The growing number of developers and researchers in the field may be attributed to new trends and technologies, as well as to greater processing power and optimised hardware. Object detection is one of computer vision's more challenging problems. Because it is used in sensitive areas (healthcare, traffic, autonomous cars, security, etc.), the capacity to recognise an object with high accuracy is crucial.

Figure 1 Where is Waldo? (Source: The great Picture hunt, Handford)

Object detection, as the name indicates, locates and recognises objects inside an image. A variety of strategies and methodologies address the problem. Deep learning techniques drove the progress of computer vision and let us achieve state-of-the-art results in object identification, feature extraction, and image recognition. Object detection has to account for a variety of factors in a picture, including an object's size, orientation, and lighting conditions. With it, we can both categorise the objects in an image and pinpoint where they are. Because computer vision involves several related tasks, it is important to understand the terminology and how the tasks differ. Image classification assigns a class label to a picture. Object localization draws a bounding box around one or more objects in a picture. Object detection combines the two: it encloses each object found in an image in a bounding box and assigns it a class label. The term "Object Recognition" is also used for these combined techniques.

Figure 2 Overview of Object Recognition Computer Vision Tasks (Source: Object Recognition with DL@Jason Brownlee)

Scientists and academics have proposed several methods to find and pinpoint the significant objects in an image. The most well-known and well-established deep learning technique among them is the Convolutional Neural Network, or CNN. Its convolutional and pooling layers extract the image features that object detection depends on. Over time, the CNN has become the backbone of object detection models.

Application of Deep Learning Algorithm in Object Detection

A neural network is inspired by the structure and function of the brain and the visual system. In basic terms, it consists of three kinds of layers: 1. an input layer, 2. hidden layers, where the weights are learned, and 3. an output layer. The most popular neural network in the field of computer vision is the convolutional network. The Convolutional Neural Network (CNN) architecture is designed to extract the features of an image through convolutional and pooling layers. The hidden layers include the convolutional layers: a filter is slid over the input to detect a specific pattern or feature, and the output is then fed into the next layer. Each successive convolutional layer represents its input in a different way. This transformation allows deep learning models to learn more complex functions to recognize objects. A CNN therefore builds a multi-level representation, moving from low-level features to progressively higher-level ones. Example: the first layers extract low-level features, such as edges and color. Later layers may respond to shapes and object parts, and the final layers may classify the object with a label (e.g., a car or a plane).
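To make the filter idea concrete, here is a minimal NumPy sketch of one convolution step followed by ReLU and max pooling. The toy image and the hand-written vertical-edge filter are assumptions chosen for illustration; in a real CNN the filter values are learned during training.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide a filter over a 2D image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the filter and the patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map: np.ndarray, size: int = 2) -> np.ndarray:
    """Keep the largest value in each non-overlapping size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/cols that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy 6x6 image: dark left half, bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-written vertical-edge filter (illustrative, not learned).
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

feature_map = conv2d(image, vertical_edge)  # responds where dark meets bright
activated = np.maximum(feature_map, 0.0)    # ReLU non-linearity
pooled = max_pool2d(activated)              # smaller map passed to the next layer
print(feature_map)
print(pooled)
```

The feature map is non-zero only around the column where the dark and bright halves meet, which is the sense in which a filter "detects" a pattern.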

Figure 3 CNN Architecture

The object detection deep learning algorithms can be explained in two parts. First, an ‘encoder’ takes the source image and runs it through a series of layers to learn and extract features. The features help locate and label each object. Second, a ‘decoder’, or regressor, takes the output of the encoder and predicts the location and size of each object, highlighting it with a bounding box. The output is the location of the object in the image (its X, Y coordinates) together with the size of its box. The challenge with a plain CNN is that a real-world image can contain multiple objects, belonging to multiple classes, at many different locations. Handling this would require scanning a large number of regions, at a huge cost in computation time. To address these challenges of accuracy and computation time, Region-based CNN (R-CNN), Fast R-CNN, and Faster R-CNN were developed.
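To get a feel for why scanning many regions is expensive, the short sketch below counts the windows a naive sliding-window scan would have to classify. The image size, window sizes, and stride are hypothetical values picked for illustration; the article does not prescribe a particular scanning scheme.

```python
def count_windows(img_w: int, img_h: int, win_w: int, win_h: int, stride: int) -> int:
    """Number of window positions when sliding a win_w x win_h box over the image."""
    if win_w > img_w or win_h > img_h:
        return 0
    positions_x = (img_w - win_w) // stride + 1
    positions_y = (img_h - win_h) // stride + 1
    return positions_x * positions_y

# Hypothetical setup: one 640x480 image, five window shapes, stride of 16 pixels.
image_size = (640, 480)
window_sizes = [(64, 64), (128, 128), (256, 256), (128, 256), (256, 128)]

total = sum(count_windows(*image_size, w, h, stride=16) for w, h in window_sizes)
print(total)  # every one of these windows would need its own CNN forward pass
```

Even this coarse grid yields thousands of candidate regions per image, and finer strides or more window shapes increase the count quickly.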

Object Detection Using RCNN

R-CNN, or Region-based Convolutional Neural Network, determines the position of the various objects in a picture. Rather than searching the whole image exhaustively, it looks for the particular regions of interest that are likely to contain objects. A selective search algorithm processes the input picture and produces roughly 2000 region proposals. To identify the object class, each proposed region is passed through a CNN, and the extracted features are then fed to a classifier.

Figure 4 Object Detection with RCNN (Source: Ross Girshick, 2015)

The process consists of the following steps:

  • Find Region Proposals or regions in the image that may contain an object.
  • Extract CNN features from the Regional Proposals.
  • Classify the objects using extracted features.
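The three steps above can be sketched as a small pipeline. This is only a control-flow illustration under loud assumptions: `propose_regions` returns random boxes as a stand-in for selective search, `extract_features` flattens pixels as a stand-in for a CNN, and the classifier is an untrained linear layer with softmax. All function names, class names, and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_regions(image, num_proposals=5):
    """Stand-in for selective search: random boxes as (x1, y1, x2, y2)."""
    h, w = image.shape[:2]
    x1 = rng.integers(0, w // 2, size=num_proposals)
    y1 = rng.integers(0, h // 2, size=num_proposals)
    bw = rng.integers(8, w // 2, size=num_proposals)
    bh = rng.integers(8, h // 2, size=num_proposals)
    return np.stack([x1, y1, x1 + bw, y1 + bh], axis=1)

def warp(image, box, size=16):
    """Crop a region and resize it to size x size (nearest-neighbour sampling)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]

def extract_features(patch):
    """Stand-in for a CNN: flatten the warped patch into a feature vector."""
    return patch.reshape(-1)

def classify(features, weights):
    """Linear scores followed by softmax (untrained, illustrative only)."""
    scores = weights @ features
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

image = rng.random((64, 64, 3))
class_names = ["background", "car", "plane"]
weights = rng.normal(scale=0.01, size=(len(class_names), 16 * 16 * 3))

detections = []
for box in propose_regions(image):              # step 1: region proposals
    features = extract_features(warp(image, box))  # step 2: features per region
    probs = classify(features, weights)         # step 3: classify each region
    label = class_names[int(probs.argmax())]
    detections.append((tuple(int(v) for v in box), label, float(probs.max())))

for detection in detections:
    print(detection)
```

Note that the feature extractor runs once per proposal. With roughly 2000 proposals per image, that loop is where the computational cost of R-CNN comes from.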

The R-CNN method results in a more accurate, flexible model that can handle several regions, or bounding boxes, per image. This advantage comes at the cost of computational efficiency, because the CNN runs separately on every region proposal. To overcome the computational challenge, Fast R-CNN was developed.

Object Detection Using Fast R-CNN

R-CNN and Fast R-CNN operate similarly: in both, region proposals are generated by an algorithm. However, Fast R-CNN processes the entire source picture, whereas R-CNN crops and resizes each region proposal from the original image. The CNN creates a convolutional feature map from the input picture, and the proposed regions are extracted from that feature map. Before being sent to the fully connected network, all proposed regions are reshaped to a fixed size by the RoI (Region of Interest) Pooling layer.

Figure 5 Fast R-CNN Architecture (Source: Fast-RCNN RoI, Ross Girshick)

The procedure in Fast R-CNN contains the following steps:

  • The input image is passed directly to the CNN (ConvNet), which produces a convolutional feature map.
  • Regions of Interest, proposed by an algorithm such as selective search, are located on the feature map.
  • An RoI Pooling layer is applied to the extracted regions of interest to ensure all regions are one size.
  • Each region is passed to the fully connected network, where a ‘Softmax’ layer outputs the class probabilities while a linear regressor simultaneously predicts the bounding box coordinates for the identified class.
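The RoI Pooling step in the list above can be illustrated with a minimal NumPy sketch. It assumes a single-channel feature map, RoIs given in feature-map coordinates as (x1, y1, x2, y2) with exclusive end coordinates, and RoIs at least as large as the output grid. The function name and the 2x2 output size are illustrative choices, not taken from any library.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a 2D feature map into a fixed-size grid.

    Assumes the RoI is at least as large as output_size in both directions.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into out_h x out_w roughly equal bins.
    row_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    col_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.zeros(output_size, dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r0, r1 = row_edges[i], max(row_edges[i + 1], row_edges[i] + 1)
            c0, c1 = col_edges[j], max(col_edges[j + 1], col_edges[j] + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max()
    return pooled

feature_map = np.arange(64).reshape(8, 8)

# Two RoIs of different sizes both come out as 2x2 grids.
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4))   # 4x4 region
pooled_b = roi_max_pool(feature_map, (2, 1, 8, 6))   # 5 rows x 6 columns
print(pooled_a)
print(pooled_b)
```

Because every region ends up with the same shape, the fully connected layers that follow can process proposals of any size.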

Fast R-CNN is more efficient than R-CNN at extracting features, performing classification, and generating bounding boxes, because it passes each picture through the CNN once rather than passing roughly 2000 regions per picture. Even Fast R-CNN, however, has problems with computation time. Another object detection technique, Faster R-CNN, was created to further improve detection speed in the real world, where datasets may reach enormous volumes.

Object Detection Using Faster R-CNN

The Faster R-CNN algorithm uses a Region Proposal Network (RPN) to generate proposals instead of Edge Boxes or a selective search algorithm. The RPN relies on anchor boxes. Anchor boxes are predefined bounding boxes of a certain height and width, designed to speed up the extraction of region proposals. They are chosen to capture the scale and aspect ratio of the object classes we want to detect. Example: in the illustration below, two anchor boxes are predefined with specific heights and widths to detect an object's class, i.e. an airplane or a sailboat.

Figure 6 Object Detection Generation through Anchor Box (Source:
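As a rough sketch of how anchor boxes of predefined scale and aspect ratio can be tiled across an image, the NumPy snippet below places one set of anchors at the centre of every feature-map cell. The stride, scales, and ratios are hypothetical values, and `make_anchors` is an illustrative helper rather than a function from any framework.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as rows of (x1, y1, x2, y2) in image coordinates.

    One anchor per (scale, ratio) pair is centred on every feature-map cell.
    ratio is height / width; each anchor has area scale * scale.
    """
    shapes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            shapes.append((w, h))
    half = np.array(shapes) / 2.0                      # (A, 2) half widths/heights

    # Centre of each feature-map cell, mapped back to image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (H*W, 2)

    top_left = centers[:, None, :] - half[None, :, :]
    bottom_right = centers[:, None, :] + half[None, :, :]
    return np.concatenate([top_left, bottom_right], axis=2).reshape(-1, 4)

# Hypothetical 2x3 feature map with stride 16, two scales, three aspect ratios.
anchors = make_anchors(2, 3, stride=16, scales=(32, 64), ratios=(0.5, 1.0, 2.0))
print(anchors.shape)   # one row per anchor
print(anchors[:3])     # the three aspect ratios at the first cell, smallest scale
```

A region proposal network then only has to score each anchor and predict small offsets to it, instead of searching for boxes from scratch.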

These predefined anchor boxes are tiled across the picture, which lets the network recognise numerous objects, including overlapping objects and objects of various sizes. A metric called Intersection over Union (IoU) is used to gauge an object detector's accuracy. It compares the bounding box that the model predicts with the ground truth bounding box, the manually labelled box that describes precisely where the object is in the image. IoU is the area of overlap between the two boxes divided by the area of their union, as shown in the picture below. IoU values greater than 0.5 are commonly regarded as good predictions.

Computing Intersection over Union (Source: IoU for Object Detection,
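Intersection over Union is the area of overlap between two boxes divided by the area of their union. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the example boxes are made-up values:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

ground_truth = (50, 50, 150, 150)   # hand-labelled box (illustrative)
prediction = (60, 60, 160, 160)     # model output (illustrative)

score = iou(ground_truth, prediction)
print(round(score, 3), "good" if score > 0.5 else "poor")
```

Here the overlap is 90 x 90 = 8100 pixels and the union is 10000 + 10000 - 8100 = 11900 pixels, so the IoU is about 0.68 and the prediction clears the 0.5 threshold.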

An object detection method that uses anchor boxes can process an entire image at once, which makes detection faster than its predecessors and brings real-time object detection within reach. However, the algorithm still works in stages, first proposing regions and then classifying them, and this creates a performance challenge. To overcome such issues, modern object detection algorithms such as YOLO were introduced.

Figure 7 Faster R-CNN Architecture (Source: machinelearningmastery.com)

Object Detection Using Mask R-CNN

Mask R-CNN is another object detection system, built on Faster R-CNN. It segments the detected objects at the pixel level, an approach known as object instance segmentation, in which a segmentation map is created for each instance of an object. For each object that is recognised, Mask R-CNN returns a segmentation mask in addition to the class label and bounding box. A pixel-level mask gives a far more detailed understanding of each object in a picture than a bounding box alone.

Figure 8 Mask R-CNN framework (Source:, How Mask R-CNN works)

Mask R-CNN was developed to solve the problem of instance segmentation. By masking objects, it can separate the different objects in an image. The process can be described in two stages, as illustrated in the image. First, it generates region proposals that may contain an object. Second, it predicts the object class, refines the bounding box, and generates a pixel-level mask of the object. An RoIAlign technique helps preserve spatial information when locating the relevant areas of the feature map. The output is a segmentation mask for each region that contains an object.
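The idea behind RoI Align, keeping fractional coordinates instead of rounding them to whole feature-map cells, can be sketched with bilinear interpolation. This is a simplified illustration: it takes a single sample at the centre of each output bin, assumes a single-channel feature map, and treats integer coordinates as cell centres. Both function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Read a 2D feature map at a fractional (y, x) position."""
    h, w = feature_map.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Blend the four surrounding cells by their distance to (y, x).
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feature_map, roi, output_size=(2, 2)):
    """Simplified RoI Align: one bilinear sample at the centre of each bin.

    roi is (x1, y1, x2, y2) in continuous feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    out = np.zeros(output_size, dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            center_y = y1 + (i + 0.5) * bin_h
            center_x = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear_sample(feature_map, center_y, center_x)
    return out

feature_map = np.arange(16, dtype=float).reshape(4, 4)

# The RoI corners are fractional; nothing is rounded to the cell grid.
aligned = roi_align(feature_map, (0.6, 0.6, 2.6, 2.6))
print(aligned)
```

Because the sample positions are never snapped to the grid, small shifts of the box produce small changes in the pooled values, which matters when the output is a pixel-level mask.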


Object Detection Using YOLO

The YOLO model is a popular alternative to the R-CNN algorithms for object detection. YOLO stands for "You Only Look Once". YOLO approaches are considered to perform object detection in real time, significantly faster than R-CNN models. The R-CNN models use regions to pinpoint an object's location inside an image: the neural network processes only the parts of an image that have a high likelihood of containing an object. YOLO, by contrast, looks at the whole image in a single pass.

Figure 9 YOLO Object Detection Algorithm (Source: You Only Look Once: Unified, Real-Time Object Detection)

The input picture is divided into grid cells. Each grid cell predicts bounding boxes for objects whose centre falls inside it. For each box, the cell predicts the X, Y coordinates, the width and height, and a confidence that an object is present, together with the class probabilities. As shown in Figure 9, the bounding boxes and their confidences are merged with the class probability map to get the final detections. The YOLO method has spatial restrictions, so it might not reliably find small objects in a picture. To increase the model's precision and speed, it has since evolved into many variants.

This article gives a simplified overview of the design of several well-known object detection techniques. CNN models can be pre-trained on large amounts of data and then fine-tuned for the practical purpose of object detection. Further research into the later YOLO iterations, as well as the many other methods, publications, and tutorials, can lead to further advancements in the field of object detection. The deep dive into object detection doesn't end with this article.
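The YOLO grid-cell scheme can be sketched in a few lines. The snippet assumes the box parameterisation of the original YOLO formulation as I understand it (centre offsets relative to the grid cell, width and height relative to the whole image) and a 7x7 grid; the image size, confidences, and class probabilities are made-up numbers, and the helper names are hypothetical.

```python
import numpy as np

GRID = 7  # grid cells per side; an assumed value for illustration

def responsible_cell(cx, cy, img_w, img_h, grid=GRID):
    """Grid cell (row, col) that contains an object's centre point."""
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col

def decode_box(row, col, x_off, y_off, w_rel, h_rel, img_w, img_h, grid=GRID):
    """Turn a cell's prediction into (x1, y1, x2, y2) pixel coordinates.

    x_off, y_off: centre offset inside the cell, in [0, 1].
    w_rel, h_rel: box size as a fraction of the whole image.
    """
    cx = (col + x_off) / grid * img_w
    cy = (row + y_off) / grid * img_h
    w, h = w_rel * img_w, h_rel * img_h
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def best_detection(box_confidences, class_probs):
    """Merge box confidences with a cell's class probabilities.

    Returns (box index, class index, score) for the highest combined score.
    """
    scores = box_confidences[:, None] * class_probs[None, :]
    box_idx, class_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return int(box_idx), int(class_idx), float(scores[box_idx, class_idx])

# An object centred at (300, 200) in a 448x448 image.
cell = responsible_cell(300, 200, 448, 448)
box = decode_box(cell[0], cell[1], 0.6875, 0.125, 0.5, 0.25, 448, 448)

# Two candidate boxes in that cell and three classes (illustrative numbers).
detection = best_detection(np.array([0.9, 0.2]), np.array([0.1, 0.8, 0.1]))
print(cell, box, detection)
```

Multiplying each box's confidence by the cell's class probabilities gives one score per box and class, which is how the bounding boxes and the class probability map are combined into final detections.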
