Object Detection with Auto Annotation

July 07, 2025

Meet the Author : Gaurang Ingle

I am a passionate data science and AI enthusiast with a strong focus on implementing machine learning techniques to solve real-world problems. I thrive on automating tasks and optimizing processes. My proactive approach, along with a dedication to automation, enables me to drive meaningful progress on projects.

Traditional methods of object detection have long relied on meticulous manual annotation, a process where humans diligently label objects in images. While effective, this method is time-consuming, labor-intensive, and can be expensive. This blog explores an alternative paradigm that is transforming the landscape of object detection – one that liberates us from the constraints of manual annotation.

Exploring the Object Detection Pipeline: Unveiling the Steps

With the motivation for auto annotation in place, let's walk through the object detection pipeline. It is a structured, end-to-end process that begins with data collection and continues through deployment.

Step 1: Data Collection

• The foundation of any successful object detection model lies in the quality and diversity of the data it is trained on. We explore the details of gathering, selecting, and arranging datasets crucial for a strong detection system.

• Critical elements for a robust dataset:

Images per class - It is recommended to use at least 1500 images in each class.

Instances per class - The recommended number of instances (labeled objects) per class is ≥ 10,000.

Image variety - The dataset must reflect the environment the model will be deployed in. In practical applications, we suggest using photos from various sources (scraped from the internet, gathered locally, or taken with different cameras), captured at different times of day, in different seasons, and in different weather conditions.

Background images - Background images are photos that contain no objects; they are added to a dataset to reduce False Positives (FP). We recommend about 0-10% background images to help reduce FPs (COCO contains 1000 background images, which is 1% of the total). Background images need no labels.
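The recommendations above can be checked before training with a short script. The following is a minimal sketch, assuming the labels are already available as one list of class names per image (an empty list marks a background image); the function names and the input layout are hypothetical, and only the thresholds come from the text.

```python
from collections import Counter

# Thresholds taken from the recommendations above.
MIN_IMAGES_PER_CLASS = 1500
MIN_INSTANCES_PER_CLASS = 10_000
MAX_BACKGROUND_FRACTION = 0.10


def dataset_report(labels_per_image):
    """Summarise a dataset given one list of class names per image.

    An empty list means the image is a background image (no objects).
    """
    images_per_class = Counter()
    instances_per_class = Counter()
    background_images = 0

    for labels in labels_per_image:
        if not labels:
            background_images += 1
            continue
        instances_per_class.update(labels)      # every labeled object counts
        images_per_class.update(set(labels))    # each image counts once per class

    total = len(labels_per_image)
    return {
        "images_per_class": dict(images_per_class),
        "instances_per_class": dict(instances_per_class),
        "background_fraction": background_images / total if total else 0.0,
    }


def check_recommendations(report):
    """Return readable warnings for anything outside the suggested ranges."""
    warnings = []
    for name, count in report["images_per_class"].items():
        if count < MIN_IMAGES_PER_CLASS:
            warnings.append(f"{name}: only {count} images")
    for name, count in report["instances_per_class"].items():
        if count < MIN_INSTANCES_PER_CLASS:
            warnings.append(f"{name}: only {count} instances")
    if report["background_fraction"] > MAX_BACKGROUND_FRACTION:
        warnings.append("more than 10% background images")
    return warnings
```

Calling `check_recommendations(dataset_report(labels))` on a small pilot dataset will list every class that still needs more images or instances.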

Step 2: Auto Annotation - Crucial Aspect for Automation

In the above picture, you can witness various annotation techniques in action. To extend the scope of object detection, I'll be utilizing Instance Segmentation.

Grounding DINO serves as the base model. It specializes in zero-shot detection, meaning it can identify objects it did not encounter during training. This vision-language object detection system takes an image and a text prompt and returns bounding box coordinates for the objects named in the prompt. It may be too slow for real-time detection, but because it outputs bounding boxes for arbitrary prompts it is a good fit for auto annotation, where it serves as the first step of auto instance segmentation annotation.

Grounding DINO docs: https://github.com/IDEA-Research/GroundingDINO

To achieve instance segmentation, you can harness the power of the Segment Anything Model (SAM) from Meta AI. SAM excels in producing high-quality object masks from various input prompts, including bounding boxes. With SAM's promptable design, it seamlessly takes bounding boxes, generated by Grounding DINO, as input prompts. This unique capability allows SAM to generate masks precisely within those bounding boxes, making it an invaluable component for achieving instance segmentation in your workflow.

SAM Docs: https://github.com/facebookresearch/segment-anything

To explore auto instance annotation using Grounding DINO and SAM, open the Google Colab notebook generously provided by Roboflow and written by Piotr Skalski: [Google Colab Notebook].

The steps below walk through the complete auto annotation process as demonstrated in that notebook.

1. Install the necessary libraries and weights; in this context, weights denote models like Grounding DINO and SAM. Make sure you have the essential models in place to proceed with the Auto Instance Annotation process.

2. Next, load these models into your environment to kickstart the annotation workflow.

3. Now, upload your dataset to Google Colab. If your dataset is in Google Drive, proceed to mount your Google Drive on Google Colab to access your dataset. Ensure seamless integration for a smooth Auto Instance Annotation process.

4. Next, run Grounding DINO (the base model) on a single test image with a text prompt that specifies what to detect. This step checks that the chosen prompt works. For Indian coins, I first tried the prompts '1Rs', '2Rs', '5Rs', and '10Rs' in a Python list, but the results were unsatisfactory. A more general prompt, 'coin', detected the coins well but created a new problem: every detection carried the class 'coin', so the denominations '1Rs', '2Rs', '5Rs', and '10Rs' could not be told apart. To overcome this, I structured the dataset so that each image contains multiple instances of one class only, for example seven '10Rs' coins.

Following the extraction of bounding box coordinates, you can automate the process of changing the class name to '10Rs' using Python code. However, I chose a more straightforward method: uploading images and bounding boxes with the default class name 'coin' to Roboflow and adjusting the class name from 'coin' to '10Rs' for all '10Rs' images in the Roboflow UI, as we'll discuss in the following steps.

5. To accomplish instance segmentation, we now feed the bounding box coordinates obtained from Grounding DINO to SAM as a prompt, as discussed earlier. SAM then generates a mask for the Region of Interest (ROI) inside each bounding box; the masks become polygon coordinates when the annotations are saved in Pascal VOC XML format.

Note: At this stage we are verifying the results on a single image to assess each step. Only afterwards do we automate the process with a Python loop that takes every image in the dataset from bounding box annotation to instance segmentation (from Grounding DINO to SAM).

6. Now that we have verified our results, we can automate the process for all images in a loop. Each image first passes through Grounding DINO to obtain bounding boxes; SAM then generates masks (polygon coordinates) from those boxes, so all 10Rs images are annotated in one run.

7. To prepare our annotations for upload to Roboflow, we save them with the 'dataset save' functionality introduced in supervision 0.6.0. Through sv.Dataset, we can convert our annotations into Pascal VOC XML format.

The code snippet provided showcases the implementation, where the annotations are organized in the specified directory path, and additional parameters such as minimum and maximum image area percentages are configured for optimal results. This ensures our annotations are well-prepared and ready for seamless integration into the Roboflow platform.

8. With both images and their Pascal VOC XML annotation files ready, the next step is to upload them to Roboflow. Log in using the Roboflow Command Line Interface (CLI), then either create a new project or upload into an existing one.

9. Now that you have created a new project and successfully logged into the Roboflow Editor, you can commence the process of uploading all your 10Rs images along with their corresponding annotation files.

10. In the Roboflow Editor, you can access your newly created project and proceed to change the class name from 'coin' to '10Rs'.

11. Repeat the process for the remaining coins until the 1Rs, 2Rs, and 5Rs images are annotated as well.
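Two of the steps above lend themselves to a short script: the area limits configured in step 7 and the class rename mentioned in step 4. The sketch below is an illustration using only the Python standard library, not the notebook's code. All function names are hypothetical, the area filter works on plain bounding boxes for simplicity, and the XML layout assumed is the usual Pascal VOC one with an `<object>` element containing a `<name>` element.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def area_fraction(box, image_width, image_height):
    """Share of the image covered by an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min) / (image_width * image_height)


def filter_by_area(boxes, image_width, image_height, min_fraction, max_fraction):
    """Drop boxes that are implausibly small or large relative to the image."""
    return [
        box for box in boxes
        if min_fraction <= area_fraction(box, image_width, image_height) <= max_fraction
    ]


def rename_class(xml_text, old_name, new_name):
    """Return Pascal VOC XML in which objects named old_name are renamed."""
    root = ET.fromstring(xml_text)
    for obj in root.findall("object"):
        name_tag = obj.find("name")
        if name_tag is not None and name_tag.text == old_name:
            name_tag.text = new_name
    return ET.tostring(root, encoding="unicode")


def rename_class_in_folder(folder, old_name, new_name):
    """Rewrite every .xml file in folder in place; return how many were processed."""
    xml_paths = sorted(Path(folder).glob("*.xml"))
    for xml_path in xml_paths:
        text = xml_path.read_text(encoding="utf-8")
        xml_path.write_text(rename_class(text, old_name, new_name), encoding="utf-8")
    return len(xml_paths)
```

For a folder that holds only 10Rs annotations, `rename_class_in_folder("annotations", "coin", "10Rs")` would do the same job as the rename in the Roboflow UI; the folder name here is a placeholder.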

Step 2.1: Data Pre-Processing (In Roboflow)

• Before delving into model training, the crucial step of data pre-processing ensures that the dataset is refined and well-organized. This involves tasks like cleaning, resizing, and normalizing the images. Additionally, performing the train-test-validation split helps in evaluating the model's performance accurately. Notably, tools like Roboflow's annotation tool streamline this process seamlessly, providing an automated solution without the need for extensive Python coding.
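Roboflow performs the split in its UI, but for readers who prefer to script it, a reproducible split can be sketched with the standard library. The 70/20/10 proportions, the fixed seed, and the function name below are assumptions chosen for illustration.

```python
import random


def split_dataset(items, train=0.7, valid=0.2, seed=42):
    """Shuffle items reproducibly and split them into train/valid/test lists.

    Whatever is left after the train and valid shares goes to test.
    """
    if not 0 < train + valid <= 1:
        raise ValueError("train + valid must be in (0, 1]")
    shuffled = list(items)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)       # fixed seed gives the same split every run
    n_train = int(len(shuffled) * train)
    n_valid = int(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

Splitting by image file name, as in `split_dataset(image_names)`, keeps each image and its annotation file in the same subset as long as both share a stem.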

Step 2.2: Augmentation

• Augmentation techniques such as flipping, rotation, and scaling increase dataset diversity and annotation robustness. They reduce the risk of overfitting during training by exposing the model to a wider range of scenarios, which improves its ability to generalize to unseen data. There is a trade-off, however: excessive augmentation may introduce noise or distortions.

• Here, we will use Roboflow's inbuilt Augmentation option.
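Whichever tool applies the augmentation, geometric transforms have to be applied to the annotations as well as to the pixels. As a minimal illustration (this is not Roboflow's implementation, and the function names are hypothetical), a horizontal flip maps box and polygon coordinates as follows, assuming pixel coordinates with the origin at the top-left corner.

```python
def hflip_box(box, image_width):
    """Mirror an (x_min, y_min, x_max, y_max) box around the vertical axis."""
    x_min, y_min, x_max, y_max = box
    # The old right edge becomes the new left edge, and vice versa.
    return (image_width - x_max, y_min, image_width - x_min, y_max)


def hflip_polygon(points, image_width):
    """Mirror a list of (x, y) polygon points around the vertical axis."""
    return [(image_width - x, y) for x, y in points]
```

Flipping twice returns the original coordinates, which is a convenient sanity check when writing other transforms.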

Step 3: Model Training

• In the heart of the object detection journey lies model training, a pivotal step where Ultralytics YOLOv8 takes center stage. Leveraging state-of-the-art architecture, YOLOv8 optimizes the learning process by efficiently identifying and refining patterns within the dataset. Its robust capabilities, combined with user-friendly features, make it a formidable ally in achieving accurate and efficient object detection.

• Google Colab Notebook [click here]: credits https://github.com/roboflow/notebooks.

• Copy the code snippet and paste it into the notebook provided; it downloads your data from Roboflow Universe to train the model.

• You can select an appropriate model based on your requirements, such as YOLOv8n-seg (i.e., the nano model) for real-time object detection.

• Start Training your Model:

• Model Results are stored in the default directory below.

• Result on our Validation data.

• Here you can find the final model (i.e. best.pt).

• Great job! You've successfully trained your custom model.

• Here’s how to test it with a sample image.

• Result
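The training and testing steps above can be sketched with the Ultralytics Python API. This is a hedged outline rather than the notebook's exact code: the `data.yaml` path, the epoch count, the image size, and both function names are assumptions, and the `ultralytics` package must be installed for `train_and_test` to run.

```python
from pathlib import Path


def train_kwargs(data_yaml, epochs=25, imgsz=640):
    """Keyword arguments for model.train(); the values are placeholders."""
    return {"data": str(data_yaml), "epochs": epochs, "imgsz": imgsz}


def train_and_test(data_yaml, sample_image, weights="yolov8n-seg.pt"):
    """Train a YOLOv8 segmentation model, then run best.pt on one image."""
    # Imported here so the helper above works without the package installed.
    from ultralytics import YOLO  # pip install ultralytics

    model = YOLO(weights)                       # pretrained nano segmentation checkpoint
    model.train(**train_kwargs(data_yaml))      # writes a run directory with weights and plots
    best = Path(model.trainer.save_dir) / "weights" / "best.pt"

    trained = YOLO(str(best))                   # reload the best checkpoint
    # conf=0.25 is an illustrative confidence threshold; save=True writes the annotated image.
    return trained.predict(source=str(sample_image), conf=0.25, save=True)
```

A call such as `train_and_test("dataset/data.yaml", "sample.jpg")` would train and then return the prediction results for the sample image; both paths are placeholders.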

Step 4: Post-Training Challenges

• No journey is without obstacles. Here we discuss common issues encountered after model training, such as:

- Overlapping Detections:

• Imagine your model sometimes reports one object as two or more overlapping detections. To fix this, we use Non-Maximum Suppression (NMS), a post-processing technique that makes the model keep its best guess when detections overlap. We adjust the IoU (Intersection over Union) threshold, which decides how much overlap counts as the same object. It's like telling your model, "If boxes overlap too much, keep the most confident one and ignore the others." This keeps the model from reporting duplicates for objects that are close together or on top of each other.

- False Positives:

• To tackle false positives, start by re-evaluating your training data for diversity and representation. Fine-tuning the model with additional, challenging examples helps it distinguish real objects from misleading patterns. The confidence threshold used during inference matters too: experiment with different values to balance being cautious (a higher threshold, which gives fewer false positives but more missed objects) against being inclusive (a lower threshold).
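Both remedies above, the confidence threshold and NMS, fit in a few lines of plain Python. The sketch below is a simplified, class-agnostic version for illustration; detection libraries ship their own optimized implementations, and the function names and default thresholds here are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(detections, conf_threshold=0.25, iou_threshold=0.7):
    """Filter (box, score) pairs by confidence, then apply greedy NMS.

    Returns the kept detections, most confident first.
    """
    # Step 1: discard low-confidence detections (fewer false positives).
    candidates = sorted(
        (d for d in detections if d[1] >= conf_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    # Step 2: keep a box only if it does not overlap a better box too much.
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```

Lowering `iou_threshold` suppresses more duplicates but can also merge genuinely separate objects that sit close together, so it is worth checking the effect on a few crowded images.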

Step 5: Strategies for Model Optimization

1. Increase Dataset Size:

Explanation: Provide more examples for the model to learn from.

Example: If you're training a model to recognize cats and dogs, show it pictures of various breeds, sizes, and colors to make it more adaptable.

2. Improve Labeling Accuracy:

Explanation: Ensure the labels precisely outline the objects.

Example: If you're marking a cat, make sure the box covers the cat entirely without any gaps or extra space.

3. Increase Model Complexity:

Explanation: Use a more advanced model with more parameters for better understanding.

Example: Instead of using a basic model, try a larger YOLOv8 model like yolov8m, yolov8l and yolov8x for enhanced performance.

4. Experiment with Hyperparameters:

Explanation: Adjust settings like learning rate, weight decay, and augmentation to find the best values for your data.

Example: Try making your model learn faster or slower (learning rate), or change how strongly large weights are penalized to limit overfitting (weight decay).

5. Consider Post-Processing Techniques:

Explanation: Apply additional steps after training to refine results and boost confidence.

Example: Use non-maximum suppression (NMS) with a lower overlap threshold (iou) so that duplicate detections of the same object are suppressed more aggressively.

6. Train with Default Settings First:

Explanation: Establish a baseline performance before making major changes.

Example: Train your model initially with default settings to understand its behavior and performance.

7. Experiment with Epochs:

Explanation: Adjust the number of training cycles to prevent overfitting.

Example: If your model is learning too much from the training data and not doing well on new data, try training for fewer cycles (epochs).

8. Consider Image Size and Batch Size:

Explanation: Modify the resolution of images and batch size for optimal results.

Example: Train with the largest image size and use the largest batch size supported by your hardware for better performance.

9. Fine-Tune Hyperparameters:

Explanation: Experiment with hyperparameter settings for improved training.

Example: Adjust augmentation hyperparameters to delay overfitting and modify loss component gain hyperparameters for specific loss components.

Remember, experimentation and iteration are key. Make gradual adjustments, evaluate the impact on performance, and iterate based on the observed results.
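One way to keep adjustments gradual is to start from a baseline and change a single setting per run, so any change in performance can be attributed to that setting. The sketch below only generates the run configurations; the baseline values and the function name are assumptions for illustration, and the keys follow the names Ultralytics uses for its training arguments.

```python
# Illustrative baseline; replace with the settings of your own first run.
BASELINE = {"epochs": 100, "imgsz": 640, "batch": 16, "lr0": 0.01, "weight_decay": 0.0005}


def one_change_at_a_time(baseline, candidates):
    """Yield (label, config) pairs that each differ from baseline in one setting."""
    yield "baseline", dict(baseline)
    for key, values in candidates.items():
        for value in values:
            if value == baseline[key]:
                continue                      # identical to the baseline run, skip it
            config = dict(baseline)           # copy so the baseline stays untouched
            config[key] = value
            yield f"{key}={value}", config
```

For example, `list(one_change_at_a_time(BASELINE, {"lr0": [0.001], "imgsz": [1280]}))` produces three runs: the baseline, one with a smaller learning rate, and one with larger images.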

Step 6: Deployment

For deploying your model, you have the flexibility to choose platforms and frameworks according to your needs. However, for demonstration, I recommend using Streamlit. The code for integrating YOLOv8 with Streamlit Web Real-Time Connection (WebRTC) can be found in my deployment Git repository [provide the link]

Here, replace the default model in the mentioned Git repository with the model you built.

Conclusion

In conclusion, by embracing automated annotation and leveraging advanced models, we can revolutionize object detection, making it more efficient and cost-effective. We appreciate your interest in this cutting-edge method, and we hope it will lead to new opportunities for your object detection research.