Introduction

Recently I took an AI course whose main content covered the following topics:

  1. Learn about the COCO dataset
  2. Use a pre-trained version of Faster R-CNN to predict bounding boxes
  3. Calculate IoU

Homework Requirement

  1. Download the COCO dataset: download the files “2017 Val images [5K/1GB]” and “2017 Train/Val annotations [241MB]” from the COCO page.
    You can load them into your notebook using the pycocotools library.
  2. Randomly select ten images from the dataset: 10 images are randomly selected from this dataset.
  3. Predict boxes using the pre-trained Faster R-CNN model: use a pre-trained version of Faster R-CNN (ResNet50 backbone) to predict the bounding boxes of the objects on the 10 images. Only regions with scores greater than 0.8 are retained.
  4. Visualize the predictions together with the answer: visualize the predicted bounding boxes and labels together with the ground truth bounding
    boxes and labels. Show all 10 pairs of images side by side in the Jupyter notebook.
  5. Use another pre-trained model, MobileNet: repeat the above steps using the MobileNet backbone of Faster R-CNN.
  6. Calculate IoU and compare models: which backbone provides better results? Calculate the IoU for both methods.

Task 1: Downloading the COCO Dataset

Task 1

  1. Download the COCO Dataset: Obtain the files “2017 Val images [5K/1GB]” and “2017 Train/Val annotations [241MB]” from the COCO page. Utilize the pycocotools library to import them into your notebook.

You can follow this guide to proceed with the download: Download COCO Dataset

.
├── annotations # These are annotation files
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
└── val2017 # This is the image set
    ├── 000000000139.jpg
    ├── 000000000285.jpg
    ├── 000000000632.jpg
    ├── 000000000724.jpg
    ├── 000000000776.jpg
    ├── 000000000785.jpg
    ├── 000000000802.jpg
    ...
  • Download these two files as shown in the image.
  • After downloading, the folder structure upon extraction will resemble the one above.
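
If you prefer to script the download instead of clicking through the page, a minimal sketch along these lines should work (the zip URLs are the ones linked from the COCO download page; the target folder ../../Data/Coco/ is only an assumption chosen to match the structure shown above):

import os
import urllib.request
import zipfile

cocoRoot = "../../Data/Coco/"  # assumed target folder; adjust to your own layout
os.makedirs(cocoRoot, exist_ok=True)

# Download links from the COCO page: val2017 images + train/val annotations
urls = [
    "http://images.cocodataset.org/zips/val2017.zip",
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip",
]

for url in urls:
    zip_path = os.path.join(cocoRoot, os.path.basename(url))
    print(f"Downloading {url} ...")
    urllib.request.urlretrieve(url, zip_path)
    # Extract into cocoRoot so the annotations/ and val2017/ folders end up as shown above
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(cocoRoot)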

Task 2: Randomly Select Ten Images

Task 2
2. Randomly Select Ten Images from the Dataset: Pick 10 images randomly from this dataset.

Here, we’ll primarily do a few things:

  • Import necessary libraries.
  • Set up the COCO API so it can access relevant information from our dataset, such as bounding box positions, labels, and image information.
  • Visualize images together with their annotations.
  • Randomly select ten images.

Let’s begin by importing the necessary libraries.


# CNN
import torch.nn.functional as F
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torch.backends.cudnn as cudnn

# others
import numpy as np
import matplotlib.pyplot as plt
import time
import os
from PIL import Image
from tempfile import TemporaryDirectory
import random

# torchvision
import torchvision
import torchvision.transforms as transforms

# dataset
from pycocotools.coco import COCO
import cv2


cudnn.benchmark = True
plt.ion() # interactive mode

Setting up the COCO API

COCO provides an API to access datasets. By providing it with a JSON file, we can easily retrieve the necessary information such as images, labels, bounding boxes, and more.

# Specify dataset location
cocoRoot = "../../Data/Coco/"
dataType = "val2017"

# Set annotation file location
annFile = os.path.join(cocoRoot, f'annotations/instances_{dataType}.json')
print(f'Annotation file: {annFile}')
# Initialize the COCO API for instance annotations
coco = COCO(annFile)
coco

Result

Annotation file: ../../Data/Coco/annotations/instances_val2017.json
-- Indicates successful annotation file read
loading annotations into memory...
Done (t=0.35s)
creating index...
index created!
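
With the API initialized, we can quickly sanity-check it by listing some category names (a small illustration of what pycocotools exposes, not part of the assignment):

# Quick sanity check: list a few of the 80 COCO object categories
cat_ids = coco.getCatIds()
cats = coco.loadCats(cat_ids)
print(len(cats))                      # 80
print([c["name"] for c in cats[:5]])  # ['person', 'bicycle', 'car', 'motorcycle', 'airplane']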

Annotation Visualization

To ensure familiarity with the COCO-provided API, here’s an exercise focusing on the following:

  • Obtaining image info by ID
  • Retrieving annotation info by ID
  • Learning to draw bounding boxes and labels on images

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Create a function that, given an image ID, plots the image with bounding boxes and labels
def plot_image_with_annotations(coco, cocoRoot, dataType, imgId, ax=None):
    # Get image information
    imgInfo = coco.loadImgs(imgId)[0]
    # Get the image location for visualization
    imPath = os.path.join(cocoRoot, dataType, imgInfo['file_name'])
    # Read the image
    im = cv2.imread(imPath)
    # Convert color space: OpenCV defaults to BGR, but matplotlib expects RGB, so conversion is needed
    im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)

    # Find all annotations (bounding boxes) for the image
    annIds = coco.getAnnIds(imgIds=imgInfo['id'])
    # Load all annotation information: bounding box coordinates, labels, etc.
    anns = coco.loadAnns(annIds)
    all_labels = set()

    # Extract bounding box coordinates and labels
    for ann in anns:
        # Select the bounding box information: COCO returns (x, y) of the top-left corner, width, height
        x, y, w, h = ann['bbox']

        # Get the label text: load the category name by category ID
        label = coco.loadCats(ann['category_id'])[0]["name"]
        all_labels.add(label)

        # Draw the bounding box using the provided coordinates
        rect = Rectangle((x, y), w, h, linewidth=2, edgecolor='r', facecolor='none')

        # Draw onto the figure: if the images need to be arranged in a grid, the ax parameter specifies the position
        if ax is None:
            plt.gca().add_patch(rect)
            plt.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor='r')
        else:
            ax.add_patch(rect)
            ax.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor='r')

    # Display the image with a title
    if ax is None:
        plt.imshow(im)
        plt.axis('off')
        plt.title(f'Annotations: {all_labels}', color='r')
        plt.show()
    else:
        ax.axis('off')
        ax.set_title(f'Annotations: {all_labels}', color='r', loc='center', pad=20)
        ax.imshow(im)


# Get the image at index 10
imgIds = coco.getImgIds()
imgId = imgIds[10]
# Plot the image with annotations
plot_image_with_annotations(coco, cocoRoot, dataType, imgId)

Result

Randomly Select Ten Images

def random_select(coco, cocoRoot, dataType, num_images=10):
    # Get the IDs of all images
    imgIds = coco.getImgIds()
    # Randomly select num_images IDs from this set
    selected_imgIds = random.sample(imgIds, num_images)
    # Call the plot_image_with_annotations function for each selected ID
    for imgId in selected_imgIds:
        # Plot each image based on its ID
        plot_image_with_annotations(coco, cocoRoot, dataType, imgId)

    # Return all selected IDs
    return selected_imgIds

valid_ids = random_select(coco, cocoRoot, dataType, num_images=10)
valid_ids

Result

Task 3+5: Faster R-CNN (ResNet50) vs. MobileNet

Task 3 & 5
3. Predict bboxes using the pre-trained Faster R-CNN model: Use a pre-trained version of Faster R-CNN (ResNet50 backbone) to predict the bounding boxes
of objects on the 10 images. Only keep regions that have a score > 0.8.
5. Use another pre-trained model, MobileNet: Repeat the steps above using a MobileNet backbone for Faster R-CNN.

Using the pre-trained models

# Use the pre-trained model (Faster R-CNN with ResNet50 backbone)
model_res = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="FasterRCNN_ResNet50_FPN_Weights.DEFAULT")
model_res.eval()

# Use the pre-trained model (Faster R-CNN with MobileNet backbone)
model_mobile = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(weights=torchvision.models.detection.FasterRCNN_MobileNet_V3_Large_FPN_Weights.DEFAULT)
model_mobile.eval()

Convert image to tensor

We need to load each image from disk based on its file path, then convert it into a tensor before we can feed it to the model for prediction. So we write two functions:

  • One reads the image from disk.
  • One converts the image to a tensor.

from PIL import Image

def load_image(imgIdx):
    # Get image information
    imgInfo = coco.loadImgs(imgIdx)[0]
    # Get the image location
    imPath = os.path.join(cocoRoot, dataType, imgInfo['file_name'])
    # Print the image path
    print(imPath)
    try:
        # Read the image
        return Image.open(imPath)
    except OSError as e:
        raise Exception(f"Failed to open image: {imPath}") from e

# Convert the picture to a tensor
def pil2tensor(pil_image):
    # Use unsqueeze(0) because the model expects a batch dimension, i.e. four dimensions in total (batch_size, channels-RGB, height, width).
    # A single picture converted to a tensor only has three dimensions (channels-RGB, height, width), so we add one dimension.
    # /255 is because the model expects values between 0 and 1, while the pixel values of the picture are 0~255, so we divide by 255 for normalization.
    return torchvision.transforms.PILToTensor()(pil_image).unsqueeze(0) / 255.0
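
As a quick check (a hypothetical snippet, assuming valid_ids from Task 2 is available), the two helpers should produce a batched float tensor with values in [0, 1]:

# Hypothetical sanity check of the two helpers
sample_tensor = pil2tensor(load_image(valid_ids[0]))
print(sample_tensor.shape)   # torch.Size([1, 3, H, W]) for an RGB image
print(sample_tensor.min().item(), sample_tensor.max().item())  # values within [0, 1]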

Running the models for prediction

After the pre-trained models are loaded, we run them on the ten selected images to get predictions. The process is as follows:

# Save the prediction results
predictions_res = []
predictions_mobile = []

# Iterate over each ID; these are the 10 IDs we randomly selected above
for i in valid_ids:
    print(i)
    # Transform the PIL image to a tensor
    img_as_tensor = pil2tensor(load_image(i))
    # Feed the tensor to the ResNet model
    prediction = model_res(img_as_tensor)
    # Save the prediction result: a dictionary containing the predicted bounding boxes, labels, and scores
    predictions_res.append(prediction)
    # Feed the tensor to the MobileNet model
    prediction = model_mobile(img_as_tensor)
    # Save the prediction result: a dictionary containing the predicted bounding boxes, labels, and scores
    predictions_mobile.append(prediction)
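
Each element appended above is the raw output of a torchvision detection model: a list with one dictionary per input image, holding the keys boxes, labels, and scores. A quick way to inspect it (a hedged example, assuming at least one detection was made):

# Inspect the structure of one prediction
first = predictions_res[0][0]
print(first.keys())           # dict_keys(['boxes', 'labels', 'scores'])
print(first['boxes'].shape)   # [num_detections, 4] in (x1, y1, x2, y2) format
print(first['scores'][:5])    # confidence scores, sorted from highest to lowest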

Only keep prediction results with score > 0.8

After running the models, we keep only the predictions with a score greater than 0.8. The reason is that the model predicts many bounding boxes, but we only want the ones with high confidence, so we filter out the low-confidence boxes. The code is as follows:

def filter_valid_boxes(predictions, threshold=0.8):
    # Used to store the filtered prediction results
    valid_boxes_list = []
    # Iterate over each prediction result
    for prediction in predictions:
        valid_boxes_for_this_prediction = []
        # Iterate over each bounding box
        for box, label, score in zip(prediction[0]["boxes"], prediction[0]["labels"], prediction[0]["scores"]):
            # Only keep predicted bounding boxes with a score greater than the threshold
            if score >= threshold:
                # Save the predicted bounding box, label, and score
                valid_boxes_for_this_prediction.append((box, label, score))
        # If none of the predicted boxes in this picture exceed the threshold, an empty list is stored
        valid_boxes_list.append(valid_boxes_for_this_prediction)
    # Return the filtered prediction results
    return valid_boxes_list

# Set the threshold to 0.8 and get the filtered predictions of ResNet and MobileNet
valid_boxes_res = filter_valid_boxes(predictions_res, threshold=0.8)
valid_boxes_mobile = filter_valid_boxes(predictions_mobile, threshold=0.8)
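
As a small optional follow-up, you can count how many boxes survived the threshold for each image and each backbone:

# Number of boxes kept per image for each backbone
print([len(b) for b in valid_boxes_res])
print([len(b) for b in valid_boxes_mobile])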

Task 4+6: Visualization + IoU

Tasks 4 & 6

  4. Visualize the model predictions together with the solution: Visualize the predicted bounding boxes and labels together with the ground truth bounding boxes and labels.
  6. Calculate IoU to compare models: Which backbone delivers the better results? Calculate the IoU for both approaches.

There are a few important points in the visualization; the steps are as follows:

  • We first need to know the ID of the image and get the annotation information based on that ID, so that we can calculate the IoU.
  • We take the annotation information and the model predictions and use them to calculate the IoU.
  • We read the location of the image on disk and, based on the image path, first draw the image through plt; then, on top of the image, we draw the predicted boxes and labels, as well as the average IoU.

The following program implements the procedure described above: we draw the results of both models and calculate the average IoU.

import matplotlib.pyplot as plt
from PIL import Image
import os


# The results of different models can be passed into this function, and it returns the average IoU
def display_annotated_results(imgId, valid_boxes, model_name, color='g', ax=None):
    # Load the image
    imgInfo = coco.loadImgs(imgId)[0]
    image_path = os.path.join(cocoRoot, dataType, imgInfo['file_name'])
    image = Image.open(image_path)

    # Get the ground truth bounding boxes
    annIds = coco.getAnnIds(imgIds=imgInfo['id'])
    anns = coco.loadAnns(annIds)
    bbox_tlist_anns = torch.tensor([ann["bbox"] for ann in anns])  # shape [N, 4]
    # Our ground truth boxes are (x, y, w, h): the top-left corner of the box (x, y) plus its width and height.
    # But torchvision's box_iou expects the top-left corner (x1, y1) and the bottom-right corner (x2, y2),
    # so we compute (x2, y2) as (x + w, y + h).
    # x,y,w,h -> x1,y1,x2,y2 = x,y,x+w,y+h
    bbox_tlist_anns[:, 2] = bbox_tlist_anns[:, 0] + bbox_tlist_anns[:, 2]
    bbox_tlist_anns[:, 3] = bbox_tlist_anns[:, 1] + bbox_tlist_anns[:, 3]

    # From valid_boxes we only need the box part, so we use (box, _, _)
    # Use stack because we want to stack all the boxes together into one tensor
    bbox_tlist_model = torch.stack([box for box, _, _ in valid_boxes])  # shape [M, 4]
    # Use box_iou to calculate the IoU
    iou = torchvision.ops.box_iou(bbox_tlist_anns, bbox_tlist_model)  # get IoU
    # Take the maximum value for each ground truth box, then calculate the average IoU
    avg_iou = np.mean([t.cpu().detach().numpy().max() for t in iou])  # calculate the mean of IoU

    # Labels shown in the image title
    all_labels = set()

    # Start drawing the predicted boxes
    for boxes in valid_boxes:
        # Get the information of the predicted box: box, label, and score
        box, label, score = boxes
        # Get the text of the label: load the category name by category ID
        label = coco.loadCats(label.item())[0]["name"]
        # Save the label for later display
        all_labels.add(label)
        # The model returns two corners, top-left and bottom-right, so we convert them into (x, y, w, h) form for Rectangle
        x, y, x2, y2 = box.detach().numpy()  # x1,y1,x2,y2 -> x,y,w,h = x,y,x2-x,y2-y
        rect = Rectangle((x, y), x2 - x, y2 - y, linewidth=2, edgecolor=color, facecolor='none')

        # Draw onto the figure: if the pictures need to be arranged in a grid, ax specifies where to draw
        if ax is None:
            # gca gets the current axes (creating one if needed), then the predicted box is drawn through add_patch
            plt.gca().add_patch(rect)
            # Draw the label on the predicted box
            plt.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor=color)
        else:
            # add_patch adds a patch to the given axes, drawing the predicted box on the ax
            ax.add_patch(rect)
            # Draw the label on the predicted box
            ax.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor=color)

    # Display the image and give it a title: the labels that appear in this picture and the average IoU
    if ax is None:
        plt.axis('off')
        plt.title(f'{model_name}: {all_labels} \n IoU: {avg_iou:.4f}', color=color)
        plt.imshow(image)
        plt.show()

    else:
        ax.axis('off')
        ax.set_title(f'{model_name}: {all_labels} \n IoU: {avg_iou:.4f}', color=color)
        ax.imshow(image)

    return avg_iou


res_iou = []
mobile_iou = []

# Iterate over each index of the 10 random IDs we selected above
for i in range(len(valid_ids)):
    # Create a 1x3 grid of images, with the whole figure sized 15x5
    fig, axs = plt.subplots(1, 3, figsize=(15, 5))

    # Draw the ground truth image and display it in the center
    plot_image_with_annotations(coco, cocoRoot, dataType, valid_ids[i], ax=axs[1])

    # Draw the predicted results from the two models on the left and right sides respectively, and get back the IoU
    i_mobil_iou = display_annotated_results(valid_ids[i], valid_boxes_mobile[i], "mobile", color='g', ax=axs[0])
    i_res_iou = display_annotated_results(valid_ids[i], valid_boxes_res[i], "ResNet", color='b', ax=axs[2])

    # Store the IoU of each image to assess the overall performance of the models
    mobile_iou.append(i_mobil_iou)
    res_iou.append(i_res_iou)

    # Organize the layout
    plt.tight_layout()

# Print the mean of the IoU lists
print("ResNet: Avg.", np.mean(res_iou), "; each IoU:", res_iou)
print("MobileNet: Avg.", np.mean(mobile_iou), "; each IoU:", mobile_iou)

Result

Supplement: IoU

IoU (Intersection over Union) is a metric used to evaluate object detection algorithms. It is defined as the intersection area of the predicted box and the true box divided by their union area. The value ranges between 0 and 1, where a higher value indicates a greater overlap between the predicted and true boxes, signifying more accurate predictions.



Source: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/
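
To make the definition concrete, here is a small, self-contained sketch (with made-up box coordinates) that computes the IoU of two boxes by hand and checks it against torchvision.ops.box_iou:

import torch
import torchvision

# Manual IoU using plain numbers for clarity
xa1, ya1, xa2, ya2 = 10.0, 10.0, 50.0, 50.0   # ground truth box (x1, y1, x2, y2)
xb1, yb1, xb2, yb2 = 30.0, 30.0, 70.0, 70.0   # predicted box (x1, y1, x2, y2)
inter_w = max(0.0, min(xa2, xb2) - max(xa1, xb1))   # 20
inter_h = max(0.0, min(ya2, yb2) - max(ya1, yb1))   # 20
intersection = inter_w * inter_h                    # 400
union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - intersection  # 2800
print(intersection / union)                         # 0.142857...

# Same result from torchvision (expects tensors of shape [N, 4])
print(torchvision.ops.box_iou(torch.tensor([[xa1, ya1, xa2, ya2]]),
                              torch.tensor([[xb1, yb1, xb2, yb2]])))  # tensor([[0.1429]])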

From the walkthrough above, you might be wondering: what exactly does the following code segment do?

torchvision.ops.box_iou(bbox_tlist_anns, bbox_tlist_model) 
  • Essentially, it calculates the Intersection over Union (IoU) between every ground truth bounding box and every bounding box predicted by the model. This process returns a tensor with the shape (number of ground truth bounding boxes, number of the model’s predicted bounding boxes). Refer to the images below.

Here, we only need to obtain the maximum IoU for each ground truth bounding box and calculate the average value, so we use the following code

# After obtaining the maximum value for each ground truth bounding box (see the supplement on IoU for details), calculate the average IoU
avg_iou = np.mean([t.cpu().detach().numpy().max() for t in iou]) # calculate the mean of IoU

You might be curious whether using functions like max(), mean(), or sum() here would affect our results.


As we can see from the above image

  • Using sum(), you may find that the value can exceed 1, which is not a reasonable range for IoU values.
  • Using max(), we take, for each ground truth bounding box, the best-matching predicted box from the model as its IoU. We then collect these maximum IoU values over all ground truth bounding boxes and average them to get the overall IoU.
  • Using mean() poses a problem because the resulting IoU can never reach 1: it also averages in the IoUs against the non-matching boxes, which drags the overall value down. For instance, if the ground truth has two bounding boxes [A1, A2] and the model predicts two boxes [B1, B2], where B1 clearly matches A1 and B2 matches A2 and both predictions are accurate, mean() would also average in the pairs (A1, B2) and (A2, B1), which have low IoUs. Using mean() this way unjustly lowers the IoU, making it unreasonable.
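
A tiny made-up example illustrates the difference. Suppose the IoU matrix between two ground truth boxes and two predicted boxes (each prediction matching one ground truth almost perfectly) looks like this:

import numpy as np

# Hypothetical IoU matrix: rows are ground truth boxes, columns are predicted boxes
iou = np.array([[0.9, 0.1],    # A1 vs (B1, B2)
                [0.1, 0.9]])   # A2 vs (B1, B2)

print(iou.max(axis=1).mean())  # 0.9  -> max per ground truth box, then average (what we use)
print(iou.mean())              # 0.5  -> dragged down by the non-matching pairs
print(iou.sum())               # 2.0  -> exceeds 1, not a valid IoU value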