Preface

I recently enrolled in an AI course, and this is its fourth assignment. It covers the following topics:

Download the COCO dataset
Use a pre-trained version of Faster R-CNN to predict bounding boxes
Calculate IoU

Assignment requirements

1. Download the COCO dataset: Download the files "2017 Val images [5K/1GB]" and "2017 Train/Val annotations [241MB]" from the COCO page. You can use the library pycocotools to load them into your notebook.
2. Randomly pick ten images: Randomly select 10 images from this dataset.
3. Predict bounding boxes with a pre-trained Faster R-CNN: Use a pre-trained version of Faster R-CNN (ResNet50 backbone) to predict the bounding box of objects on the 10 images. Only keep regions that have a score > 0.8.
4. Show predictions and ground truth side by side: Visualize the predicted bounding boxes and label together with the ground truth bounding boxes and label. Show all 10 pairs of images side by side in the jupyter notebook.
5. Repeat with a pre-trained MobileNet backbone: Repeat the steps from above using a MobileNet backbone for the Faster R-CNN.
6. Compare the models with IoU: Which backbone delivers the better results? Calculate the IoU for both approaches.

Task 1: Download the COCO dataset

Task 1

1. Download the COCO dataset: Download the files "2017 Val images [5K/1GB]" and "2017 Train/Val annotations [241MB]" from the COCO page. You can use the library pycocotools to load them into your notebook.

You can follow this guide for the download: https://jason-chen-1992.weebly.com/home/coco-dataset

.
├── annotations # annotation files
│   ├── captions_train2017.json
│   ├── captions_val2017.json
│   ├── instances_train2017.json
│   ├── instances_val2017.json
│   ├── person_keypoints_train2017.json
│   └── person_keypoints_val2017.json
└── val2017 # image set
    ├── 000000000139.jpg
    ├── 000000000285.jpg
    ├── 000000000632.jpg
    ├── 000000000724.jpg
    ├── 000000000776.jpg
    ├── 000000000785.jpg
    ├── 000000000802.jpg
    ...

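Download these two files from the official site, as shown in Figure 1.
After extracting the archives, the folder will have the file structure shown above.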
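Before going further it can help to confirm that the extracted folder really matches the structure above. The following is a minimal sketch, not part of the assignment: the helper name `missing_coco_paths` and the list `EXPECTED_PATHS` are made up for illustration, and the demo runs against a throwaway temporary directory rather than the real dataset.

```python
import os
import tempfile

# Paths, relative to the dataset root, that the rest of this post relies on.
# (Hypothetical constant, chosen for this sketch.)
EXPECTED_PATHS = [
    os.path.join("annotations", "instances_val2017.json"),
    "val2017",
]

def missing_coco_paths(root):
    """Return the expected paths that do not exist under root."""
    return [p for p in EXPECTED_PATHS if not os.path.exists(os.path.join(root, p))]

# Demo on a temporary directory that mimics the layout shown above
with tempfile.TemporaryDirectory() as demo_root:
    print(missing_coco_paths(demo_root))  # nothing created yet, so both paths are reported
    os.makedirs(os.path.join(demo_root, "annotations"))
    os.makedirs(os.path.join(demo_root, "val2017"))
    open(os.path.join(demo_root, "annotations", "instances_val2017.json"), "w").close()
    print(missing_coco_paths(demo_root))  # []
```

With the real dataset you would call `missing_coco_paths` on your own dataset root and expect an empty list.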

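Task 2: Randomly select ten images

Task 2
2. Randomly pick ten images: Randomly select 10 images from this dataset.

In this part we do the following:

Import the required packages
Set up the COCO API so it can load the information about our dataset, such as bounding box positions, labels, and image metadata
Visualize an image and draw its annotations
Randomly select ten images

First, import the required packages.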


# CNN 
import torch.nn.functional as F
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import torch.backends.cudnn as cudnn

# others
import numpy as np
import matplotlib.pyplot as plt
import time
import os
from PIL import Image
from tempfile import TemporaryDirectory
import random

# torchvision
import torchvision
import torchvision.transforms as transforms

# dataset 
from pycocotools.coco import COCO
import cv2


cudnn.benchmark = True
plt.ion()   # interactive mode

Setting up the COCO API

COCO provides an API for accessing the dataset. Given the annotation JSON file, it lets us look up whatever we need, such as images, labels, and bounding boxes.

# Dataset location
cocoRoot = "../../Data/Coco/"
dataType = "val2017"

# Path to the annotation file
annFile = os.path.join(cocoRoot, f'annotations/instances_{dataType}.json')
print(f'Annotation file: {annFile}')
# Initialize the COCO API for instance annotations
coco = COCO(annFile)
coco

The result is as follows

Annotation file: ../../Data/Coco/annotations/instances_val2017.json
## The lines below indicate that the annotation file was loaded successfully
loading annotations into memory...
Done (t=0.35s)
creating index...
index created!
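To make the "creating index" step less opaque, here is a rough sketch of the kind of lookup tables such an index provides. It is an illustration only, not pycocotools' actual implementation: the `dataset` dict is a tiny hand-made stand-in for `instances_val2017.json` with made-up values, and `labels_for_image` is a hypothetical helper.

```python
from collections import defaultdict

# A tiny hand-made stand-in for the annotation file (all values are made up)
dataset = {
    "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 18, "bbox": [10.0, 20.0, 30.0, 40.0]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [5.0, 5.0, 50.0, 80.0]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [0.0, 0.0, 15.0, 25.0]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
}

# Lookup tables keyed by id, so later queries do not need to scan the whole file
images_by_id = {img["id"]: img for img in dataset["images"]}
category_names = {cat["id"]: cat["name"] for cat in dataset["categories"]}
anns_by_image = defaultdict(list)
for ann in dataset["annotations"]:
    anns_by_image[ann["image_id"]].append(ann)

def labels_for_image(image_id):
    """Return the category names of all annotations attached to one image."""
    return [category_names[a["category_id"]] for a in anns_by_image[image_id]]

print(images_by_id[1]["file_name"], labels_for_image(1))  # a.jpg ['dog', 'person']
```

In the real API the same three kinds of lookup correspond to `loadImgs`, `getAnnIds` / `loadAnns`, and `loadCats`, which are used in the code that follows.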

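Visualizing the annotations

To make sure we know how to use the API that COCO provides, here is a small exercise that covers the following:

Get image info by id
Get annotation info by id
Draw bounding boxes and labels on an image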

import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

# Given an image id, draw the image together with its ground-truth bounding boxes and labels
def plot_image_with_annotations(coco, cocoRoot, dataType, imgId, ax=None):
    # Get the image info
    imgInfo = coco.loadImgs(imgId)[0]
    # Build the image path for visualization
    imPath = os.path.join(cocoRoot, dataType, imgInfo['file_name'])
    # Read the image
    im = cv2.imread(imPath)
    # cv2 loads images as BGR, while matplotlib expects RGB, so convert the color space
    im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)

    # Find the ids of all annotations (bounding boxes) for this image
    annIds = coco.getAnnIds(imgIds=imgInfo['id'])
    # Load the annotations; each holds the box coordinates and a category id (ground truth has no score)
    anns = coco.loadAnns(annIds)
    all_labels = set()

    # Draw every annotation
    for ann in anns:
        # bbox is [x, y, w, h]: (x, y) is the top-left corner of the box, w its width, h its height
        x, y, w, h = ann['bbox']

        # Get the label text: load the category name by category id
        label = coco.loadCats(ann['category_id'])[0]["name"]
        all_labels.add(label)

        # Build the bounding box rectangle from the coordinates
        rect = Rectangle((x, y), w, h, linewidth=2, edgecolor='r', facecolor='none')

        # Passing ax lets us place the image in a specific subplot when laying out several images
        if ax is None:
            # No ax given: gca() returns the current axes (creating one if needed).
            # Think of an axes as a canvas on which you can draw points, lines, images, and text.
            # add_patch draws the box on it
            plt.gca().add_patch(rect)
            # plt.text() adds the label text to the current axes
            plt.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor='r')
        # An ax was given, so draw directly on it
        else:
            ax.add_patch(rect)
            ax.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor='r' )

    # Show the image and give it a title
    if ax is None:
        plt.imshow(im)
        plt.axis('off')
        plt.title(f'Ans: {all_labels}', color='r')
        plt.show()
    else:
        ax.axis('off')
        ax.set_title(f'Ans: {all_labels}', color='r',loc='center', pad=20)
        ax.imshow(im)


# Take the image at index 10 (the 11th image)
imgIds = coco.getImgIds()
imgId = imgIds[10]
# Draw it
plot_image_with_annotations(coco, cocoRoot, dataType, imgId)

Output

Randomly selecting 10 images

def random_select(coco, cocoRoot, dataType, num_images=10):
    # Get the ids of all images
    imgIds = coco.getImgIds()
    # Randomly sample num_images ids (without replacement)
    selected_imgIds = random.sample(imgIds, num_images)
    # Loop over the selected ids
    for imgId in selected_imgIds:
        # Draw the image for this id
        plot_image_with_annotations(coco, cocoRoot, dataType, imgId)

    # Return the selected ids
    return selected_imgIds

valid_ids = random_select(coco, cocoRoot, dataType, num_images=10)
valid_ids

Output
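Because the ten images are drawn at random, every run of the notebook compares the models on a different set. If you want the comparison to be repeatable, one option is to sample with a seeded generator. This is a sketch under that assumption: `sample_ids` and the `seed` parameter are hypothetical additions, and `fake_ids` stands in for the list returned by `coco.getImgIds()`.

```python
import random

def sample_ids(all_ids, num_images=10, seed=None):
    """Sample num_images distinct ids; the same seed always gives the same ids."""
    rng = random.Random(seed)  # a private generator, so the global random state is untouched
    return rng.sample(list(all_ids), num_images)

# Stand-in for coco.getImgIds() (made-up ids)
fake_ids = list(range(1000, 1100))

first = sample_ids(fake_ids, seed=42)
second = sample_ids(fake_ids, seed=42)
print(first == second)  # True
```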

Task 3+5: ResNet50 vs. MobileNet backbones

Task 3 & 5
3. Predict bounding boxes with a pre-trained Faster R-CNN: Use a pre-trained version of Faster R-CNN (ResNet50 backbone) to predict the bounding box of objects on the 10 images. Only keep regions that have a score > 0.8.
5. Repeat with a pre-trained MobileNet backbone: Repeat the steps from above using a MobileNet backbone for the Faster R-CNN.

Loading the pre-trained models

# Load the pre-trained Faster R-CNN with a ResNet50 backbone
model_res = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="FasterRCNN_ResNet50_FPN_Weights.DEFAULT")
model_res.eval()

# Load the pre-trained Faster R-CNN with a MobileNetV3 backbone
# weights must be an enum member such as .DEFAULT, not the enum class itself
model_mobile = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_fpn(weights=torchvision.models.detection.FasterRCNN_MobileNet_V3_Large_FPN_Weights.DEFAULT)
model_mobile.eval()

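Converting images to tensors

First we need to read an image from its path, then convert the loaded image into a tensor before it can be fed to the model for prediction. So we write two functions:

One reads the image
One converts the image to a tensor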

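from PIL import Image

def load_image(imgIdx):
    # Get the image info
    imgInfo = coco.loadImgs(imgIdx)[0]
    # Build the image path
    imPath = os.path.join(cocoRoot, dataType, imgInfo['file_name'])
    # Print the image path
    print(imPath)
    try:
        # Read the image; convert to RGB because some COCO images are grayscale
        return Image.open(imPath).convert("RGB")
    except OSError as err:
        # Re-raise with the path so the failing file is easy to identify
        raise RuntimeError(f"Could not load image: {imPath}") from err

# Convert a PIL image to a tensor so it can be fed to the model
def pil2tensor(pil_image):
    # The detection models take a batch of images; a 4D tensor (batch_size, channels, height, width) works as a batch.
    # A single image converts to only three dimensions (channels, height, width), so unsqueeze(0) adds the batch dimension
    # Dividing by 255 rescales pixel values from 0~255 to the 0~1 range the model expects
    return torchvision.transforms.PILToTensor()(pil_image).unsqueeze(0) / 255.0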

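Running inference

With the preparation done, we can run the pre-trained models on the images and store the returned results for the visualization later. No training happens here; the models are only used for prediction.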

# Store the predictions
predictions_res = []
predictions_mobile = []

# Loop over the 10 ids randomly selected above
# no_grad: this is inference only, so skip building the autograd graph
with torch.no_grad():
    for i in valid_ids:
        print(i)
        # transform to tensor from PIL image
        img_as_tensor = pil2tensor(load_image(i))
        # put the tensor to resnet model
        prediction = model_res(img_as_tensor)
        # Store the result: a list with one dict per image, holding the predicted boxes, labels, and confidence scores
        predictions_res.append(prediction)
        # put the tensor to mobilenet model
        prediction = model_mobile(img_as_tensor)
        # Store the result in the same format for the MobileNet model
        predictions_mobile.append(prediction)

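Keeping only predictions with score > 0.8

After collecting all the results, we keep only the predicted boxes whose confidence score is above 0.8, so the visualization does not get too cluttered.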

def filter_valid_boxes(predictions, threshold=0.8):
    # Holds the filtered predictions, one list per image
    valid_boxes_list = []
    # Loop over the prediction of each image
    for prediction in predictions:
        valid_boxes_for_this_prediction = []
        # Loop over every predicted box of this image
        for box, label, score in zip(prediction[0]["boxes"], prediction[0]["labels"], prediction[0]["scores"]):
            # Keep boxes whose score is above the threshold (the task asks for score > 0.8)
            if score > threshold:
                # Store the box, label, and score together
                valid_boxes_for_this_prediction.append((box, label, score))
        # If no box of this image passes the threshold, an empty list is stored
        valid_boxes_list.append(valid_boxes_for_this_prediction)
    # Return the filtered predictions
    return valid_boxes_list

# Set the threshold to 0.8 and filter the ResNet and MobileNet predictions
valid_boxes_res = filter_valid_boxes(predictions_res, threshold=0.8)
valid_boxes_mobile = filter_valid_boxes(predictions_mobile, threshold=0.8)
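The indexing with `prediction[0]` above is easier to follow once the shape of a model output is clear. The sketch below mimics that shape with plain Python lists standing in for tensors; the values are made up and `keep_confident` is a hypothetical helper that filters a single image, not the function used in this post.

```python
# Stand-in for one model output: a list holding one dict per input image.
# Real outputs hold tensors; plain lists with made-up values are used here.
fake_prediction = [{
    "boxes": [[10.0, 20.0, 110.0, 220.0], [30.0, 40.0, 60.0, 90.0], [0.0, 0.0, 5.0, 5.0]],
    "labels": [1, 18, 3],
    "scores": [0.98, 0.81, 0.42],
}]

def keep_confident(prediction, threshold=0.8):
    """Return (box, label, score) tuples whose score is strictly above threshold."""
    output = prediction[0]  # we passed a single image, so the list has one entry
    rows = zip(output["boxes"], output["labels"], output["scores"])
    return [(box, label, score) for box, label, score in rows if score > threshold]

kept = keep_confident(fake_prediction)
print([(label, score) for _, label, score in kept])  # [(1, 0.98), (18, 0.81)]
```

An image where nothing passes the threshold yields an empty list, which is a case the later IoU code has to cope with.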

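Task 4+6: Visualization + IoU

Task 4 & 6
4. Show predictions and ground truth side by side: Visualize the predicted bounding boxes and label together with the ground truth bounding boxes and label. Show all 10 pairs of images side by side in the jupyter notebook.
6. Compare the models with IoU: Which backbone delivers the better results? Calculate the IoU for both approaches.

The visualization involves a few key steps, roughly as follows:

Start from the image id and fetch its annotations, which are needed to compute the IoU
Compute the IoU between the annotation boxes and the model's boxes
Locate the image on disk and draw it with plt from its path
On top of that image, draw the predicted boxes, the labels, and the average IoU

The code below implements these steps. It draws the results of both models and computes the average IoU.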

import matplotlib.pyplot as plt
from PIL import Image
import os


# Takes the filtered results of one model, draws them, and returns the average IoU
def display_annotated_results(imgId, valid_boxes, model_name, color='g', ax=None):
    # Load the image
    imgInfo = coco.loadImgs(imgId)[0]
    image_path = os.path.join(cocoRoot, dataType, imgInfo['file_name'])
    image = Image.open(image_path)

    # Get the ground-truth bounding boxes
    annIds = coco.getAnnIds(imgIds=imgInfo['id'])
    anns = coco.loadAnns(annIds)

    if len(anns) == 0 or len(valid_boxes) == 0:
        # Nothing to match (no ground truth, or no prediction above the threshold).
        # Reporting 0 here is a choice made to avoid a crash in torch.stack / indexing
        avg_iou = 0.0
    else:
        bbox_tlist_anns = torch.tensor([ann["bbox"] for ann in anns], dtype=torch.float32) # shape [num_gt, 4]
        # COCO boxes are (x, y, w, h): the top-left corner plus the width and height of the box.
        # torchvision's box_iou expects (x1, y1, x2, y2): the top-left and bottom-right corners,
        # so convert with x2 = x + w and y2 = y + h
        bbox_tlist_anns[:, 2] = bbox_tlist_anns[:, 0] + bbox_tlist_anns[:, 2]
        bbox_tlist_anns[:, 3] = bbox_tlist_anns[:, 1] + bbox_tlist_anns[:, 3]

        # From valid_boxes keep only the box of each (box, label, score) tuple,
        # and stack the boxes into a single tensor
        bbox_tlist_model = torch.stack([box for box, _, _ in valid_boxes]) # shape [num_pred, 4]
        # Pairwise IoU between ground truth and predictions, shape [num_gt, num_pred]
        iou = torchvision.ops.box_iou(bbox_tlist_anns, bbox_tlist_model) # get IoU
        # For each ground-truth box take its maximum IoU (see the IoU supplement), then average
        avg_iou = np.mean([t.cpu().detach().numpy().max() for t in iou]) # calculate the mean of IoU

    # Labels to show in the title
    all_labels = set()

    # Draw the predicted boxes
    for boxes in valid_boxes:
        # Each entry holds box, label, score
        box, label, score = boxes
        # Get the label text: load the category name by category id
        label = coco.loadCats(label.item())[0]["name"]
        # Remember the label for the title
        all_labels.add(label)
        # The model returns two corners (top-left and bottom-right); Rectangle needs x, y, w, h
        x, y, x2, y2 = box.detach().cpu().numpy() # w = x2 - x, h = y2 - y
        rect = Rectangle((x, y), x2 - x, y2 - y, linewidth=2, edgecolor=color, facecolor='none')

        # Passing ax lets us place the image in a specific subplot when laying out several images
        if ax is None:
            # gca() returns the current axes (creating one if needed); add_patch draws the box on it
            plt.gca().add_patch(rect)
            # Draw the label on the box
            plt.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor=color)
        else:
            # An ax was given, so draw on it directly instead of using gca()
            ax.add_patch(rect)
            # Draw the label on the box
            ax.text(x, y, f'{label}', fontsize=10, color='w', backgroundcolor=color)

    # Show the image with a title listing the predicted labels and the average IoU
    if ax is None:
        plt.axis('off')
        plt.title(f'{model_name}: {all_labels} \n IoU: {avg_iou:.4f}', color=color)
        plt.imshow(image)
        plt.show()

    else:
        ax.axis('off')
        ax.set_title(f'{model_name}: {all_labels} \n IoU: {avg_iou:.4f}', color=color)
        ax.imshow(image)

    return avg_iou


res_iou = []
mobile_iou = []


# Loop over the 10 ids randomly selected above
for i in range(len(valid_ids)):
    # Create a figure with 1x3 subplots; the whole figure is 15x5 inches
    fig, axs = plt.subplots(1, 3, figsize=(15, 5))

    # Draw the ground truth in the middle subplot
    plot_image_with_annotations(coco, cocoRoot, dataType, valid_ids[i], ax=axs[1])

    # Draw the predictions of the two models on the left and right, and get their IoU
    i_mobil_iou = display_annotated_results(valid_ids[i], valid_boxes_mobile[i], "mobile", color='g', ax=axs[0])
    i_res_iou = display_annotated_results(valid_ids[i], valid_boxes_res[i], "ResNet", color='b', ax=axs[2])

    # Store the IoU of each image to judge the overall performance of each model
    mobile_iou.append(i_mobil_iou)
    res_iou.append(i_res_iou)

    # organize the layout
    plt.tight_layout()

# print the mean of IoU list
print("ResNet: Avg.", np.mean(res_iou), "; each IoU:", res_iou)
print("MobileNet: Avg.", np.mean(mobile_iou), "; each IoU:", mobile_iou)

Output
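The coordinate conversion inside `display_annotated_results` is a common source of mistakes, so here it is isolated as a small sketch. The function names `xywh_to_xyxy` and `xyxy_to_xywh` are made up for illustration, and plain lists are used instead of tensors. Image coordinates are assumed, with the origin at the top-left corner of the image.

```python
def xywh_to_xyxy(box):
    """COCO style (x, y, w, h) -> corner style (x1, y1, x2, y2)."""
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    """Corner style (x1, y1, x2, y2) -> COCO style (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

coco_box = [10.0, 20.0, 30.0, 40.0]       # top-left at (10, 20), 30 wide, 40 tall
corners = xywh_to_xyxy(coco_box)
print(corners)                            # [10.0, 20.0, 40.0, 60.0]
print(xyxy_to_xywh(corners) == coco_box)  # True
```

The first direction is what the ground-truth boxes need before `box_iou`; the second is what the model's boxes need before they are passed to `Rectangle`.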

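Supplement: IoU

IoU (Intersection over Union) is a metric for evaluating object detection algorithms. It is defined as the area of the intersection of the predicted box and the ground-truth box divided by the area of their union. Its value lies between 0 and 1; the larger the value, the more the two boxes overlap, which means the prediction is more accurate.

Source: https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection/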
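As a concrete version of the definition above, the following is a minimal pure-Python sketch of IoU for two boxes given as corners (x1, y1, x2, y2). It is written for illustration and is not the torchvision implementation; `box_area` and `iou` are names chosen here.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2); degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(box_a, box_b):
    """Intersection area divided by union area of two corner-format boxes."""
    # The intersection is itself a box: the overlap of the two boxes on each axis
    inter_box = [
        max(box_a[0], box_b[0]),
        max(box_a[1], box_b[1]),
        min(box_a[2], box_b[2]),
        min(box_a[3], box_b[3]),
    ]
    inter = box_area(inter_box)  # 0 when the boxes do not overlap
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes that share half their area: intersection 50, union 150
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))
# Identical boxes overlap completely; far-apart boxes do not overlap at all
print(iou([0, 0, 10, 10], [0, 0, 10, 10]), iou([0, 0, 10, 10], [20, 20, 30, 30]))
```

The first call gives one third, the second line shows the two extremes of the range, 1 and 0.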

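From the example above, you may wonder what the following line actually does.

torchvision.ops.box_iou(bbox_tlist_anns, bbox_tlist_model)

It computes the IoU for every pair of a ground-truth box and a model box, and returns a tensor of shape (number of ground-truth boxes, number of model boxes); see the figure below.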
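The pairwise behaviour can be imitated in a few lines of plain Python. This sketch only mirrors the shape and meaning of the result, not torchvision's vectorized implementation; `iou` and `pairwise_iou` are illustrative names and the boxes are made up.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def pairwise_iou(gt_boxes, pred_boxes):
    """Row i, column j holds the IoU of ground-truth box i with predicted box j."""
    return [[iou(gt, pred) for pred in pred_boxes] for gt in gt_boxes]

gt = [[0, 0, 10, 10], [20, 20, 30, 30]]                              # 2 ground-truth boxes
pred = [[0, 0, 10, 10], [20, 20, 30, 30], [100, 100, 110, 110]]      # 3 predicted boxes
matrix = pairwise_iou(gt, pred)
print(len(matrix), len(matrix[0]))  # 2 3
print(matrix)                       # [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```

Each row belongs to one ground-truth box, which is why the post takes the maximum along each row.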

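Here we only need the maximum IoU of each ground-truth box, followed by the average of those maxima, so we use the following code.

# For each ground-truth box take its maximum IoU, then average
avg_iou = np.mean([t.cpu().detach().numpy().max() for t in iou]) # calculate the mean of IoU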

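You may wonder whether the choice between max(), mean(), and sum() affects the result.

From the figure above we can see that:

With sum(), the value can exceed 1, which is outside the valid range of an IoU.
With max(), each ground-truth box is paired with the model box that matches it best, and that IoU is taken as the IoU of the ground-truth box. Averaging these maxima over all ground-truth boxes gives the overall IoU.
With mean(), the value cannot reach 1 once the model outputs more than one distinct box, because the IoU with the non-matching boxes is counted as well and pulls the value down. For example, suppose the ground truth has two boxes [A1, A2] and the model also produces two boxes [B1, B2]. It is obvious that B1 predicts A1 and B2 predicts A2, and the model predicts both accurately. mean(), however, also counts the mismatched pairs (B1 against A2, and B2 against A1), whose IoU is very low, so the score drops even though the prediction is good, which is not reasonable.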
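The three options can be compared on small hand-written IoU matrices. The numbers below are made up for illustration (rows are ground-truth boxes A1 and A2, columns are model boxes B1 and B2), and `average_over_gt` and `row_mean` are hypothetical helpers.

```python
def average_over_gt(iou_matrix, reduce_row):
    """Reduce each ground-truth row to one number, then average over the rows."""
    per_gt = [reduce_row(row) for row in iou_matrix]
    return sum(per_gt) / len(per_gt)

def row_mean(row):
    return sum(row) / len(row)

# Perfect prediction: B1 matches A1 exactly and B2 matches A2 exactly
perfect = [[1.0, 0.0],
           [0.0, 1.0]]
# Made-up case where the boxes overlap each other
overlapping = [[0.75, 0.5],
               [0.25, 0.75]]

print(average_over_gt(perfect, max))       # 1.0   -> max rewards the perfect match
print(average_over_gt(perfect, row_mean))  # 0.5   -> mean is dragged down by mismatched pairs
print(average_over_gt(overlapping, sum))   # 1.125 -> sum leaves the valid 0..1 range
```

Only the max-based score stays within 0 and 1 and reaches 1 for a perfect prediction. Note that it does not penalize extra predicted boxes that match no ground-truth box.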