Train a text classification model on the TweetEval emotion recognition dataset using LSTMs and GRUs.
Build an LSTM model: Follow the example described here. Use the same architecture, but:
only use the last output of the LSTM in the loss function
use an embedding dim of 128
use a hidden dim of 256.
Tokenize with spaCy: Use spaCy to split the tweets into words.
Pick the top-5000 words: Limit your vocabulary (i.e. the words that you converted to an index) to the most frequent 5000 words and replace all other words with a placeholder index (e.g. 1001).
Train the model and measure accuracy: Evaluate the accuracy on the test set. (Note: If the training takes too long, try to use only a fraction of the training data.)
Build and train a GRU model: Do the same, but this time use GRUs instead of LSTMs.
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler

# others
import numpy as np
import matplotlib.pyplot as plt
import time
import os
from PIL import Image
from tempfile import TemporaryDirectory

# dataset
import torchvision
from torchvision import datasets, models, transforms
from torchvision.datasets import Flowers102

# read file
import pandas as pd

# label
from scipy.io import loadmat
import json
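The tokenization and vocabulary steps below also rely on spaCy and `collections.Counter`, which are missing from the import block above. A minimal setup sketch, assuming the small English pipeline is installed:

```python
# NLP: spaCy for tokenization, Counter for word frequencies
import spacy
from collections import Counter

# Assumes `python -m spacy download en_core_web_sm` has been run beforehand
nlp = spacy.load('en_core_web_sm')
```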
Next, we convert the data into the format we need. Here we use pandas to process the data and read it into variables so it is easy to work with later.
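The loading code itself is not shown here. Below is a minimal sketch of what it could look like, assuming the file layout of the TweetEval repository (`datasets/emotion/train_text.txt`, `train_labels.txt`, and the corresponding test files) and the variable names used later on (`train_dataset` as a list of tweets, `train_label_pd` as a single-column pandas DataFrame of label ids):

```python
# Assumed paths, following the layout of the cardiffnlp/tweeteval repository
DATA_DIR = 'datasets/emotion'

def read_lines(path):
    # One tweet (or one label) per line
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f]

train_dataset = read_lines(f'{DATA_DIR}/train_text.txt')   # list of tweet strings
test_dataset = read_lines(f'{DATA_DIR}/test_text.txt')

# Labels as single-column DataFrames, so train_label_pd[0][i] is the i-th label id
train_label_pd = pd.read_csv(f'{DATA_DIR}/train_labels.txt', header=None)
test_label_pd = pd.read_csv(f'{DATA_DIR}/test_labels.txt', header=None)
```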
# Model definition. Only the LSTM layer, the output layer and forward() were shown in the
# original snippet; the class wrapper follows the PyTorch sequence-model tutorial referenced above.
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, dropout=0.0):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        # Convert every word in the sentence into its embedding vector
        # (at this point `sentence` is already a tensor of word indices)
        embeds = self.word_embeddings(sentence)
        # Feed the embeddings through the LSTM to get its outputs and hidden state
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        # Take only the last output of the LSTM, as required by the task
        last_output = lstm_out[-1].view(1, -1)
        # Map the last LSTM output into tag space...
        tag_space = self.hidden2tag(last_output)
        # ...and turn the tag scores into log-probabilities
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
In Task 0 we already put all the data we need into lists, where every entry is one sentence. Now there are a few things to do:
2. Tokenize with spaCy: Use spaCy to split the tweets into words.
3. Pick the top-5000 words: Limit your vocabulary (i.e. the words that you converted to an index) to the most frequent 5000 words and replace all other words with a placeholder index (e.g. 1001).
# Join all the sentences together,
# e.g. ['today is good', 'today is bad'] => 'today is good today is bad'
text = ' '.join(train_dataset)

# Use spaCy to tokenize the joined text
doc = nlp(text)

# Count word frequencies, filtering out punctuation, stop words and whitespace
word_freq = Counter(
    token.text for token in doc
    if not token.is_punct and not token.is_stop and not token.is_space
)
word_freq
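The step that turns `word_freq` into the actual 5000-word vocabulary used below (`vocab`, also referred to as `word_to_ix` later on) is not shown. A minimal sketch, assuming indices 0-4999 for the most frequent words:

```python
# Keep only the 5000 most frequent words and give each one an index 0..4999
vocab = {word: idx for idx, (word, _) in enumerate(word_freq.most_common(5000))}

# Later snippets refer to the same mapping as word_to_ix
word_to_ix = vocab
```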
# Convert words to indices; out-of-vocabulary words get the placeholder index 5000,
# since the top-5000 words occupy indices 0-4999
placeholder_index = 5000

# Holds the whole dataset converted into index lists
indexed_dataset = []
for tweet in train_dataset:
    # Result for the current sentence (e.g. "I like apple" -> [100, 3923, 123])
    indexed_words = []
    # Split the sentence into tokens with spaCy
    for token in nlp(tweet):
        # Make sure the token is not punctuation, a stop word, or whitespace
        if not token.is_punct and not token.is_stop and not token.is_space:
            word = token.text
            if word in vocab:
                # The word is among our 5000 most frequent words: use its index
                indexed_words.append(vocab[word])
            else:
                # Otherwise use the placeholder index
                indexed_words.append(placeholder_index)
    indexed_dataset.append(indexed_words)
Based on the explanation above, we can wrap this code into a function, so that during training we can conveniently convert a sentence into a list of indices:
# Convert a sentence into a sequence of word indices
def prepare_sentence_sequence(seq, to_ix):
    idx = []
    # Use spaCy to tokenize the sentence
    for token in nlp(seq):
        # Filter out punctuation, stop words and whitespace
        if not token.is_punct and not token.is_stop and not token.is_space:
            word = token.text
            # If the token is among the top-5000 words in the vocab, add its index to the list
            if word in to_ix:
                idx.append(to_ix[word])
            else:
                # Otherwise add the placeholder index
                idx.append(placeholder_index)
    # Convert the list into a tensor
    return torch.tensor(idx, dtype=torch.long)
Converting the labels to tensors
Next we deal with the labels. They also need to be converted into tensors, so that the model's output can be compared against the ground truth:
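The helper used for this (`one_hot_encode`) and the label mapping `tag_to_ix` are not shown in the write-up. A minimal sketch that matches the `tensor([1., 0., 0., 0.])` output below, assuming the labels are TweetEval's integer ids (0 = anger, 1 = joy, 2 = optimism, 3 = sadness):

```python
# The emotion subset has four classes; the labels file already contains the ids 0-3
tag_to_ix = {0: 0, 1: 1, 2: 2, 3: 3}

def one_hot_encode(label, to_ix):
    # Build a one-hot float vector, e.g. label 0 -> tensor([1., 0., 0., 0.])
    vec = torch.zeros(len(to_ix), dtype=torch.float)
    vec[to_ix[label]] = 1.0
    return vec
```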
# See what the scores are before training.
# Here we don't need to train, so the code is wrapped in torch.no_grad()
sentence_idx = 1  # pick a sample sentence to test
# prints: My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs
print(f'First Sentence = {train_dataset[sentence_idx]}')

with torch.no_grad():
    # Convert the sentence into its index form as a tensor
    inputs = prepare_sentence_sequence(train_dataset[sentence_idx], word_to_ix)
    print(f'Sentence to tensor = {inputs}')
    # prints: tensor([1070, 340, 2015, 2016, 45, 2017])

    # Convert the ground-truth label into a tensor
    labels = one_hot_encode(train_label_pd[0][sentence_idx], tag_to_ix)
    print(f'Sentence of result to tensor = {labels}')
    # prints: tensor([1., 0., 0., 0.])

    # Feed the inputs into the model to get its predictions
    outputs = model(inputs)
    print(f'tag_scores = {outputs}')
    # prints: tensor([[-1.3280, -1.4272, -1.4998, -1.3026]])

# Take the index of the largest score as the predicted class
result_idx = torch.argmax(outputs).item()
print(f'result = {result_idx}, ans = {train_label_pd[0][sentence_idx]}')
# prints: result = 3, ans = 0

# Compute the loss to see the gap between output and label;
# outputs[0] is used because the output has an extra outer dimension
loss = loss_function(outputs[0], labels)
print(f'loss = {loss}')
Output:
First Sentence = My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs
Sentence to tensor = tensor([1070,  340, 2015, 2016,   45, 2017])
Sentence of result to tensor = tensor([1., 0., 0., 0.])
tag_scores = tensor([[-1.3280, -1.4272, -1.4998, -1.3026]])
loss = 1.32795250415802
preds = tensor([3])
result = 3, ans = 0
Looks like everything runs pretty smoothly, right?
Now let's get started for real!
Preparing the training function
Here, in every training epoch, I want to:
Print the training loss and accuracy to keep an eye on how the model is doing.
Keep the best model seen so far.
Measure the training time.
The expected output looks something like this:
Epoch 0/29
----------
train Loss: 1.2157 Acc: 0.4642 Time elapsed: 25 sec.   -- training loss and accuracy, to monitor the model
test Loss: 1.2095 Acc: 0.4553 Time elapsed: 32 sec.    -- test loss and accuracy, to monitor the model

Epoch 1/29
----------
train Loss: 1.1019 Acc: 0.5333 Time elapsed: 58 sec.
test Loss: 1.1816 Acc: 0.4708 Time elapsed: 65 sec.

Epoch 2/29
----------
train Loss: 1.0151 Acc: 0.5812 Time elapsed: 92 sec.
test Loss: 1.1603 Acc: 0.4898 Time elapsed: 99 sec.
...
Training complete in 17m 5s   -- total training time over all epochs
Best val Acc: 0.599578        -- best accuracy; the model with this accuracy is the one we keep
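One piece the write-up does not show is how the model, loss function and optimizer are set up before the training loop. Below is a minimal sketch consistent with the snippets above; the class name `LSTMTagger` comes from the reconstruction of the model code, while the SGD settings and the `dataloaders`/`resultloaders` dictionaries are assumptions. Using `nn.CrossEntropyLoss` on the (already log-softmaxed) outputs with one-hot targets matches the loss value printed in the quick test, since log-probabilities are unchanged by a second log-softmax, but the exact choice of loss is also an assumption.

```python
EMBEDDING_DIM = 128    # from the task description
HIDDEN_DIM = 256       # from the task description
VOCAB_SIZE = 5000 + 1  # top-5000 word indices plus the placeholder index 5000
NUM_CLASSES = 4        # TweetEval emotion: anger, joy, optimism, sadness

# Assumed model/loss/optimizer setup
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, VOCAB_SIZE, NUM_CLASSES)
criterion = nn.CrossEntropyLoss()  # with one-hot targets this reproduces the printed loss
loss_function = criterion          # the quick-test snippet uses this name
optimizer = optim.SGD(model.parameters(), lr=0.1)

# The training loop iterates over plain lists of sentences and labels
dataloaders = {'train': train_dataset, 'test': test_dataset}
resultloaders = {'train': train_label_pd[0].tolist(), 'test': test_label_pd[0].tolist()}
num_epochs = 30  # matches the "Epoch 0/29" log above
```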
# Train for num_epochs epochs
for epoch in range(num_epochs):
    print(f'Epoch {epoch}/{num_epochs - 1}')
    print('-' * 10)

    # Each epoch has a training and validation phase
    for phase in ['train', 'test']:
        if phase == 'train':
            model.train()
        else:
            model.eval()

        running_loss = 0.0
        running_corrects = 0

        # Iterate over the data.
        for input, label in zip(dataloaders[phase], resultloaders[phase]):
            # ===== !!! Here !!! ======
            # Use the functions built in Tasks 2+3 to convert the sentence into
            # index form and the label into a vector.
            # e.g. tensor([1070, 340, 2015, 2016, 45, 2017])
            inputs_vector = prepare_sentence_sequence(input, word_to_ix)
            # e.g. tensor([1., 0., 0., 0.])
            labels_vector = one_hot_encode(label, tag_to_ix)
            # ===== !!! End !!! ======

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            # track history only if in train
            with torch.set_grad_enabled(phase == 'train'):
                # Same as in the quick test above:
                # get the prediction scores for every emotion
                # (e.g. tensor([[-1.3948, -1.4476, -1.3804, -1.3261]]))
                outputs = model(inputs_vector)

                # ===== !!! Here !!! ======
                # Take the index of the highest score as the prediction (e.g. 2)
                pred = torch.argmax(outputs).item()
                # outputs has an extra outer dimension, so compute the loss between the inner
                # vector [-1.3948, -1.4476, -1.3804, -1.3261] and e.g. [0, 0, 1, 0]
                loss = criterion(outputs[0], labels_vector)
                # ===== !!! End !!! ======

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

            # statistics
            running_loss += loss.item()
            if pred == label:
                running_corrects += 1
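The bookkeeping that produces the per-epoch log lines and keeps the best model is not shown above. A minimal sketch of what it could look like, with the setup lines placed once before the epoch loop and the statistics placed inside the `for phase in ['train', 'test']:` loop right after the batch loop; the names `since`, `best_acc` and `best_model_wts` are assumptions:

```python
import copy

# Assumed bookkeeping, initialised once before the epoch loop:
since = time.time()
best_acc = 0.0
best_model_wts = copy.deepcopy(model.state_dict())

# Inside the phase loop, right after the batch loop:
epoch_loss = running_loss / len(dataloaders[phase])
epoch_acc = running_corrects / len(dataloaders[phase])
print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f} '
      f'Time elapsed: {int(time.time() - since)} sec.')

# Keep a copy of the weights whenever the test accuracy improves
if phase == 'test' and epoch_acc > best_acc:
    best_acc = epoch_acc
    best_model_wts = copy.deepcopy(model.state_dict())

# After all epochs:
time_total = time.time() - since
print(f'Training complete in {int(time_total // 60)}m {int(time_total % 60)}s')
print(f'Best val Acc: {best_acc:4f}')
model.load_state_dict(best_model_wts)  # restore the best weights
```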
Epoch 2/29
----------
train Loss: 0.9885 Acc: 0.5840 Time elapsed: 97 sec.
test Loss: 1.1279 Acc: 0.5236 Time elapsed: 104 sec.

Epoch 3/29
----------
train Loss: 0.8893 Acc: 0.6371 Time elapsed: 132 sec.
test Loss: 1.1053 Acc: 0.5369 Time elapsed: 139 sec.

Epoch 4/29
----------
train Loss: 0.7683 Acc: 0.7003 Time elapsed: 168 sec.
test Loss: 1.0772 Acc: 0.5658 Time elapsed: 175 sec.
...
test Loss: 1.1330 Acc: 0.6059 Time elapsed: 1040 sec.

Training complete in 17m 20s
Best val Acc: 0.610134
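The second training run below should, per the last task in the list, be the GRU variant; its model definition is not included in the write-up. A minimal sketch, assuming it mirrors the LSTM classifier with `nn.GRU` swapped in:

```python
class GRUTagger(nn.Module):
    """Hypothetical GRU counterpart of the LSTM classifier above."""

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, dropout=0.0):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # Same architecture as before, with nn.GRU in place of nn.LSTM
        self.gru = nn.GRU(embedding_dim, hidden_dim, dropout=dropout)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        gru_out, _ = self.gru(embeds.view(len(sentence), 1, -1))
        # Again, only the last output feeds the classifier
        last_output = gru_out[-1].view(1, -1)
        tag_space = self.hidden2tag(last_output)
        return F.log_softmax(tag_space, dim=1)
```

Everything else (vocabulary, label encoding, and the training loop) can stay exactly the same; only the model instance changes.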
Epoch 3/29
----------
train Loss: 0.8445 Acc: 0.6702 Time elapsed: 131 sec.
test Loss: 1.1211 Acc: 0.5327 Time elapsed: 138 sec.

Epoch 4/29
----------
train Loss: 0.6843 Acc: 0.7393 Time elapsed: 166 sec.
test Loss: 1.1305 Acc: 0.5707 Time elapsed: 173 sec.
...
test Loss: 1.3237 Acc: 0.6073 Time elapsed: 1003 sec.

Training complete in 16m 43s
Best val Acc: 0.608726