Recently, I took an AI course, and this is the sixth assignment. The main topics covered include:
Learn to use LSTM
Use SpaCy
Homework Requirements
Train a text classification on the TweetEval emotion recognition dataset using LSTMs and GRUs.
Build an LSTM model: Follow the example described here. Use the same architecture, but:
only use the last output of the LSTM in the loss function
use an embedding dim of 128
use a hidden dim of 256.
Use SpaCy to split words: Use spaCy to split the tweets into words.
Select the Top 5000 words: Limit your vocabulary (i.e., the words that you converted to an index) to the most frequent 5000 words and replace all other words with a placeholder index (e.g., 1001).
Train the model and calculate accuracy: Evaluate the accuracy on the test set. (Note: If the training takes too long, try to use only a fraction of the training data.)
Build and train a GRU model: Do the same, but this time use GRUs instead of LSTMs.
Task 0: Download the Dataset
In this section, we need to do the following:
Download the dataset
Use pandas to convert the dataset into the format we need
Download the Dataset
Refer to this link to download the required data: TweetEval
After downloading, you will see the following structure; the emotion folder contains the data we will use this time:
.
├── README.md
├── TweetEval_Tutorial.ipynb
├── datasets
│   ├── README.txt
│   ├── emoji
│   ├── emotion                 # This is the data we need
│   │   ├── mapping.txt         # Emotion corresponding to each number, e.g. {0:'angry', 1:'happy'}
│   │   ├── test_labels.txt     # Emotion labels (the answers) for the test data, e.g. 0
│   │   ├── test_text.txt       # Content of the test data, e.g. "I'm so angry"
│   │   ├── train_labels.txt    # Emotion labels (the answers) for the training data, e.g. 0
│   │   ├── train_text.txt      # Content of the training data, e.g. "I'm so angry"
│   │   ├── val_labels.txt      # Emotion labels (the answers) for the validation data, e.g. 0
│   │   └── val_text.txt        # Content of the validation data, e.g. "I'm so angry"
...
First, import the packages we will use:

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler

# others
import numpy as np
import matplotlib.pyplot as plt
import time
import os
from PIL import Image
from tempfile import TemporaryDirectory

# dataset (leftover from the previous CNN assignment; not used here)
import torchvision
from torchvision import datasets, models, transforms
from torchvision.datasets import Flowers102

# read file
import pandas as pd

# label (leftover from the previous assignment; not used here)
from scipy.io import loadmat
import json
Next, we convert the data into the format we need. This time we use pandas to process the data and read it into variables for later use.
Make sure to change the root path to the folder path of your git clone!
# Set the relative path of each file first
root = '../../Data/tweeteval/datasets/emotion/'
mapping_file = os.path.join(root, 'mapping.txt')
test_labels_file = os.path.join(root, 'test_labels.txt')
test_text_file = os.path.join(root, 'test_text.txt')
train_labels_file = os.path.join(root, 'train_labels.txt')
train_text_file = os.path.join(root, 'train_text.txt')
val_labels_file = os.path.join(root, 'val_labels.txt')
val_text_file = os.path.join(root, 'val_text.txt')
# Use pandas to read the label files and the mapping
mapping_pd = pd.read_csv(mapping_file, sep='\t', header=None)
test_label_pd = pd.read_csv(test_labels_file, sep='\t', header=None)
train_label_pd = pd.read_csv(train_labels_file, sep='\t', header=None)
val_label_pd = pd.read_csv(val_labels_file, sep='\t', header=None)
# Split the text files on '\n' and drop the trailing empty string
# (each file ends with a newline, so the last element is empty; dropping it keeps the length consistent with the labels)
test_dataset = open(test_text_file).read().split('\n')[:-1]    # remove last empty line
train_dataset = open(train_text_file).read().split('\n')[:-1]  # remove last empty line
val_dataset = open(val_text_file).read().split('\n')[:-1]      # remove last empty line
# Print the length of the dataset
print(f'len(train_dataset)= {len(train_dataset)}')
print(f'len(train_label_pd)= {len(train_label_pd)}')
print(f'=== train_label_pd === \n{train_label_pd.value_counts()}')
print(f'len(test_dataset)= {len(test_dataset)}')
print(f'len(test_label_pd)= {len(test_label_pd)}')
print(f'=== test_label_pd === \n{test_label_pd.value_counts()}')
Task 1 + 5: Build the LSTM and GRU Models
Build an LSTM model: Follow the example described here. Use the same architecture, but:
only use the last output of the LSTM in the loss function
use an embedding dim of 128
use a hidden dim of 256.
Build and train a GRU model: Do the same, but this time use GRUs instead of LSTMs.
From the official example, we can learn how to build an LSTM model, which basically includes the following elements:
hidden_dim: The dimension of the hidden layer, representing the number of neurons in the hidden layer.
word_embeddings: Converts each word in the input sentence into word vectors.
nn.Embedding(vocab_size, embedding_dim): the embedding layer takes two arguments:
vocab_size: The size of the dictionary, i.e., the total number of words we have. In this example, we will input 5001 words: 5000 common words + 1 unrecognized word.
embedding_dim: The size of the vector that each word or symbol is mapped to. For instance, if embedding_dim is 6 and the input index vector is [1, 2, 3, 5], the embedding layer maps each of the four indices to its own 6-dimensional vector, producing a 4×6 matrix (see the shape sketch after this list).
lstm(input_size, hidden_size, dropout)
input_size: The dimension of the input, which is our word vector dimension.
hidden_size: The dimension of the hidden layer, representing the number of neurons in the hidden layer.
dropout: The dropout proportion; the default is 0, meaning no dropout is used. (Note that nn.LSTM applies dropout only between stacked layers, so it has no effect on a single-layer LSTM.)
hidden2tag(in_features, out_features)
in_features: The input dimension, which here is the hidden state dimension (hidden_dim).
out_features: The output dimension, which is the dimension of our emotion labels.
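To make the shapes concrete, here is a small standalone sketch (not part of the assignment code; the index values are made up) showing what nn.Embedding and nn.LSTM return for a single four-word sentence, using the assignment's dimensions of 128 and 256:

import torch
import torch.nn as nn

# a "sentence" of 4 word indices drawn from a 5001-word vocabulary (5000 common words + 1 placeholder)
sentence = torch.tensor([12, 7, 4999, 5000], dtype=torch.long)

embedding = nn.Embedding(num_embeddings=5001, embedding_dim=128)
lstm = nn.LSTM(input_size=128, hidden_size=256)

embeds = embedding(sentence)                            # shape (4, 128): one 128-dim vector per word
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))   # shape (4, 1, 256): (seq_len, batch, hidden_dim)
print(embeds.shape, lstm_out.shape)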
Putting these pieces together, the LSTMTagger adapted to the assignment looks like this:

class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, dropout=0.0):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        # Convert the input word indices into word vectors
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        # Convert the input words into word vectors (the sentence is already an index vector)
        embeds = self.word_embeddings(sentence)
        # Feed the word vectors to the LSTM to get the output and hidden state of the LSTM layer
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        # Take only the last output of the LSTM
        last_output = lstm_out[-1].view(1, -1)
        # Map the last LSTM output to the tag space
        tag_space = self.hidden2tag(last_output)
        # Use log_softmax to convert to (log) probabilities
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
GRU and LSTM are similar; the only lines we need to modify are the following:
class GRUTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, dropout=0.0):
        ...
        # Here !!! Change to GRU
        self.gru = nn.GRU(embedding_dim, hidden_dim, dropout=dropout)

    def forward(self, sentence):
        ...
        # Here !!! Change to GRU
        # Feed the word vectors to the GRU to get the output and hidden state of the GRU layer
        gru_out, _ = self.gru(embeds.view(len(sentence), 1, -1))
        # To satisfy the assignment requirements, take only the last output
        last_output = gru_out[-1].view(1, -1)
        ...
After making the above modifications, the GRU model is complete.
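For reference, here is a minimal sketch of the full GRUTagger, assembled from the LSTMTagger above with the two changes applied (the complete listing is not shown in the original snippets, so treat this as an assumption; it reuses the same imports):

class GRUTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size, dropout=0.0):
        super(GRUTagger, self).__init__()
        self.hidden_dim = hidden_dim
        # Convert the input word indices into word vectors
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        # The GRU takes word embeddings as inputs and outputs hidden states of size hidden_dim
        self.gru = nn.GRU(embedding_dim, hidden_dim, dropout=dropout)
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        gru_out, _ = self.gru(embeds.view(len(sentence), 1, -1))
        # Only the last output is used, as the assignment requires
        last_output = gru_out[-1].view(1, -1)
        tag_space = self.hidden2tag(last_output)
        return F.log_softmax(tag_space, dim=1)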
Task 2 + 3: Split Words Using SpaCy, Find Top 5000 Words
In Task 0 we already loaded the data into list variables, with each entry being one sentence (tweet). Now we need to do a few things:
2. Split Words Using SpaCy: Use spaCy to split the tweets into words.
3. Select Top 5000 Words: Limit your vocabulary (i.e., the words that you converted to an index) to the most frequent 5000 words and replace all other words with a placeholder index (e.g., 1001).
Install SpaCy
We need to execute the following commands to install the SpaCy package:
# If you are using Python3
pip install -U spacy

# If you are using Anaconda
conda install -c conda-forge spacy
As we are analyzing English text, we need to download the English model. Execute the following command:
python -m spacy download en_core_web_sm
Only then can we import the spacy package in the notebook and use the English model.
If the above command is not executed, you will encounter an error here!!
nlp = spacy.load("en_core_web_sm")
import spacy
from collections import Counter

# use spaCy's English model to tokenize sentences
nlp = spacy.load("en_core_web_sm")  # <=== If the above command was not executed, you will get an error here!!
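Before building the vocabulary, it helps to see what the token flags used below actually mark. Here is a quick illustrative check (the example sentence is made up):

# inspect spaCy's token attributes on a made-up example
doc = nlp("I can't wait for the weekend! #excited")
for token in doc:
    print(f"{token.text!r:12} punct={token.is_punct} stop={token.is_stop} space={token.is_space}")

# keep only "content" tokens, exactly as the vocabulary-building step below does
kept = [t.text for t in doc if not t.is_punct and not t.is_stop and not t.is_space]
print(kept)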
Prepare a Dictionary of Top 5000 Common Words
We need to identify the top 5000 common words and create a dictionary for this purpose:
First, prepare a string concatenating all sentences.
Then, send the entire string to spaCy for tokenization, filtering out punctuation (punct), stop words, and whitespace.
Use the Counter package to count the words, which facilitates identifying the top 5000 common words.
# join all the sentences together
# e.g. ['today is good', 'today is bad'] => 'today is good today is bad'
text = ' '.join(train_dataset)

# use spaCy to tokenize the text
doc = nlp(text)

# count word frequencies, filtering out punctuation, stop words and whitespace
word_freq = Counter(token.text for token in doc
                    if not token.is_punct and
                       not token.is_stop and
                       not token.is_space)
word_freq
Next, we can select the top 5000 words based on the number of times they appear:
# select the top 5000 most common words
most_common_words = word_freq.most_common(5000)

# Build a dictionary mapping words to indexes, e.g. {'hello': 0, 'like': 1, ...}
vocab = {word[0]: idx for idx, word in enumerate(most_common_words)}
Convert Sentences to Tensors
With the vocab dictionary at hand, we can now convert sentences into an index format based on this dictionary. For example:
Original sentence: I like apple
Converted into index format: [100, 3923, 123]
But what if we encounter a word that we don’t understand or is not included in the dictionary?
Here, we also need a placeholder_index.
When a word in our sentence is not in the vocab dictionary, we convert that word to the placeholder_index.
We set this as 5000, representing an unrecognizable word. For example:
# Convert words to indexes, using the placeholder index 5000 for words that are not in the vocabulary
placeholder_index = 5000

# Store the result of converting the entire dataset to indexes
indexed_dataset = []

# Iterate over the entire dataset
for tweet in train_dataset:
    # Build an empty list to store the result for the current sentence (e.g. "I like apple" -> [100, 3923, 123])
    indexed_words = []
    # Use spaCy to split the sentence into words
    for token in nlp(tweet):
        # filter out punctuation, stop words and whitespace
        if not token.is_punct and not token.is_stop and not token.is_space:
            word = token.text
            # If the word is among the top 5000 words in vocab, convert it to its index
            if word in vocab:
                indexed_words.append(vocab[word])
            # Otherwise, convert it to the index of the placeholder token
            else:
                indexed_words.append(placeholder_index)
    indexed_dataset.append(indexed_words)
Based on the above, we can wrap this code into a function, which will make it convenient to convert sentences into index format later during training:
# convert a sentence to a sequence of indexes
def prepare_sentence_sequence(seq, to_ix):
    idx = []
    # use spaCy to tokenize the sentence
    for token in nlp(seq):
        # filter out punctuation, stop words and whitespace
        if not token.is_punct and not token.is_stop and not token.is_space:
            word = token.text
            # if the token is among the top 5000 words in the vocab, add its index to the list
            if word in to_ix:
                idx.append(to_ix[word])
            else:
                # else add the index of the placeholder token
                idx.append(placeholder_index)
    return torch.tensor(idx, dtype=torch.long)  # convert the list to a tensor
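A quick usage sketch (the sentence is made up, and the exact indices depend on your training data). Note that the training code later passes a dictionary named word_to_ix; it is assumed here to simply be the vocab built above:

# assumed alias: the later code calls this dictionary word_to_ix
word_to_ix = vocab

example = "I like apples and long weekends"   # made-up sentence
print(prepare_sentence_sequence(example, word_to_ix))
# -> a 1-D LongTensor of vocabulary indices, with 5000 wherever a word falls outside the top-5000 vocab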
Convert Labels to Tensors
Next, we need to handle the labels. The labels also need to be converted into vectors so that the model's output can be compared with the correct answer:
We want to put the model's output, e.g. [0.1, 0.2, 0.3, 0.4], and the correct answer, e.g. [1, 0, 0, 0], into the loss function to compute the loss.
Therefore, we need a function that converts a label into vector form; this function is one_hot_encode.
# val is the index of the label (e.g. 2); to_ix is the label dictionary (e.g. {0:'angry', 1:'happy'})
def one_hot_encode(val, to_ix):
    # create an empty list to store the result
    result = []
    # iterate over the label dictionary
    for k, v in to_ix.items():
        # if the label index matches the current key, we found the correct label
        if val == k:
            # append 1 to the list
            result.append(1)
        else:
            # append 0 to the list if the index does not match
            result.append(0)
    return torch.tensor(result, dtype=torch.float32)  # convert the list to a tensor
With this function in place, let's read the label mapping and quickly verify that one_hot_encode works:
# Because the label is a number, we need to convert it to a vector
mapping = dict(zip(mapping_pd[0], mapping_pd[1]))
# Returns {0:'angry', 1:'happy', 2:'optimism', 3:'sadness'}
print(mapping)
print(f'ans=2; vector={one_hot_encode(2, tag_to_ix)}')
Result: We successfully converted 2 into the vector [0, 0, 1, 0]!
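One note: tag_to_ix, which is passed to one_hot_encode here and reused throughout the training code, is never defined in the snippets shown. Given that one_hot_encode iterates over its keys and len(tag_to_ix) is later used as the number of emotion classes, it is presumably just the same index-to-label dictionary as mapping, i.e. something like:

# assumed definition (not shown in the original snippets); same contents as `mapping` above
tag_to_ix = dict(zip(mapping_pd[0], mapping_pd[1]))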
Task 4: Train the Model and Calculate Accuracy
Train the model and calculate accuracy: Evaluate the accuracy on the test set. (Note: If the training takes too long, try to use only a fraction of the training data.)
Try Your Hand
Before starting to train the model, we need to first understand what our model’s input and output look like. Let’s see what the model predicts before it’s trained!
# See what the scores are before training
# Since we don't need to train here, the code is wrapped in torch.no_grad()
# (model and loss_function are assumed to be an LSTMTagger instance and a loss function created beforehand)

# Take the sentence at index 1 as an example
sentence_idx = 1
# Prints: My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs
print(f'First Sentense = {train_dataset[sentence_idx]}')

with torch.no_grad():
    # Convert the sentence into index format as a tensor
    inputs = prepare_sentence_sequence(train_dataset[sentence_idx], word_to_ix)
    print(f'Sentense to tensor = {inputs}')  # Prints: tensor([1070,  340, 2015, 2016,   45, 2017])

    # Then convert the answer to a tensor
    labels = one_hot_encode(train_label_pd[0][sentence_idx], tag_to_ix)
    print(f'Sentense of result to tensor = {labels}')  # Prints: tensor([1., 0., 0., 0.])

    # Send the inputs to the model and get the model's prediction
    outputs = model(inputs)
    print(f'tag_scores = {outputs}')  # Prints: tensor([[-1.3280, -1.4272, -1.4998, -1.3026]])

# Take the maximum probability value and get its index
_, preds = torch.max(outputs, 1)
print(f'preds = {preds}')  # Prints: preds = tensor([3])

# Take out the index of the maximum probability value
result_idx = torch.argmax(outputs).item()
print(f'result = {result_idx}, ans = {train_label_pd[0][sentence_idx]}')  # Prints: result = 3, ans = 0

# Calculate the loss to see the difference between output and label.
# outputs[0] is used because the output has an extra batch dimension
loss = loss_function(outputs[0], labels)
print(f'loss = {loss}')
Result
First Sentense = My roommate: it's okay that we can't spell because we have autocorrect. #terrible #firstworldprobs
Sentense to tensor = tensor([1070,  340, 2015, 2016,   45, 2017])
Sentense of result to tensor = tensor([1., 0., 0., 0.])
tag_scores = tensor([[-1.3280, -1.4272, -1.4998, -1.3026]])
loss = 1.32795250415802
preds = tensor([3])
result = 3, ans = 0
Looks like it’s running pretty smoothly, right?
Here we go!
For reference, the training log will look something like this:

Epoch 0/29
----------
train Loss: 1.2157 Acc: 0.4642 Time elapsed: 25 sec.
test Loss: 1.2095 Acc: 0.4553 Time elapsed: 32 sec.

Epoch 1/29
----------
train Loss: 1.1019 Acc: 0.5333 Time elapsed: 58 sec.
test Loss: 1.1816 Acc: 0.4708 Time elapsed: 65 sec.

Epoch 2/29
----------
train Loss: 1.0151 Acc: 0.5812 Time elapsed: 92 sec.
test Loss: 1.1603 Acc: 0.4898 Time elapsed: 99 sec.
...
Training complete in 17m 5s
Best val Acc: 0.599578
Does this look familiar? Yes, it does! If you have followed the article Flower102 Dataset - Training with Transfer Learning + Batch Normalization for CNN, the same kind of training is used here.
This lets us observe both the training and testing results to see whether there is any overfitting.
Even if overfitting occurs, this method still keeps the best model.
So let's get started with the train_model function; the places we need to change are marked with !!! comments:
def train_model(model, criterion, optimizer, scheduler, num_epochs=1):
    # The time when training starts
    since = time.time()

    # Create a temporary folder to store the best model
    with TemporaryDirectory() as tempdir:
        # The path where the best model is stored
        best_model_params_path = os.path.join(tempdir, 'best_model_params.pt')
        # Initially store the (untrained) model
        torch.save(model.state_dict(), best_model_params_path)
        # The current best accuracy, which will be updated whenever a better accuracy is found
        best_acc = 0.0

        # Start training for n epochs
        for epoch in range(num_epochs):
            print(f'Epoch {epoch}/{num_epochs - 1}')
            print('-' * 10)

            # Each epoch has a training and a validation phase
            for phase in ['train', 'test']:
                if phase == 'train':
                    model.train()
                else:
                    model.eval()

                running_loss = 0.0
                running_corrects = 0

                # Iterate over data.
                for input, label in zip(dataloaders[phase], resultloaders[phase]):
                    # ===== !!! Here !!! ======
                    # Use the functions created in Task 2+3 to convert sentences to indices and labels to vectors
                    # e.g., tensor([1070, 340, 2015, 2016, 45, 2017])
                    inputs_vector = prepare_sentence_sequence(input, word_to_ix)
                    # e.g., tensor([1., 0., 0., 0.])
                    labels_vector = one_hot_encode(label, tag_to_ix)
                    # ===== !!! End !!! ======

                    # zero the parameter gradients
                    optimizer.zero_grad()

                    # forward
                    # track history only if in train
                    with torch.set_grad_enabled(phase == 'train'):
                        # As in the earlier test, get the predicted score tensor for each emotion
                        outputs = model(inputs_vector)  # e.g., tensor([[-1.3948, -1.4476, -1.3804, -1.3261]])

                        # ===== !!! Here !!! ======
                        # Get the index of the maximum value
                        pred = torch.argmax(outputs).item()  # e.g., 2
                        # Only the inner layer [-1.3948, -1.4476, -1.3804, -1.3261] is compared with [0, 0, 1, 0]
                        loss = criterion(outputs[0], labels_vector)
                        # ===== !!! End !!! ======

                        # backward + optimize only if in training phase
                        if phase == 'train':
                            loss.backward()
                            optimizer.step()

                    # statistics
                    running_loss += loss.item()
                    if pred == label:
                        running_corrects += 1

                if phase == 'train':
                    scheduler.step()

                # Calculate the loss and accuracy for each epoch
                epoch_loss = running_loss / dataset_sizes[phase]
                epoch_acc = running_corrects / dataset_sizes[phase]
                print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f} Time elapsed: {round((time.time() - since))} sec.')

                # If a better accuracy is found, save the model
                if phase == 'test' and epoch_acc > best_acc:
                    best_acc = epoch_acc
                    torch.save(model.state_dict(), best_model_params_path)

            print()

        time_elapsed = time.time() - since
        print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
        print(f'Best val Acc: {best_acc:4f}')

        # Load the best model weights
        model.load_state_dict(torch.load(best_model_params_path))
    return model
You will find that there are not many places to change; at most these:
...
# the input and label conversion
inputs_vector = prepare_sentence_sequence(input, word_to_ix)
labels_vector = one_hot_encode(label, tag_to_ix)
...
# get the prediction (index of the maximum score)
pred = torch.argmax(outputs).item()
# Calculate the loss
loss = criterion(outputs[0], labels_vector)
...
Now we can start training the model!
Training
Let’s first prepare the dataset for training:
# Before we do that, let's prepare the dataset for the model to use
dataloaders = {'train': train_dataset, 'test': test_dataset}
resultloaders = {'train': train_label_pd[0].tolist(), 'test': test_label_pd[0].tolist()}
dataset_sizes = {x: len(dataloaders[x]) for x in ['train', 'test']}
Firstly, let’s build the LSTM model!
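EMBEDDING_DIM and HIDDEN_DIM are used below but never defined in the snippets shown; following the assignment requirements, they are presumably:

EMBEDDING_DIM = 128  # embedding dim required by the assignment
HIDDEN_DIM = 256     # hidden dim required by the assignment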
# Build the model
# vocab_size needs +1 because words not in the vocab are replaced with index 5000, so the embedding needs one extra slot
model_LSTM = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix)+1, len(tag_to_ix), dropout=0.5)

loss_function_LSTM = nn.CrossEntropyLoss()
optimizer_LSTM = optim.SGD(model_LSTM.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler_LSTM = lr_scheduler.StepLR(optimizer_LSTM, step_size=7, gamma=0.1)

# Start training
modelLSTM = train_model(model_LSTM, loss_function_LSTM, optimizer_LSTM, exp_lr_scheduler_LSTM, num_epochs=30)
Result
Epoch 2/29
----------
train Loss: 0.9885 Acc: 0.5840 Time elapsed: 97 sec.
test Loss: 1.1279 Acc: 0.5236 Time elapsed: 104 sec.

Epoch 3/29
----------
train Loss: 0.8893 Acc: 0.6371 Time elapsed: 132 sec.
test Loss: 1.1053 Acc: 0.5369 Time elapsed: 139 sec.

Epoch 4/29
----------
train Loss: 0.7683 Acc: 0.7003 Time elapsed: 168 sec.
test Loss: 1.0772 Acc: 0.5658 Time elapsed: 175 sec.
...
test Loss: 1.1330 Acc: 0.6059 Time elapsed: 1040 sec.

Training complete in 17m 20s
Best val Acc: 0.610134
Then let’s build the GRU model!
# vocab_size needs +1 because words not in the vocab are replaced with index 5000, so the embedding needs one extra slot
modelGRU = GRUTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix)+1, len(tag_to_ix), dropout=0.5)

loss_function_gru = nn.CrossEntropyLoss()
optimizer_gru = optim.SGD(modelGRU.parameters(), lr=0.001, momentum=0.9)
exp_lr_scheduler_gru = lr_scheduler.StepLR(optimizer_gru, step_size=7, gamma=0.1)

# Start training
modelGRU = train_model(modelGRU, loss_function_gru, optimizer_gru, exp_lr_scheduler_gru, num_epochs=30)
Result
Epoch 3/29
----------
train Loss: 0.8445 Acc: 0.6702 Time elapsed: 131 sec.
test Loss: 1.1211 Acc: 0.5327 Time elapsed: 138 sec.

Epoch 4/29
----------
train Loss: 0.6843 Acc: 0.7393 Time elapsed: 166 sec.
test Loss: 1.1305 Acc: 0.5707 Time elapsed: 173 sec.
...
test Loss: 1.3237 Acc: 0.6073 Time elapsed: 1003 sec.

Training complete in 16m 43s
Best val Acc: 0.608726