Titanic Dataset - Building a Neural Network with PyTorch + Testing for Overfitting
Introduction
Recently, I enrolled in an AI course that included an assignment to build a neural network and train it on the Titanic dataset. The task was to deliberately induce overfitting by increasing the number of hidden layers and neurons, and then mitigate the overfitting using dropout or other methods.
This article documents the process of completing the assignment.
Environment Setup and Assignment Requirements
Environment Setup:
- Python 3.10.9
- PyTorch 2.0.1
Assignment Requirements
- Write a custom dataset class for the Titanic data (see the data folder on GitHub). Use only the features: “Pclass,” “Age,” “SibSp,” “Parch,” “Fare,” “Sex,” and “Embarked.” Preprocess the features accordingly in that class (scaling, one-hot-encoding, etc.), and split the data into train and validation data (80% and 20%). The constructor of that class should look like this:
```python
titanic_train = TitanicDataSet('titanic.csv', train=True)
titanic_val = TitanicDataSet('titanic.csv', train=False)
```
- Build a neural network with one hidden layer of size 3 that predicts the survival of the passengers. Use a BCE loss (Hint: you need a sigmoid activation in the output layer). Use a data loader to train in batches of size 16 and shuffle the data.
- Evaluate the performance of the model on the validation data using accuracy as a metric.
- Create the following plot that was introduced in the lecture.
- Increase the complexity of the network by adding more layers and neurons and see if you can overfit on the training data.
- Try to remove overfitting by introducing a dropout layer.
In Simple Terms
In simple terms, we will satisfy the above requirements through the following six steps:
- Data Preprocessing
  - Task 1: Build a class and import the Titanic data.
  - Task 1: Select specific columns as training features.
  - Task 1: Preprocess the data (scaling, one-hot encoding, etc.) to convert non-numeric columns like "Sex" or "Embarked" into numeric values.
  - Task 1: Split the data into train and validation data (80% and 20%).
- Build a Neural Network
  - Task 2: Build a three-layer network (1 input layer + 1 hidden layer + 1 output layer).
  - Task 2: The size of the first hidden layer is 3.
  - Task 2: Use BCE loss as the loss function.
  - Task 2: Use sigmoid activation as the output layer's activation function.
- Model Training
  - Task 3: Train the model and record accuracy at each step.
- Generate Results
  - Task 4: Generate results and create plots.
- Create Overfitting
  - Task 5: Increase hidden layers and neurons to induce overfitting.
- Use Dropout
  - Task 6: Use dropout or other methods to mitigate the impact of overfitting.
Step 1: Data Preprocessing
Let’s start with data preprocessing:
- Task 1: Build a class and import the Titanic data.
- Task 1: Select specific columns as training features.
- Task 1: Preprocess the data (scaling, one-hot encoding, etc.) to convert non-numeric columns like "Sex" or "Embarked" into numeric values.
- Task 1: Split the data into train data and validation data (80% and 20%).
1.1 Data Preprocessing
- First, we'll import all the necessary packages.
```python
import os
# data processing
import numpy as np
import pandas as pd
from skimage import io, transform
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
# plotting
import matplotlib.pyplot as plt
# neural network
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# preprocessing
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
plt.ion()  # interactive mode
```
- Before we start, I'd like to place all the parameters we'll use at the top for easier modification:
```python
# Shared variables
# D_in = 10: 5 numerical features + 2 one-hot columns for "Sex" + 3 one-hot columns for "Embarked"
D_in, D_out = 10, 1
num_epochs = 250
log_interval = 100
# Batch size: the amount of data for each training iteration
batch_size = 30
# Learning rate: since we will create two different networks, we set two different learning rates
learning_rate = 0.001
multi_learning_rate = 0.001
# Hidden layers
multi_num_layers = 6
# Hidden neurons: since we will create two different networks, we set two different numbers of hidden neurons
neurons = 3
multi_neurons = 1024
# Note: some of these values are overridden in the step-specific code below (e.g., n_epochs = 50, batch size 16)
```
- Next, we'll create a class based on the requirements that loads the Titanic data and returns features:
```python
class TitanicDataset(Dataset):
    # Initialization function for loading and preprocessing the data
    def __init__(self, root_dir, train=True, transform=None):
        # The train parameter indicates whether this instance serves the training split or the validation split
        self.train = train
        # The transform parameter holds an optional transformation function
        self.transform = transform
        # Create MinMaxScaler and OneHotEncoder for data preprocessing
        minmax_scaler = MinMaxScaler()
        onehot_enc = OneHotEncoder()
        # Read the Titanic data from the CSV file
        titanic = pd.read_csv(root_dir)
        # Select specific columns from the data
        titanic = titanic[["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex", "Embarked", "Survived"]]
        # Fill missing values in the "Age" column with the mean and drop the remaining rows with missing values
        titanic["Age"] = titanic["Age"].fillna(titanic["Age"].mean())
        titanic = titanic.dropna()
        titanic = titanic.reset_index(drop=True)
        # Split the data into categorical features, numerical features, and labels
        categorical_features = titanic[titanic.select_dtypes(include=['object']).columns.tolist()]
        numerical_features = titanic[titanic.select_dtypes(exclude=['object']).columns].drop('Survived', axis=1)
        label_features = titanic['Survived']
        # Normalize the numerical features (MinMax scaling)
        numerical_features_arr = minmax_scaler.fit_transform(numerical_features)
        # One-hot encode the categorical features
        categorical_features_arr = onehot_enc.fit_transform(categorical_features).toarray()
        # Merge the scaled numerical features and one-hot encoded categorical features into one dataset
        combined_features = pd.DataFrame(data=numerical_features_arr, columns=numerical_features.columns)
        combined_features = pd.concat([combined_features, pd.DataFrame(data=categorical_features_arr)], axis=1)
        combined_features = pd.concat([combined_features, label_features], axis=1).reset_index(drop=True)
        # Split the dataset into training and validation sets (80% / 20%)
        train_data, test_data = train_test_split(combined_features, test_size=0.2, random_state=42)
        # Choose the split to use depending on the mode
        if train:
            self.data = train_data
        else:
            self.data = test_data

    # Return the length of the dataset
    def __len__(self):
        return len(self.data)

    # Return one sample (features and label) for training the neural network
    def __getitem__(self, idx):
        # Get the idx-th row from the self.data DataFrame
        sample = self.data.iloc[idx]
        # Convert the features and label to PyTorch float tensors
        features = torch.FloatTensor(sample[:-1])
        label = torch.FloatTensor([sample['Survived']])
        if self.transform:
            features = self.transform(features)
        return features, label

    # Return the entire dataset as a DataFrame
    def getData(self):
        return self.data
```
- After writing the class, you can use the following commands to test it:
```python
titanic_train = TitanicDataset('./data/titanic.csv', train=True)
titanic_val = TitanicDataset('./data/titanic.csv', train=False)
print('train_dataset len:', len(titanic_train))
print('val_dataset len:', len(titanic_val))
print('total_dataset len:', len(titanic_train) + len(titanic_val))
# Output
'''
train_dataset len: 711
val_dataset len: 178
total_dataset len: 889
'''
```
- Or use the following code to print the underlying DataFrame:
```python
titanic_val.getData()
```
Step 2: Building the Neural Network
- Next, we will construct the neural network, which primarily involves the following:
  - `__init__`: Creates a three-layer network (1 input layer + 1 hidden layer + 1 output layer).
    - `D_in`: Number of neurons in the input layer.
    - `H`: Number of neurons in the hidden layer.
    - `D_out`: Number of neurons in the output layer.
  - `forward`: Performs the forward pass: the linear transformation of the first layer followed by the `relu` activation function, the linear transformation of the second layer followed by the `sigmoid` activation function, and finally returns the prediction.
```python
class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as member variables.
        """
        super(TwoLayerNet, self).__init__()
        # The weights and biases of linear1 and linear2 are initialized automatically;
        # you can access them via self.linear1.weight and self.linear1.bias
        self.linear1 = nn.Linear(D_in, H)   # creates weight and bias for linear1
        self.linear2 = nn.Linear(H, D_out)  # creates weight and bias for linear2
        self.sigmoid = nn.Sigmoid()         # sigmoid activation for binary classification

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return a Tensor of output data.
        We can use Modules defined in the constructor as well as arbitrary operators on Tensors.
        """
        h_relu = F.relu(self.linear1(x))
        y_pred = self.sigmoid(self.linear2(h_relu))
        return y_pred
```
- Before training the model, we need to set up the training components. The following code sets the batch size to 16, which means we train on 16 samples at a time and pass over the entire training split (711 samples) in each epoch. The input layer has 10 neurons, the hidden layer has 3 neurons, the output layer has 1 neuron, and the learning rate is set to 0.001. We train for a total of 50 epochs.
- We instantiate TwoLayerNet to construct the network.
- We use Adam as the optimizer for gradient descent updates.
- We use Binary Cross-Entropy Loss as the loss function.
```python
N, D_in, H, D_out = 16, 10, 3, 1  # N is the batch size
lr = 0.001
n_epochs = 50
log_interval = 100  # Print the training status every log_interval batches

network = TwoLayerNet(D_in, H, D_out)  # H=3 for one hidden layer with 3 neurons
optimizer = optim.Adam(network.parameters(), lr)
criterion = nn.BCELoss()  # Define the loss function as Binary Cross-Entropy Loss
```
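The training code below iterates over `train_dataloader` and `test_dataloader`, whose construction is not shown in the snippets above. A minimal sketch, assuming the datasets from Step 1 and the batch size of 16 with shuffling that the assignment asks for:
```python
# Minimal sketch (assumed setup): wrap the datasets from Step 1 in data loaders.
# The training loader shuffles the data; the validation loader does not need to.
train_dataloader = DataLoader(titanic_train, batch_size=N, shuffle=True)
test_dataloader = DataLoader(titanic_val, batch_size=N, shuffle=False)
```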
Step 3: Model Training
- You can start by creating lists to keep track of the loss and accuracy for each training loop (epoch) during the training process:
```python
train_losses = []   # Loss value of each training step during the training process
train_counter = []  # Number of samples seen so far during training
test_losses = []    # Loss value of each test loop (epoch) during the training process
test_counter = [i * len(titanic_train) for i in range(n_epochs + 1)]  # How many samples have been used for training so far
```
- Create a training function whose main purpose is to train the model on the training dataset:
```python
def train(epoch):  # epoch indicates the current epoch being run
    network.train()  # Put the network created in the previous step into training mode
    correct = 0    # Record the current number of correct predictions
    cur_count = 0  # Record how many data points have been trained on so far
    for batch_idx, (data, target) in enumerate(train_dataloader):
        optimizer.zero_grad()  # Clear the gradients at the start of each batch because they accumulate otherwise
        # Forward propagation
        output = network(data)            # Feed the data into the network for forward propagation
        loss = criterion(output, target)  # Calculate the loss
        # Accuracy
        pred = (output >= 0.5).float()    # Since the answers are either 0 or 1, we use 0.5 as the threshold: >= 0.5 is 1 and < 0.5 is 0
        correct += (pred == target).sum().item()  # Accumulate the number of correct predictions
        cur_count += len(data)            # Record how many data points have been trained on so far
        # Backward propagation
        loss.backward()   # Calculate the gradient of the loss
        optimizer.step()  # Update the parameters
        if batch_idx % log_interval == 0:  # Print the training status every log_interval batches
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy: {}/{} ({:.0f}%)'.format(
                epoch,
                cur_count,
                len(train_dataloader.dataset),
                100. * cur_count / len(train_dataloader.dataset),
                loss.item(),
                correct, len(train_dataloader.dataset),
                100. * correct / len(train_dataloader.dataset))
            )
            train_losses.append(loss.item())
            train_counter.append((batch_idx * 16) + ((epoch - 1) * len(train_dataloader.dataset)))
    # Return the accuracy over the full training set for this epoch
    return correct / len(train_dataloader.dataset)
```
- Next, construct a test function whose main purpose is to evaluate the trained model on the validation dataset and see how accurately it handles unseen data:
```python
def test():
    network.eval()  # Put the network into evaluation mode
    test_loss = 0   # Record the current loss
    correct = 0     # Record the current number of correct predictions
    with torch.no_grad():  # Gradients are not needed for evaluation, so torch.no_grad() speeds things up
        for data, target in test_dataloader:  # Get data through the test_dataloader
            # Forward propagation
            output = network(data)  # Feed the data into the trained network for forward propagation
            test_loss += criterion(output, target).item()  # Accumulate the loss
            # Accuracy
            pred = (output >= 0.5).float()  # 0.5 is the threshold
            correct += (pred == target).sum().item()  # Accumulate the number of correct predictions
    test_loss /= len(test_dataloader.dataset)  # Calculate the average loss
    test_losses.append(test_loss)  # Add the current loss to the list
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss,
        correct,
        len(test_dataloader.dataset),
        100. * correct / len(test_dataloader.dataset))
    )
    return correct / len(test_dataloader.dataset)  # Return the current accuracy
```
- Finally, we can train the model for the chosen number of epochs, checking the training status with `test()` after each epoch:
```python
test()
train_accuracy_list = []
test_accuracy_list = []
for epoch in range(1, n_epochs + 1):  # Loop over the epochs
    train_accuracy_list.append(train(epoch))  # Train the model for one epoch with the train() function
    test_accuracy_list.append(test())         # Evaluate the model with the test() function after each epoch
```
Step 4: Generate Results
Finally, we can generate the results by using the following commands:
```python
import matplotlib.pyplot as plt
```
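A minimal sketch of the accuracy plot, assuming the `train_accuracy_list` and `test_accuracy_list` collected in Step 3 and mirroring the comparison plot used later in Step 5 (the colors are arbitrary choices):
```python
# Minimal sketch (assumed plotting code): accuracy of the single-hidden-layer network per epoch.
plt.plot(train_accuracy_list, color='blue')
plt.plot(test_accuracy_list, color='red')
plt.ylim(0.5, 0.9)
plt.legend(['Train Accuracy', 'Test Accuracy'], loc='upper right')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
```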
Step 5: Create Overfitting
Creating overfitting can mainly be achieved through a few methods:
- Increasing the number of epochs can lead to some overfitting.
- Increasing the number of hidden layers or increasing the neuron size can also result in some overfitting.
Since the task requires adding hidden layers and increasing the neuron counts, let's give it a try! The simplest approach is to do both, and also increase the number of epochs, to observe overfitting.
- Create a MultiLayerNet and increase the number of hidden layers and neuron size.
```python
class MultiLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out, num_layers):
        # num_layers is only used by the dynamic "Advanced Version" below; here the layers are fixed
        super(MultiLayerNet, self).__init__()
        self.input = nn.Linear(D_in, H)
        self.linear1 = nn.Linear(H, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 32)
        self.linear4 = nn.Linear(32, 16)
        self.output = nn.Linear(16, D_out)
        self.sigmoid = nn.Sigmoid()  # Sigmoid activation for binary classification

    def forward(self, x):
        y_relu = F.relu(self.input(x))
        y_relu = F.relu(self.linear1(y_relu))
        y_relu = F.relu(self.linear2(y_relu))
        y_relu = F.relu(self.linear3(y_relu))
        y_relu = F.relu(self.linear4(y_relu))
        y_pred = self.sigmoid(self.output(y_relu))
        return y_pred
```
Note: If you simply add layers, you may not see much learning effect; you will just see a flat line and the accuracy won't improve! Later, a classmate found that gradually decreasing the number of neurons leads to better learning, so we set the neurons to 128, 64, 32, and 16!
A classmate said: "It's like an hourglass," filtering out unimportant information step by step and keeping only the important information!
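The training and test functions in the next step refer to `multi_network`, `multi_optimizer`, and `multi_criterion`, whose setup is not shown in the snippets above. A minimal sketch, assuming Adam with `multi_learning_rate` and BCE loss as in Step 2 (the hidden size of 128 is an assumption chosen to match the first hidden layer of the fixed MultiLayerNet above):
```python
# Minimal sketch (assumed setup) for the multi-layer network and its optimizer/loss.
multi_network = MultiLayerNet(D_in, 128, D_out, multi_num_layers)
multi_optimizer = optim.Adam(multi_network.parameters(), multi_learning_rate)
multi_criterion = nn.BCELoss()
```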
- Create new `train_multi()` and `test_multi()` functions for the multi-layer network:
```python
def train_multi(epoch):
    multi_network.train()  # Put the multi-layer network into training mode
    correct = 0
    cur_count = 0
    for batch_idx, (data, target) in enumerate(train_dataloader):
        multi_optimizer.zero_grad()
        # Forward propagation
        # Note that we use multi_network here for the forward pass
        output = multi_network(data)
        loss = multi_criterion(output, target)
        # Accuracy
        pred = (output >= 0.5).float()  # 0.5 is the threshold
        correct += (pred == target).sum().item()
        cur_count += len(data)
        # Backward propagation
        loss.backward()
        multi_optimizer.step()
        if batch_idx % log_interval == 0:
            print('Multi Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy: {}/{} ({:.0f}%)'.format(
                epoch,
                cur_count,
                len(train_dataloader.dataset),
                100. * cur_count / len(train_dataloader.dataset),
                loss.item(),
                correct, len(train_dataloader.dataset),
                100. * correct / len(train_dataloader.dataset))
            )
            train_losses.append(loss.item())
            train_counter.append((batch_idx * 16) + ((epoch - 1) * len(train_dataloader.dataset)))
    return correct / len(train_dataloader.dataset)

def test_multi():
    multi_network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_dataloader:
            # Forward propagation
            output = multi_network(data)
            test_loss += multi_criterion(output, target).item()
            # Accuracy
            pred = (output >= 0.5).float()  # 0.5 is the threshold
            correct += (pred == target).sum().item()
    test_loss /= len(test_dataloader.dataset)
    test_losses.append(test_loss)
    print('\nMulti Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss,
        correct,
        len(test_dataloader.dataset),
        100. * correct / len(test_dataloader.dataset))
    )
    return correct / len(test_dataloader.dataset)
```
- Retrain the model:
```python
test_multi()
multi_train_accuracy_list = []
multi_test_accuracy_list = []
for epoch in range(1, n_epochs + 1):
    multi_train_accuracy_list.append(train_multi(epoch))
    multi_test_accuracy_list.append(test_multi())
```
- Draw the plot again. You can try increasing the number of epochs to 500, and you will see the overfitting phenomenon!
```python
import matplotlib.pyplot as plt
plt.plot(multi_train_accuracy_list, color='orange')
plt.plot(multi_test_accuracy_list, color='green')
plt.ylim(0.5, 0.9)
plt.legend(['Train Accuracy', 'Test Accuracy', 'Multi Train Accuracy', 'Multi Test Accuracy'], loc='upper right')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
```
Advanced Version
If you wish to dynamically adjust the number of neurons and hidden layers, you can use the following approach:
- `neurons`: The initial number of neurons. If set to 1024, it decreases from 1024 toward 16, halving each time, until the number of neurons would drop below 16.
- `num_layers`: The number of hidden layers.
```python
neurons = 1024
```
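A minimal sketch of such a dynamic network, assuming the halving scheme described above (the class name `DynamicMultiLayerNet` and the loop structure are illustrative assumptions, not the original listing):
```python
# Minimal sketch (assumed implementation): hidden widths start at `neurons` and halve down to 16,
# bounded by num_layers, e.g. 1024 -> 512 -> 256 -> 128 -> 64 -> 32 for neurons=1024, num_layers=6.
class DynamicMultiLayerNet(nn.Module):
    def __init__(self, D_in, D_out, neurons=1024, num_layers=6):
        super().__init__()
        widths = []
        w = neurons
        for _ in range(num_layers):
            if w < 16:
                break
            widths.append(w)
            w //= 2
        layers = []
        prev = D_in
        for w in widths:
            layers.append(nn.Linear(prev, w))
            prev = w
        self.hidden = nn.ModuleList(layers)   # one linear layer per hidden width
        self.output = nn.Linear(prev, D_out)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for layer in self.hidden:
            x = F.relu(layer(x))
        return self.sigmoid(self.output(x))
```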
Step 6: Use Dropout
Here we can use dropout to mitigate overfitting: it randomly turns off some neurons during the forward pass in training, which prevents the network from relying too heavily on any particular neuron.
```python
import torch.nn.functional as F
```
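A minimal sketch of the dropout variant, assuming an `nn.Dropout` is applied after each hidden activation of the fixed MultiLayerNet from Step 5 (the class name `MultiLayerDropoutNet` and the dropout probability of 0.5 are assumptions):
```python
# Minimal sketch (assumed implementation): the fixed multi-layer network with dropout
# after every hidden activation; dropout is only active in training mode.
class MultiLayerDropoutNet(nn.Module):
    def __init__(self, D_in, H, D_out, p=0.5):
        super().__init__()
        self.input = nn.Linear(D_in, H)
        self.linear1 = nn.Linear(H, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 32)
        self.linear4 = nn.Linear(32, 16)
        self.output = nn.Linear(16, D_out)
        self.dropout = nn.Dropout(p)  # randomly zeroes activations during training
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.dropout(F.relu(self.input(x)))
        x = self.dropout(F.relu(self.linear1(x)))
        x = self.dropout(F.relu(self.linear2(x)))
        x = self.dropout(F.relu(self.linear3(x)))
        x = self.dropout(F.relu(self.linear4(x)))
        return self.sigmoid(self.output(x))
```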
At this point, you will find that the overfitting phenomenon is not so serious! Here are the results when the epoch number is set to 200.
(Figure: accuracy curves without dropout)
(Figure: accuracy curves with dropout)
Advanced
The difference in the advanced version is that each hidden layer is followed by its own dropout layer, so the number of dropout layers always matches the number of hidden layers and shrinks or grows with it.
```python
class MultiLayerNet(nn.Module):
```
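A minimal sketch of this advanced variant, assuming the dynamic halving scheme from the Advanced Version with one `nn.Dropout` per hidden layer (the class name `MultiLayerDropoutNetSketch` and p=0.5 are assumptions):
```python
# Minimal sketch (assumed implementation): dynamic hidden widths with one dropout layer per hidden layer.
class MultiLayerDropoutNetSketch(nn.Module):
    def __init__(self, D_in, D_out, neurons=1024, num_layers=6, p=0.5):
        super().__init__()
        self.hidden = nn.ModuleList()
        self.dropouts = nn.ModuleList()
        prev, w = D_in, neurons
        for _ in range(num_layers):
            if w < 16:
                break
            self.hidden.append(nn.Linear(prev, w))
            self.dropouts.append(nn.Dropout(p))  # one dropout layer per hidden layer
            prev, w = w, w // 2
        self.output = nn.Linear(prev, D_out)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for layer, drop in zip(self.hidden, self.dropouts):
            x = drop(F.relu(layer(x)))
        return self.sigmoid(self.output(x))
```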