Titanic Dataset - Building a Neural Network with PyTorch + Testing for Overfitting
Introduction
Recently, I enrolled in an AI course that included an assignment to build a neural network and train it on the Titanic dataset. The task was to deliberately induce overfitting by increasing the number of hidden layers and neurons, and then mitigate the overfitting using dropout or other methods.
This article documents the process of completing the assignment.
Environment Setup and Assignment Requirements
Environment Setup:
- Python 3.10.9
- PyTorch 2.0.1
Assignment Requirements
- Write a custom dataset class for the Titanic data (see the data folder on GitHub). Use only the features: “Pclass,” “Age,” “SibSp,” “Parch,” “Fare,” “Sex,” and “Embarked.” Preprocess the features accordingly in that class (scaling, one-hot-encoding, etc.), and split the data into train and validation data (80% and 20%). The constructor of that class should look like this:
```python
titanic_train = TitanicDataSet('titanic.csv', train=True)
titanic_val = TitanicDataSet('titanic.csv', train=False)
```
- Build a neural network with one hidden layer of size 3 that predicts the survival of the passengers. Use a BCE loss (Hint: you need a sigmoid activation in the output layer). Use a data loader to train in batches of size 16 and shuffle the data.
- Evaluate the performance of the model on the validation data using accuracy as a metric.
- Create the following plot that was introduced in the lecture.
- Increase the complexity of the network by adding more layers and neurons and see if you can overfit on the training data.
- Try to remove overfitting by introducing a dropout layer.
In Simple Terms
In simple terms, we will satisfy the above requirements through the following six steps:
- Data Preprocessing
  - Task 1: Build a class and import the Titanic data.
  - Task 1: Select specific columns as training features.
  - Task 1: Preprocess the data (scaling, one-hot encoding, etc.) to convert non-numeric columns like "Sex" or "Embarked" into numeric values.
  - Task 1: Split the data into train and validation data (80% and 20%).
- Build a Neural Network
  - Task 2: Build a three-layer network (1 input layer + 1 hidden layer + 1 output layer).
  - Task 2: The size of the first hidden layer is 3.
  - Task 2: Use BCE loss as the loss function.
  - Task 2: Use sigmoid activation as the output layer's activation function.
- Model Training
  - Task 3: Train the model and record accuracy at each step.
- Generate Results
  - Task 4: Generate results and create plots.
- Create Overfitting
  - Task 5: Increase hidden layers and neurons to induce overfitting.
- Use Dropout
  - Task 6: Use dropout or other methods to mitigate the impact of overfitting.
Step 1: Data Preprocessing
Let’s start with data preprocessing:
- Task 1: Build a class and import the Titanic data.
- Task 1: Select specific columns as training features.
- Task 1: Preprocess the data (scaling, one-hot encoding, etc.) to convert non-numeric columns like "Sex" or "Embarked" into numeric values.
- Task 1: Split the data into train data and validation data (80% and 20%).
1.1 Data Preprocessing
- First, we'll import all the necessary packages.
```python
import os
# data processing
import numpy as np
import pandas as pd
from skimage import io, transform
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, utils
# plotting
import matplotlib.pyplot as plt
# neural network
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# preprocessing
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore")
plt.ion()  # interactive mode
```
- Before we start, I'd like to place all the parameters we'll use at the top for easier modification:
```python
# Shared variables
# D_in = 10: 5 numerical features + 2 one-hot columns for "Sex" + 3 one-hot columns for "Embarked"
D_in, D_out = 10, 1
num_epochs = 250
log_interval = 100
# Batch size: the amount of data for each training iteration
batch_size = 30
# Learning rate: since we will create two different networks, we set two different learning rates
learning_rate = 0.001
multi_learning_rate = 0.001
# Hidden layers
multi_num_layers = 6
# Hidden neurons: since we will create two different networks, we set two different numbers of hidden neurons
neurons = 3
multi_neurons = 1024
# Note: some of these values are overridden in the step-specific code below (e.g., n_epochs = 50, batch size 16)
```
- Next, we'll create a class based on the requirements that loads the Titanic data and returns features:
```python
class TitanicDataset(Dataset):
    # Initialization function for loading and preprocessing the data
    def __init__(self, root_dir, train=True, transform=None):
        # The train parameter indicates whether this instance serves the training split or the validation split
        self.train = train
        # The transform parameter holds an optional transformation function
        self.transform = transform
        # Create MinMaxScaler and OneHotEncoder for data preprocessing
        minmax_scaler = MinMaxScaler()
        onehot_enc = OneHotEncoder()
        # Read the Titanic data from the CSV file
        titanic = pd.read_csv(root_dir)
        # Select specific columns from the data
        titanic = titanic[["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex", "Embarked", "Survived"]]
        # Fill missing values in the "Age" column with the mean and drop the remaining rows with missing values
        titanic["Age"] = titanic["Age"].fillna(titanic["Age"].mean())
        titanic = titanic.dropna()
        titanic = titanic.reset_index(drop=True)
        # Split the data into categorical features, numerical features, and labels
        categorical_features = titanic[titanic.select_dtypes(include=['object']).columns.tolist()]
        numerical_features = titanic[titanic.select_dtypes(exclude=['object']).columns].drop('Survived', axis=1)
        label_features = titanic['Survived']
        # Normalize the numerical features (MinMax scaling)
        numerical_features_arr = minmax_scaler.fit_transform(numerical_features)
        # One-hot encode the categorical features
        categorical_features_arr = onehot_enc.fit_transform(categorical_features).toarray()
        # Merge the scaled numerical features and one-hot encoded categorical features into one dataset
        combined_features = pd.DataFrame(data=numerical_features_arr, columns=numerical_features.columns)
        combined_features = pd.concat([combined_features, pd.DataFrame(data=categorical_features_arr)], axis=1)
        combined_features = pd.concat([combined_features, label_features], axis=1).reset_index(drop=True)
        # Split the dataset into training and validation sets (80% / 20%)
        train_data, test_data = train_test_split(combined_features, test_size=0.2, random_state=42)
        # Choose the split to use depending on the mode
        if train:
            self.data = train_data
        else:
            self.data = test_data

    # Return the length of the dataset
    def __len__(self):
        return len(self.data)

    # Return one sample (features and label) for training the neural network
    def __getitem__(self, idx):
        # Get the idx-th row from the self.data DataFrame
        sample = self.data.iloc[idx]
        # Convert the features and label to PyTorch float tensors
        features = torch.FloatTensor(sample[:-1])
        label = torch.FloatTensor([sample['Survived']])
        if self.transform:
            features = self.transform(features)
        return features, label

    # Return the entire dataset as a DataFrame
    def getData(self):
        return self.data
```
- After writing the class, you can use the following commands to test it:
```python
titanic_train = TitanicDataset('./data/titanic.csv', train=True)
titanic_val = TitanicDataset('./data/titanic.csv', train=False)
print('train_dataset len:', len(titanic_train))
print('val_dataset len:', len(titanic_val))
print('total_dataset len:', len(titanic_train) + len(titanic_val))
# Output
'''
train_dataset len: 711
val_dataset len: 178
total_dataset len: 889
'''
```
- Or use the following code to print the underlying DataFrame:
```python
titanic_val.getData()
```
Step 2: Building the Neural Network
- Next, we will construct the neural network, which primarily involves the following:
  - `__init__`: Creates a three-layer network (1 input layer + 1 hidden layer + 1 output layer).
    - `D_in`: Number of neurons in the input layer.
    - `H`: Number of neurons in the hidden layer.
    - `D_out`: Number of neurons in the output layer.
  - `forward`: Performs the forward pass: the linear transformation of the first layer followed by the `relu` activation function, the linear transformation of the second layer followed by the `sigmoid` activation function, and finally returns the prediction.
```python
class TwoLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as member variables.
        """
        super(TwoLayerNet, self).__init__()
        # The weights and biases of linear1 and linear2 are initialized automatically;
        # you can access them via self.linear1.weight and self.linear1.bias
        self.linear1 = nn.Linear(D_in, H)   # creates weight and bias for linear1
        self.linear2 = nn.Linear(H, D_out)  # creates weight and bias for linear2
        self.sigmoid = nn.Sigmoid()         # sigmoid activation for binary classification

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return a Tensor of output data.
        We can use Modules defined in the constructor as well as arbitrary operators on Tensors.
        """
        h_relu = F.relu(self.linear1(x))
        y_pred = self.sigmoid(self.linear2(h_relu))
        return y_pred
```
- Before training the model, we need to set up the training components. The following code sets the batch size to 16, which means we train on 16 samples at a time and pass over the entire training split (711 samples) in each epoch. The input layer has 10 neurons, the hidden layer has 3 neurons, the output layer has 1 neuron, and the learning rate is set to 0.001. We train for a total of 50 epochs.
- We instantiate TwoLayerNet to construct the network.
- We use Adam as the optimizer for gradient descent updates.
- We use Binary Cross-Entropy Loss as the loss function.
```python
N, D_in, H, D_out = 16, 10, 3, 1  # N is the batch size
lr = 0.001
n_epochs = 50
log_interval = 100  # Print the training status every log_interval batches

network = TwoLayerNet(D_in, H, D_out)  # H=3 for one hidden layer with 3 neurons
optimizer = optim.Adam(network.parameters(), lr)
criterion = nn.BCELoss()  # Define the loss function as Binary Cross-Entropy Loss
```
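The training code below iterates over `train_dataloader` and `test_dataloader`, whose construction is not shown in the snippets above. A minimal sketch, assuming the datasets from Step 1 and the batch size of 16 with shuffling that the assignment asks for:
```python
# Minimal sketch (assumed setup): wrap the datasets from Step 1 in data loaders.
# The training loader shuffles the data; the validation loader does not need to.
train_dataloader = DataLoader(titanic_train, batch_size=N, shuffle=True)
test_dataloader = DataLoader(titanic_val, batch_size=N, shuffle=False)
```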
Step 3: Model Training
- You can start by creating lists to keep track of the loss and accuracy for each training loop (epoch) during the training process:
```python
train_losses = []   # Loss value of each training step during the training process
train_counter = []  # Number of samples seen so far during training
test_losses = []    # Loss value of each test loop (epoch) during the training process
test_counter = [i * len(titanic_train) for i in range(n_epochs + 1)]  # How many samples have been used for training so far
```
- Create a training function whose main purpose is to train the model on the training dataset:
```python
def train(epoch):  # epoch indicates the current epoch being run
    network.train()  # Put the network created in the previous step into training mode
    correct = 0    # Record the current number of correct predictions
    cur_count = 0  # Record how many data points have been trained on so far
    for batch_idx, (data, target) in enumerate(train_dataloader):
        optimizer.zero_grad()  # Clear the gradients at the start of each batch because they accumulate otherwise
        # Forward propagation
        output = network(data)            # Feed the data into the network for forward propagation
        loss = criterion(output, target)  # Calculate the loss
        # Accuracy
        pred = (output >= 0.5).float()    # Since the answers are either 0 or 1, we use 0.5 as the threshold: >= 0.5 is 1 and < 0.5 is 0
        correct += (pred == target).sum().item()  # Accumulate the number of correct predictions
        cur_count += len(data)            # Record how many data points have been trained on so far
        # Backward propagation
        loss.backward()   # Calculate the gradient of the loss
        optimizer.step()  # Update the parameters
        if batch_idx % log_interval == 0:  # Print the training status every log_interval batches
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy: {}/{} ({:.0f}%)'.format(
                epoch,
                cur_count,
                len(train_dataloader.dataset),
                100. * cur_count / len(train_dataloader.dataset),
                loss.item(),
                correct, len(train_dataloader.dataset),
                100. * correct / len(train_dataloader.dataset))
            )
            train_losses.append(loss.item())
            train_counter.append((batch_idx * 16) + ((epoch - 1) * len(train_dataloader.dataset)))
    # Return the accuracy over the full training set for this epoch
    return correct / len(train_dataloader.dataset)
```
- Next, construct a test function whose main purpose is to evaluate the trained model on the validation dataset and see how accurately it handles unseen data:
```python
def test():
    network.eval()  # Put the network into evaluation mode
    test_loss = 0   # Record the current loss
    correct = 0     # Record the current number of correct predictions
    with torch.no_grad():  # Gradients are not needed for evaluation, so torch.no_grad() speeds things up
        for data, target in test_dataloader:  # Get data through the test_dataloader
            # Forward propagation
            output = network(data)  # Feed the data into the trained network for forward propagation
            test_loss += criterion(output, target).item()  # Accumulate the loss
            # Accuracy
            pred = (output >= 0.5).float()  # 0.5 is the threshold
            correct += (pred == target).sum().item()  # Accumulate the number of correct predictions
    test_loss /= len(test_dataloader.dataset)  # Calculate the average loss
    test_losses.append(test_loss)  # Add the current loss to the list
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss,
        correct,
        len(test_dataloader.dataset),
        100. * correct / len(test_dataloader.dataset))
    )
    return correct / len(test_dataloader.dataset)  # Return the current accuracy
```
- Finally, we can train the model for the chosen number of epochs, checking the training status with `test()` after each epoch:
```python
test()
train_accuracy_list = []
test_accuracy_list = []
for epoch in range(1, n_epochs + 1):  # Loop over the epochs
    train_accuracy_list.append(train(epoch))  # Train the model for one epoch with the train() function
    test_accuracy_list.append(test())         # Evaluate the model with the test() function after each epoch
```
Step 4: Generate Results
Finally, we can generate the results by using the following commands:
```python
import matplotlib.pyplot as plt
```
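A minimal sketch of the accuracy plot, assuming the `train_accuracy_list` and `test_accuracy_list` collected in Step 3 and mirroring the comparison plot used later in Step 5 (the colors are arbitrary choices):
```python
# Minimal sketch (assumed plotting code): accuracy of the single-hidden-layer network per epoch.
plt.plot(train_accuracy_list, color='blue')
plt.plot(test_accuracy_list, color='red')
plt.ylim(0.5, 0.9)
plt.legend(['Train Accuracy', 'Test Accuracy'], loc='upper right')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.show()
```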
Step 5: Create Overfitting
Creating overfitting can mainly be achieved through a few methods:
- Increasing the number of epochs can lead to some overfitting.
- Increasing the number of hidden layers or increasing the neuron size can also result in some overfitting.
Since the task requires adding hidden layers and increasing the neuron counts, let's give it a try! The simplest approach is to do both, and also increase the number of epochs, to observe overfitting.
- Create a MultiLayerNet and increase the number of hidden layers and neuron size.
```python
class MultiLayerNet(nn.Module):
    def __init__(self, D_in, H, D_out, num_layers):
        # num_layers is only used by the dynamic "Advanced Version" below; here the layers are fixed
        super(MultiLayerNet, self).__init__()
        self.input = nn.Linear(D_in, H)
        self.linear1 = nn.Linear(H, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 32)
        self.linear4 = nn.Linear(32, 16)
        self.output = nn.Linear(16, D_out)
        self.sigmoid = nn.Sigmoid()  # Sigmoid activation for binary classification

    def forward(self, x):
        y_relu = F.relu(self.input(x))
        y_relu = F.relu(self.linear1(y_relu))
        y_relu = F.relu(self.linear2(y_relu))
        y_relu = F.relu(self.linear3(y_relu))
        y_relu = F.relu(self.linear4(y_relu))
        y_pred = self.sigmoid(self.output(y_relu))
        return y_pred
```
Note: If you simply add layers, you may not see much learning effect; you will just see a flat line and the accuracy won't improve! Later, a classmate found that gradually decreasing the number of neurons leads to better learning, so we set the neurons to 128, 64, 32, and 16!
A classmate said: "It's like an hourglass," filtering out unimportant information step by step and keeping only the important information!
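The training and test functions in the next step refer to `multi_network`, `multi_optimizer`, and `multi_criterion`, whose setup is not shown in the snippets above. A minimal sketch, assuming Adam with `multi_learning_rate` and BCE loss as in Step 2 (the hidden size of 128 is an assumption chosen to match the first hidden layer of the fixed MultiLayerNet above):
```python
# Minimal sketch (assumed setup) for the multi-layer network and its optimizer/loss.
multi_network = MultiLayerNet(D_in, 128, D_out, multi_num_layers)
multi_optimizer = optim.Adam(multi_network.parameters(), multi_learning_rate)
multi_criterion = nn.BCELoss()
```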
- Create new `train_multi()` and `test_multi()` functions for the multi-layer network:
```python
def train_multi(epoch):
    multi_network.train()  # Put the multi-layer network into training mode
    correct = 0
    cur_count = 0
    for batch_idx, (data, target) in enumerate(train_dataloader):
        multi_optimizer.zero_grad()
        # Forward propagation
        # Note that we use multi_network here for the forward pass
        output = multi_network(data)
        loss = multi_criterion(output, target)
        # Accuracy
        pred = (output >= 0.5).float()  # 0.5 is the threshold
        correct += (pred == target).sum().item()
        cur_count += len(data)
        # Backward propagation
        loss.backward()
        multi_optimizer.step()
        if batch_idx % log_interval == 0:
            print('Multi Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t Accuracy: {}/{} ({:.0f}%)'.format(
                epoch,
                cur_count,
                len(train_dataloader.dataset),
                100. * cur_count / len(train_dataloader.dataset),
                loss.item(),
                correct, len(train_dataloader.dataset),
                100. * correct / len(train_dataloader.dataset))
            )
            train_losses.append(loss.item())
            train_counter.append((batch_idx * 16) + ((epoch - 1) * len(train_dataloader.dataset)))
    return correct / len(train_dataloader.dataset)

def test_multi():
    multi_network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_dataloader:
            # Forward propagation
            output = multi_network(data)
            test_loss += multi_criterion(output, target).item()
            # Accuracy
            pred = (output >= 0.5).float()  # 0.5 is the threshold
            correct += (pred == target).sum().item()
    test_loss /= len(test_dataloader.dataset)
    test_losses.append(test_loss)
    print('\nMulti Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss,
        correct,
        len(test_dataloader.dataset),
        100. * correct / len(test_dataloader.dataset))
    )
    return correct / len(test_dataloader.dataset)
```
- Retrain the model:
```python
test_multi()
multi_train_accuracy_list = []
multi_test_accuracy_list = []
for epoch in range(1, n_epochs + 1):
    multi_train_accuracy_list.append(train_multi(epoch))
    multi_test_accuracy_list.append(test_multi())
```
- Draw the plot again. You can try increasing the number of epochs to 500, and you will see the overfitting phenomenon!
```python
import matplotlib.pyplot as plt
plt.plot(multi_train_accuracy_list, color='orange')
plt.plot(multi_test_accuracy_list, color='green')
plt.ylim(0.5, 0.9)
plt.legend(['Train Accuracy', 'Test Accuracy', 'Multi Train Accuracy', 'Multi Test Accuracy'], loc='upper right')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
```
Advanced Version
If you wish to dynamically adjust the number of neurons and hidden layers, you can use the following approach:
- `neurons`: The initial number of neurons. If set to 1024, it decreases from 1024 toward 16, halving each time, until the number of neurons would drop below 16.
- `num_layers`: The number of hidden layers.
```python
neurons = 1024
```
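A minimal sketch of such a dynamic network, assuming the halving scheme described above (the class name `DynamicMultiLayerNet` and the loop structure are illustrative assumptions, not the original listing):
```python
# Minimal sketch (assumed implementation): hidden widths start at `neurons` and halve down to 16,
# bounded by num_layers, e.g. 1024 -> 512 -> 256 -> 128 -> 64 -> 32 for neurons=1024, num_layers=6.
class DynamicMultiLayerNet(nn.Module):
    def __init__(self, D_in, D_out, neurons=1024, num_layers=6):
        super().__init__()
        widths = []
        w = neurons
        for _ in range(num_layers):
            if w < 16:
                break
            widths.append(w)
            w //= 2
        layers = []
        prev = D_in
        for w in widths:
            layers.append(nn.Linear(prev, w))
            prev = w
        self.hidden = nn.ModuleList(layers)   # one linear layer per hidden width
        self.output = nn.Linear(prev, D_out)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for layer in self.hidden:
            x = F.relu(layer(x))
        return self.sigmoid(self.output(x))
```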
Step 6: Use Dropout
Here we can use dropout to mitigate overfitting: it randomly turns off some neurons during the forward pass in training, which prevents the network from relying too heavily on any particular neuron.
```python
import torch.nn.functional as F
```
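A minimal sketch of the dropout variant, assuming an `nn.Dropout` is applied after each hidden activation of the fixed MultiLayerNet from Step 5 (the class name `MultiLayerDropoutNet` and the dropout probability of 0.5 are assumptions):
```python
# Minimal sketch (assumed implementation): the fixed multi-layer network with dropout
# after every hidden activation; dropout is only active in training mode.
class MultiLayerDropoutNet(nn.Module):
    def __init__(self, D_in, H, D_out, p=0.5):
        super().__init__()
        self.input = nn.Linear(D_in, H)
        self.linear1 = nn.Linear(H, 128)
        self.linear2 = nn.Linear(128, 64)
        self.linear3 = nn.Linear(64, 32)
        self.linear4 = nn.Linear(32, 16)
        self.output = nn.Linear(16, D_out)
        self.dropout = nn.Dropout(p)  # randomly zeroes activations during training
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.dropout(F.relu(self.input(x)))
        x = self.dropout(F.relu(self.linear1(x)))
        x = self.dropout(F.relu(self.linear2(x)))
        x = self.dropout(F.relu(self.linear3(x)))
        x = self.dropout(F.relu(self.linear4(x)))
        return self.sigmoid(self.output(x))
```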
At this point, you will find that the overfitting phenomenon is not so serious! Here are the results when the epoch number is set to 200.
(Figure: accuracy curves without dropout)
(Figure: accuracy curves with dropout)
Advanced
The difference in the advanced version is that each hidden layer is followed by its own dropout layer, so the number of dropout layers always matches the number of hidden layers and shrinks or grows with it.
```python
class MultiLayerNet(nn.Module):
```
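A minimal sketch of this advanced variant, assuming the dynamic halving scheme from the Advanced Version with one `nn.Dropout` per hidden layer (the class name `MultiLayerDropoutNetSketch` and p=0.5 are assumptions):
```python
# Minimal sketch (assumed implementation): dynamic hidden widths with one dropout layer per hidden layer.
class MultiLayerDropoutNetSketch(nn.Module):
    def __init__(self, D_in, D_out, neurons=1024, num_layers=6, p=0.5):
        super().__init__()
        self.hidden = nn.ModuleList()
        self.dropouts = nn.ModuleList()
        prev, w = D_in, neurons
        for _ in range(num_layers):
            if w < 16:
                break
            self.hidden.append(nn.Linear(prev, w))
            self.dropouts.append(nn.Dropout(p))  # one dropout layer per hidden layer
            prev, w = w, w // 2
        self.output = nn.Linear(prev, D_out)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        for layer, drop in zip(self.hidden, self.dropouts):
            x = drop(F.relu(layer(x)))
        return self.sigmoid(self.output(x))
```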