Simplest way to compare Xgboost vs Pytorch vs RandomForest for multi-class classification

Jan 2, 2023

Introduction

This post shows a compact way to build and compare multi-class classification models using three widely used approaches: XGBoost, RandomForest and a PyTorch deep neural network.

For this kind of work, we will use msdlib, an open-source Python package that makes it easy to:

implement these models,
run model training and evaluation,
store plots of the result comparison,
compare results with the necessary metrics.

All in a minimal number of lines of code. An example script (slightly different from the code in this post) is available here: https://github.com/abdullah-al-masud/msdlib/blob/master/examples/train_with_data_example_multi-classification.py

Data set intro

For this walkthrough, we will use the digits data set that ships with scikit-learn and is loaded with load_digits(). It contains images of handwritten digits from 0 to 9. Each image has a resolution of 8 x 8, which becomes 64 values after flattening. So the total number of features is 64, stored as one row per sample. The data set is already organized this way when we load it from scikit-learn.

Loading necessary dependencies

# torchModel() multi-class classification example
from sklearn.ensemble import RandomForestClassifier as RFC
from xgboost import XGBClassifier as XGBC
import pandas as pd
from sklearn.datasets import load_digits
import torch
from msdlib import mlutils
from msdlib import msd

After that, we define a path where the output results will be stored.

savepath = 'examples/train_with_data_multi-classification'

Loading data

Next, we load the data and the feature names and put them in a pandas DataFrame so that we can process them easily later.

source_data = load_digits()
feature_names = source_data['feature_names'].copy()
data = pd.DataFrame(source_data['data'], columns=feature_names)
label2index = {name: i for i, name in enumerate(source_data['target_names'])}
label = pd.Series(source_data['target']).replace(label2index)
# print(source_data['DESCR'])
print('data :\n', data.head())
print('labels :\n', label)
print('classes :', label.unique())

Here, we loaded the data using the load_digits() function, collected the feature names, and put the data in a pandas DataFrame.

We also converted the class label names into indices, which the models need for training.

Feature Standardization

Since we will be using a deep neural network, we need to standardize the features. We apply z-standardization, which subtracts each feature's mean and divides the result by that feature's standard deviation.

# Standardizing numerical data
data = msd.standardize(data)
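
To make the arithmetic concrete, here is a minimal NumPy sketch of column-wise z-standardization. It is not the msdlib implementation; the helper name z_standardize and the zero-variance guard are assumptions added for illustration. The guard matters for image data, where a pixel that has the same value in every sample would otherwise produce a division by zero.

```python
import numpy as np


def z_standardize(values):
    """Hypothetical helper: column-wise z-standardization of a 2-D array."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    # Constant columns have zero standard deviation; divide by 1 instead
    # so they become all zeros rather than NaN.
    safe_std = np.where(std == 0, 1.0, std)
    return (values - mean) / safe_std


# Tiny made-up example: the first column is constant on purpose.
sample = np.array([[0.0, 1.0, 5.0],
                   [0.0, 3.0, 5.0],
                   [0.0, 5.0, 8.0]])
scaled = z_standardize(sample)
print(scaled)
```

After this step every non-constant column has mean 0 and standard deviation 1, so no single feature dominates the gradient updates because of its scale.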

Splitting data into train, validation and test

# Splitting data set into train, validation and test
splitter = msd.SplitDataset(data, label, test_ratio=.1)
outdata = splitter.random_split(val_ratio=.1)

print("outdata.keys() :", outdata.keys())
print("outdata['train'].keys() :", outdata['train'].keys())
print("outdata['validation'].keys() :", outdata['validation'].keys())
print("outdata['test'].keys() :", outdata['test'].keys())
print("train > data, labels and index shapes :",
      outdata['train']['data'].shape, outdata['train']['label'].shape, outdata['train']['index'].shape)
print("validation > data, labels and index shapes :",
      outdata['validation']['data'].shape, outdata['validation']['label'].shape, outdata['validation']['index'].shape)
print("test > data, labels and index shapes :",
      outdata['test']['data'].shape, outdata['test']['label'].shape, outdata['test']['index'].shape)

We use a random split to divide the data into train, validation and test sets with ratios of 80%, 10% and 10% respectively. After that, we print the keys and the shapes of the data, labels and index for each set as a visual check.
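
The idea behind such a split can be sketched with the standard library alone. This is only an illustration, not how msd.SplitDataset works internally: the helper name random_split_indices, the seed argument, and the choice to treat both ratios as fractions of the full data set are assumptions.

```python
import random


def random_split_indices(n_samples, test_ratio=0.1, val_ratio=0.1, seed=0):
    """Hypothetical helper: shuffle row indices and cut them into three groups.

    Both ratios are taken as fractions of the full data set (an assumption).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    n_val = round(n_samples * val_ratio)
    return {
        'test': indices[:n_test],
        'validation': indices[n_test:n_test + n_val],
        'train': indices[n_test + n_val:],
    }


# 1797 is the number of samples in the scikit-learn digits data set.
split = random_split_indices(1797)
sizes = {name: len(idx) for name, idx in split.items()}
print(sizes)
```

Because the indices are shuffled once and then sliced, the three sets never overlap and together cover every sample.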

Defining Neural Network model in Pytorch

Now we need to define the neural network architecture. We will use PyTorch, and msdlib provides helpers for preparing deep neural network models easily, so we will rely on it here.

# defining layers inside a list
layers = mlutils.define_layers(data.shape[1], label.unique().shape[0], [100, 100, 100, 100, 100, 100], dropout_rate=.2,
                               actual_units=True, activation=torch.nn.ReLU(), model_type='regressor')

tmodel = mlutils.torchModel(layers=layers, model_type='multi-classifier',
                            savepath=savepath, batch_size=64, epoch=100, learning_rate=.0001, lr_reduce=.995)

Here, the define_layers() function returns all layers of the neural network model. The model consists only of Linear layers (Dense layers in TensorFlow). It has 6 hidden layers with 100 units each. Dropout is also introduced between the layers with a drop ratio of 0.2. ReLU is used as the activation function after each hidden layer. At the output layer there is no activation function, because model_type='regressor' is passed to define_layers(). Feel free to print layers to see the exact structure.

A detailed description of this function can be found here: https://msdlib.readthedocs.io/en/latest/mlutils.html?highlight=define_layers#msdlib.mlutils.define_layers
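
An output layer without an activation returns raw scores (logits), one per class. A common PyTorch pattern for multi-class classification is to pass those logits directly to a loss such as torch.nn.CrossEntropyLoss, which applies log-softmax internally, and to take the largest score as the predicted class. Whether msdlib does exactly this is an assumption here, not something stated above. The NumPy sketch below, with made-up logit values, shows how raw scores relate to probabilities and predictions.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up raw outputs for 2 samples and 3 classes (illustrative values only).
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])
probabilities = softmax(logits)
predicted_class = probabilities.argmax(axis=1)
print(probabilities)
print(predicted_class)
```

Softmax preserves the ordering of the scores, so taking argmax over the logits gives the same predicted class as taking it over the probabilities.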

After that, the layers are passed to the torchModel class, a wrapper that holds the complete model together with training and evaluation helper functions. We will use those later. Note that we are training for multi-class classification, so model_type is set to 'multi-classifier'. We use a batch size of 64 and train the model for 100 epochs. The learning rate starts at 0.0001 and is multiplied by 0.995 after each epoch.

A detailed description of the torchModel class can be found here: https://msdlib.readthedocs.io/en/latest/mlutils.modeling.html?highlight=torchmodel#msdlib.mlutils.modeling.torchModel
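
To get a feel for what a per-epoch multiplier of 0.995 does over 100 epochs, the schedule can be computed by hand. This is a plain-Python sketch of the arithmetic described above; the function name learning_rate_at is hypothetical and is not part of msdlib.

```python
def learning_rate_at(epoch, initial_lr=0.0001, lr_reduce=0.995):
    """Learning rate in effect after `epoch` completed epochs."""
    return initial_lr * lr_reduce ** epoch


for epoch in (0, 10, 50, 100):
    print(f"epoch {epoch:3d}: lr = {learning_rate_at(epoch):.3e}")
```

With these settings the decay is gentle: by the last epoch the learning rate has fallen to roughly 60% of its starting value.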

Gathering all models

For this test, we will use XGBoost, RandomForest and the PyTorch neural network, so we need to define them accordingly. We have already built the deep neural network; now we need to add it to the models dictionary. Please note that the dictionary key for the neural network model must contain the string 'pytorch' for all internal operations to run successfully.

models = {
    'RandomForest': RFC(),
    'Xgboost': XGBC(),
    'pytorch-DNN': tmodel
}

Training, evaluation and storing the results in designated folder

To do all of this, we just need to run one line of code.

models, predictions = mlutils.train_with_data(outdata, feature_names, models, featimp_models=['RandomForest', 'Xgboost'],
                                              figure_dir=savepath, model_type='multi-classifier', evaluate=True, figsize=(35, 5))