Multinomial logistic regression#

The multinomial logistic regression model is a statistical model that estimates the probability of a record belonging to one of N categories, based on one or more features. The categories are mutually exclusive and cannot be ordered in any meaningful way. For instance, one could try to predict the blood type of a person based on medical diagnostic values. For more information about multinomial logistic regression, see Wikipedia.
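As a rough illustration of what the model computes (a plain-Python sketch, not crandas code): each class gets a score from the features, and the softmax function turns these scores into probabilities. The function name softmax and the example scores are made up for this illustration.

```python
import math

def softmax(scores):
    """Turn a list of per-class scores into probabilities that sum to one."""
    # Subtracting the maximum score avoids overflow and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes.
probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities)
```

The class with the highest score always receives the highest probability.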

This guide explains all of the steps to fit a multinomial logistic regression model on a shared dataset, along with their options and considerations. This is done through an example with the Iris petal dataset (more about this dataset later).

The logistic regression API in crandas follows the scikit-learn API wherever possible. It is often possible to run existing scikit-learn code with minimal modifications in crandas. That being said, there are still some differences, so it is a good idea to go through this guide once. Note that crandas specifically implements the multinomial model, not the one-vs-rest (ovr) model that scikit-learn also supports.
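To give an intuition for the difference between the two models, here is a plain-Python sketch (not crandas code). It assumes that one-vs-rest probabilities are obtained in a common way, by applying a sigmoid to each class score separately and renormalizing, while the multinomial model applies a single softmax over all scores. The function names and scores are hypothetical, and in practice the two approaches would also fit different scores.

```python
import math

def multinomial_probs(scores):
    """Joint softmax over all class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ovr_probs(scores):
    """One sigmoid per class, renormalized so the results sum to one."""
    sigmoids = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(sigmoids)
    return [s / total for s in sigmoids]

scores = [2.0, 1.0, 0.1]  # hypothetical per-class scores
multinomial = multinomial_probs(scores)
ovr = ovr_probs(scores)
print(multinomial)
print(ovr)
```

Both results sum to one and rank the classes the same way, but the probabilities themselves differ, which is why results from an ovr model are not directly comparable.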

Setup#

Before moving on with the guide, import a few modules and functions from crandas:

import crandas as cd
from crandas.crlearn.logistic_regression import LogisticRegression
from crandas.crlearn.metrics import classification_accuracy, confusion_matrix
from crandas.crlearn.utils import min_max_normalize

Reading the data#

In this example, the dataset is read from a local CSV:

tab = cd.read_csv("../../test/logreg_test_data/iris_scaled.csv")

This imports the Iris petal dataset, which contains records of three different types of flower: iris-setosa, iris-versicolor and iris-virginica, along with measured features of these flowers. This guide demonstrates how multinomial logistic regression can be used to predict the class of flower, based on these features.

The dataset looks as follows:

>>> print(tab.open().head())
   class  sepal length  sepal width  petal length  petal width
       0      0.222222     0.625000      0.067797     0.041667
       0      0.166667     0.416667      0.067797     0.041667
       0      0.111111     0.500000      0.050847     0.041667
       0      0.083333     0.458333      0.084745     0.041667
       0      0.194445     0.666667      0.067797     0.041667

Note that the features are all numerical (the sepal length, sepal width, petal length, petal width columns).

Note

See Kaggle for more information about this dataset.

Preparing the data#

Getting rid of null values#

The logistic regression can only be executed on a CDataFrame without null values (specifically, without nullable columns). If the dataset contains any missing values, one can get rid of all rows with null values using CDataFrame.dropna().

tab = tab.dropna()

An alternative to deleting the rows with null values is performing data imputation using CSeries.fillna(). However, this might introduce bias and is not recommended in the general case.
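The trade-off between the two approaches can be sketched in plain Python (not crandas), using None for a missing value. The rows and the choice of mean imputation are assumptions made for this illustration only.

```python
rows = [
    {"sepal length": 0.22, "petal width": 0.04},
    {"sepal length": None, "petal width": 0.04},  # missing value
    {"sepal length": 0.11, "petal width": 0.50},
]

# Option 1: drop every row that contains a missing value.
dropped = [r for r in rows if all(v is not None for v in r.values())]

# Option 2: impute the missing value with the mean of the observed values.
observed = [r["sepal length"] for r in rows if r["sepal length"] is not None]
mean_value = sum(observed) / len(observed)
imputed = [
    {**r, "sepal length": mean_value if r["sepal length"] is None else r["sepal length"]}
    for r in rows
]

print(len(dropped), len(imputed))
```

Dropping loses a record, while imputing keeps it but inserts a value that was never measured, which is where the bias can come from.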

Normalizing#

If the dataset contains any numerical features (e.g. sepal length in this example), these first need to be normalized to values between 0 and 1. This is commonly done through min-max normalization.

tab_normalized = min_max_normalize(tab, columns=['sepal length', 'sepal width', 'petal length', 'petal width'])

Here, columns can be used to specify which columns need to be normalized. The other columns are left untouched.

Attention

It is essential to normalize your numerical features to within [0, 1] before you fit the model. Otherwise, fitting will not work correctly, and will return erroneous results.

Negative values are also not allowed.
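For reference, min-max normalization maps each value x to (x - min) / (max - min). A minimal plain-Python sketch of this formula follows; it is not the crandas implementation, and the function name min_max_scale and the sample values are illustrative.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0 (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative sepal lengths in centimetres.
sepal_length_cm = [4.3, 5.1, 5.8, 7.9]
scaled = min_max_scale(sepal_length_cm)
print(scaled)
```

After scaling, every value lies in [0, 1], which also rules out negative values.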

Splitting into predictors, response#

Split the predictor variables from the response variable:

X = tab_normalized[['sepal length', 'sepal width', 'petal length', 'petal width']]
y = tab_normalized[['class']]

Creating the model#

The logistic regression functionality in crandas is made accessible through the LogisticRegression class, which can be used to fit the model and make predictions. The model can be created using:

model = LogisticRegression(solver='lbfgs', multi_class="multinomial", n_classes=3)

Here, the multi_class argument specifies the type of regression to be performed, in this case multinomial. The n_classes argument specifies the number of classes in the dataset.

Note

The solver argument indicates which numerical solver is used to fit the model. Currently, the available options are:

lbfgs (which stands for Limited-memory BFGS)
gd (which stands for Gradient Descent)

The lbfgs solver gives better results and fits the model faster. As such, there is normally no reason to deviate from it.

Attention

It is required to specify the number of classes in the dataset. Unlike in scikit-learn, the crandas model does not detect this automatically due to the dataset being secret-shared.

Fitting the model#

Now that the data has been prepared and the model has been created, the model can be fitted to the training set:

model.fit(X, y, max_iter=40)

Here, the max_iter argument specifies how many iterations the numerical solver should perform to fit the model. The default of 10 is sufficient in some cases, but sometimes the number needs to be increased for the model to fully converge. For the Iris dataset, 40 iterations are needed.
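To see why the number of iterations matters, the sketch below fits a softmax model with plain gradient descent on a tiny made-up dataset and compares the loss after 10 and 40 iterations. This is an illustration in plain Python, not the crandas solver: the function names, the learning rate and the data are all assumptions.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(beta, x):
    # beta[k] holds [intercept, weight_1, ..., weight_d] for class k.
    scores = [b[0] + sum(w * xi for w, xi in zip(b[1:], x)) for b in beta]
    return softmax(scores)

def mean_log_loss(beta, X, y):
    # Average negative log-probability assigned to the true class.
    return -sum(math.log(predict_proba(beta, x)[label]) for x, label in zip(X, y)) / len(X)

def fit_gd(X, y, n_classes, max_iter, learning_rate=1.0):
    n_features = len(X[0])
    beta = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(max_iter):
        grad = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(beta, x)
            for k in range(n_classes):
                error = probs[k] - (1.0 if k == label else 0.0)
                grad[k][0] += error
                for j, xj in enumerate(x):
                    grad[k][j + 1] += error * xj
        # Take one step against the averaged gradient.
        for k in range(n_classes):
            for j in range(n_features + 1):
                beta[k][j] -= learning_rate * grad[k][j] / len(X)
    return beta

# Tiny made-up dataset: two features in [0, 1], three classes.
X = [[0.0, 0.1], [0.1, 0.0], [0.5, 0.5], [0.4, 0.6], [1.0, 0.9], [0.9, 1.0]]
y = [0, 0, 1, 1, 2, 2]

loss_10 = mean_log_loss(fit_gd(X, y, 3, max_iter=10), X, y)
loss_40 = mean_log_loss(fit_gd(X, y, 3, max_iter=40), X, y)
print(loss_10, loss_40)
```

The loss after 40 iterations is lower than after 10, which is the effect that a too small max_iter hides.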

Note

Fitting a logistic regression model in crandas can take quite some time, depending on the number of records in the dataset, the number of features, and the number of iterations that you specify.

The fitted model parameters can now be accessed as follows:

beta = model.get_beta()

This returns a matrix of size n_classes by num_features + 1. Each row contains all of the terms associated with one class. The first column contains the intercept terms, while the remaining columns correspond to the features.
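Following the description above (one row per class, intercept in the first column), the sketch below shows how such a matrix turns one record into per-class scores. It is plain Python with a hypothetical parameter matrix for three classes and two features, not output of get_beta().

```python
# Hypothetical parameters: each row is [intercept, weight for feature 1, weight for feature 2].
beta = [
    [1.0, -2.0, -1.0],   # class 0
    [0.5, 0.0, 0.5],     # class 1
    [-1.5, 2.0, 1.0],    # class 2
]

record = [0.2, 0.4]  # one record with two normalized features

# Score per class: intercept plus the weighted sum of the features.
scores = [row[0] + sum(w * x for w, x in zip(row[1:], record)) for row in beta]
best_class = scores.index(max(scores))
print(scores, best_class)
```

The class with the highest score is also the class with the highest predicted probability.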

Predicting#

Now that the model has been fitted, it can be used to make predictions. crandas distinguishes two types of prediction:

probabilities: the model can predict the probability of each class being associated with the record
classes: the model can predict the class with the highest likelihood

Probabilities#

To predict the probabilities for each record in X:

y_pred_probabilities = model.predict_proba(X)

This returns a table with three columns (one for each class), containing the probability for each class. For each record, these probabilities sum to one.

Classes#

Alternatively, if you are interested in making actual class predictions rather than the probabilities, you can directly predict the classes through:

y_pred_classes = model.predict(X)

Tip

Predicting classes is a quick operation that takes significantly less time than predicting probabilities.
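Conceptually, the predicted class of a record is the class with the highest predicted probability. A plain-Python sketch with made-up probabilities (not crandas output):

```python
# Hypothetical predicted probabilities: one row per record, one column per class.
probabilities = [
    [0.90, 0.08, 0.02],
    [0.10, 0.55, 0.35],
    [0.05, 0.30, 0.65],
]

# For each record, pick the index of the largest probability.
predicted_classes = [row.index(max(row)) for row in probabilities]
print(predicted_classes)  # [0, 1, 2]
```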

Assessing prediction quality#

After fitting the model, it is important to assess the quality of the model and its predictions. crandas provides a couple of methods for doing this, namely:

Classification Accuracy
Confusion Matrix

Accuracy#

To compute the accuracy of the (class) predictions, you can use:

accuracy = classification_accuracy(y, y_pred_classes, n_classes=3)
print("Classification Accuracy:", accuracy.open())

Attention

It is required to specify the number of classes in the dataset. The function does not detect this automatically, due to the dataset being secret-shared.
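Classification accuracy is the fraction of records whose predicted class equals the true class. The plain-Python sketch below shows the calculation on made-up labels; the function name plain_accuracy is hypothetical and this is not the crandas implementation.

```python
def plain_accuracy(true_classes, predicted_classes):
    """Fraction of positions where the prediction matches the true class."""
    matches = sum(1 for t, p in zip(true_classes, predicted_classes) if t == p)
    return matches / len(true_classes)

y_true = [0, 0, 1, 1, 2, 2]  # made-up true classes
y_hat = [0, 0, 1, 2, 2, 1]   # made-up predictions, two of them wrong
example_accuracy = plain_accuracy(y_true, y_hat)
print(example_accuracy)
```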

Confusion Matrix#

The confusion matrix shows the relation between the predicted classes and the actual classes. Each row corresponds to a true class, while each column corresponds to a class predicted by the model. To compute the confusion matrix obliviously, you can use:

matrix = confusion_matrix(y, y_pred_classes, n_classes=3)
matrix.open()

[[50,  0,  0],
 [ 0, 44,  6],
 [ 0,  4, 46]]
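The matrix above can be read as follows: the diagonal holds the correctly classified records, so the accuracy is the sum of the diagonal divided by the total number of records. The plain-Python sketch below (not crandas code, with a hypothetical helper build_confusion_matrix) shows how such a matrix is built and how the accuracy follows from it.

```python
def build_confusion_matrix(true_classes, predicted_classes, n_classes):
    """Count records per (true class, predicted class) pair; rows are true classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_classes, predicted_classes):
        matrix[t][p] += 1
    return matrix

# Small made-up example: one record of class 1 is predicted as class 2.
small = build_confusion_matrix([0, 1, 1, 2], [0, 1, 2, 2], n_classes=3)
print(small)

# The matrix from this guide.
matrix = [[50, 0, 0], [0, 44, 6], [0, 4, 46]]
correct = sum(matrix[k][k] for k in range(3))
total = sum(sum(row) for row in matrix)
accuracy_from_matrix = correct / total  # 140 / 150
print(accuracy_from_matrix)
```

Here 140 of the 150 records lie on the diagonal, which corresponds to an accuracy of roughly 0.93.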