K-nearest neighbors#

New in version 1.9.

Introduction#

K-nearest neighbors is an algorithm that makes predictions based on the closest data points in the training data. It can be used for both classification and regression, but currently only regression is supported. For a new data point, it finds the k nearest neighbors according to a distance function and predicts the target variable as the average of the target values at those neighbors. For more information about k-nearest neighbors, see Wikipedia.
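
The prediction rule described above can be sketched in plain Python. This is only an illustration of the idea on unencrypted toy data, not the crandas implementation; the function name knn_predict and the sample values are hypothetical, and Euclidean distance is assumed as the distance function.

```python
import math

def knn_predict(train_X, train_y, query, k):
    """Predict the target for `query` as the mean target of its k nearest training rows."""
    # Euclidean distance from the query to every training row
    distances = [
        (math.dist(row, query), target)
        for row, target in zip(train_X, train_y)
    ]
    # Keep the k closest rows (the sort is stable, so ties keep their original order)
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(target for _, target in nearest) / k

# Toy data (hypothetical values, not taken from the weather dataset)
train_X = [(1, 1), (2, 2), (3, 3), (10, 10)]
train_y = [10, 20, 30, 100]

print(knn_predict(train_X, train_y, query=(2, 2), k=3))  # 20.0
```

The far-away point (10, 10) does not influence the prediction, because only the three closest rows are averaged.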

This guide explains all of the steps to set up a prediction model based on a shared dataset, along with their options and considerations. It does so through an example using the weather dataset, the same one used in Linear regression.

The k-nearest neighbors API in crandas follows the scikit-learn API wherever possible. It is often possible to run existing scikit-learn code with minimal modifications in crandas. That being said, there are still some differences, so it is a good idea to go through this guide once.

Setup#

Before delving into the main guide, we need to import the necessary modules:

import crandas as cd
from crandas.crlearn.neighbors import KNeighborsRegressor
from crandas.crlearn.model_selection import train_test_split

Reading the data#

We start by uploading the data, which is stored in a CSV file.

tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")

This dataset contains various weather features such as wind speed, pressure and humidity. Our goal is to predict the temperature (Temperature (C)) based on these variables.

Note

Currently, only integer columns can be used with k-nearest neighbors.
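
One possible way to work with a decimal column under this restriction is to scale it to integers before uploading. The sketch below is plain Python and only an assumption about a workable preprocessing step, not a crandas feature; the helper to_scaled_ints and the sample values are hypothetical.

```python
def to_scaled_ints(values, scale=100):
    """Convert decimal values to integers by multiplying by `scale` and rounding."""
    return [round(v * scale) for v in values]

# Hypothetical humidity fractions, expressed as whole percentages after scaling
humidity = [0.89, 0.5, 0.73]
print(to_scaled_ints(humidity))  # [89, 50, 73]
```

Keep in mind that scaling a column also changes how strongly it contributes to the distance between data points.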

Splitting into train and test sets#

Now we need to define the predictor and target variables, and then split the dataset into a training set and a test set. Below we use a test size of 3, so three rows are held out for testing.

# Set the predictor variables
X = tab[[
    'Apparent Temperature (C)',
    #'Humidity', # Fixed points are not yet supported
    'Wind Speed (km/h)',
    'Wind Bearing (degrees)',
    'Visibility (km)',
    'Cloud Cover',
    'Pressure (millibars)'
]]

# Set the target variable
y = tab[['Temperature (C)']]

# Split the data into training and test sets - you can also use the random_state variable to set a fixed seed if desired (e.g. random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3)
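
To make the meaning of an integer test size and a fixed seed concrete, here is a minimal plain-Python sketch. It does not show how crandas performs the split internally; split_rows is a hypothetical helper that only illustrates holding out a fixed number of rows reproducibly.

```python
import random

def split_rows(rows, test_size, random_state=None):
    """Shuffle row indices and hold out `test_size` rows as the test set."""
    rng = random.Random(random_state)  # a fixed seed gives a repeatable shuffle
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    test_idx = set(indices[:test_size])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test

rows = list(range(10))
train, test = split_rows(rows, test_size=3, random_state=42)
print(len(train), len(test))  # 7 3
```

Calling split_rows again with the same random_state returns the same split, which is useful when results need to be reproduced.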

Creating the model#

K-nearest neighbors regression is implemented in the class KNeighborsRegressor.

# Predict with k=5 and default settings
neigh = KNeighborsRegressor(n_neighbors=5)

# Predict with k=10 and p=1, where p denotes the power for the Minkowski metric
# p=1 corresponds to the Manhattan distance
neigh_10 = KNeighborsRegressor(n_neighbors=10, p=1)

# It is also possible to set weights for the columns,
# indicating by what factor the differences should be multiplied
weights = cd.DataFrame(
    {
        "Apparent Temperature (C)": [0],
        "Humidity": [1],
        "Wind Speed (km/h)": [3],
        "Wind Bearing (degrees)": [1],
        "Visibility (km)": [10],
        "Cloud Cover": [1],
        "Pressure (millibars)": [1],
    },
    auto_bounds=True,
)
neigh_weights = KNeighborsRegressor(n_neighbors=10, p=1, metric_weights=weights)

Note

The parameter metric_weights takes a CDataFrame, so the weights must be uploaded to the engine first, as done above with cd.DataFrame.
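
The effect of p and of column weights on the distance can be illustrated with a small plain-Python sketch. It assumes, based on the description above, that each weight multiplies the corresponding coordinate difference before the Minkowski power is applied; the exact formula used by crandas is not confirmed here. The function weighted_minkowski is hypothetical, and its default p=2 mirrors the scikit-learn default.

```python
def weighted_minkowski(a, b, weights, p=2):
    """Minkowski distance of order p, with each coordinate difference scaled by a weight."""
    total = sum(abs(w * (x - y)) ** p for x, y, w in zip(a, b, weights))
    return total ** (1 / p)

a = (0, 0, 0)
b = (3, 4, 5)

# p=1 is the Manhattan distance: 3 + 4 + 5
print(weighted_minkowski(a, b, weights=(1, 1, 1), p=1))  # 12.0
# A weight of 0 removes a column from the distance: sqrt(9 + 16)
print(weighted_minkowski(a, b, weights=(1, 1, 0), p=2))  # 5.0
# A weight of 2 doubles the first column's difference: 6 + 4 + 0
print(weighted_minkowski(a, b, weights=(2, 1, 0), p=1))  # 10.0
```

Under this assumption, a larger weight makes differences in that column count more when selecting neighbors, and a weight of 0 means the column is ignored.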

Fitting the model#

The model can now be fit to the training set that we created earlier:

neigh = neigh.fit(X_train, y_train)

Predicting#

Once the model has been fitted, it can be used to predict the target variable. Calling the predict_value method on each data point in the test set (X_test) gives the predicted values for the Temperature (C) variable.

# Create predictions for y-values (temperature) based on X-values (predictor variables)
y_test_pred = []
y_test_pred.append(neigh.predict_value(X_test[0:1]).open())
y_test_pred.append(neigh.predict_value(X_test[1:2]).open())
y_test_pred.append(neigh.predict_value(X_test[2:3]).open())

>>> y_test_pred
[19, 23, 17]

Note

Currently, the method predict_value only works on single values, so you need to call the method once per row of the test set.
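
For larger test sets, the per-row calls can be wrapped in a small helper. The sketch below is a hypothetical convenience function, not part of crandas; it assumes only what the guide already uses, namely that the table supports row slicing such as X_test[0:1] and that predict_value returns an object with an open() method. The number of rows is passed explicitly to avoid assuming anything further about the table object.

```python
def predict_rows(model, X, n_rows):
    """Call model.predict_value once per row and collect the opened results."""
    # X[i:i + 1] selects a single-row slice, as in the guide's X_test[0:1]
    return [model.predict_value(X[i:i + 1]).open() for i in range(n_rows)]
```

With the model and test set from this guide, the three calls above would then become y_test_pred = predict_rows(neigh, X_test, 3).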