K-nearest neighbors#

New in version 1.9.

Introduction#

K-nearest neighbors is a prediction algorithm that bases its predictions on the closest data points in the training data. The algorithm can be used for both classification and regression, but crandas currently supports only regression. For a new data point, it finds the k nearest neighbors in the training data according to a distance function and predicts the target variable as the average of the target values of those neighbors. For more information about k-nearest neighbors, see Wikipedia.
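To make the idea concrete, here is a minimal plain-Python sketch of k-nearest neighbors regression on unencrypted data. It is an illustration only and not how crandas computes the result: the function name `knn_predict` and the toy data are made up, and Euclidean distance is assumed.

```python
def knn_predict(train_X, train_y, query, k):
    """Predict the target for `query` as the mean target of its k nearest training rows."""

    def squared_distance(a, b):
        # Squared Euclidean distance; it ranks neighbors the same way as the true distance
        return sum((u - v) ** 2 for u, v in zip(a, b))

    # Order the training rows by their distance to the query point
    order = sorted(range(len(train_X)), key=lambda i: squared_distance(train_X[i], query))
    nearest = order[:k]

    # Average the target values of the k closest rows
    return sum(train_y[i] for i in nearest) / k


# Toy data, made up for illustration
train_X = [[0, 0], [1, 1], [2, 2], [10, 10]]
train_y = [10, 12, 14, 40]

prediction = knn_predict(train_X, train_y, query=[1, 0], k=3)
print(prediction)  # 12.0: the far-away row [10, 10] is not among the 3 nearest
```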

This guide explains all of the steps needed to set up a prediction model based on a shared dataset, along with their options and considerations. It does so through an example with the weather dataset, the same one used in Linear regression.

The k-nearest neighbors API in crandas follows the scikit-learn API wherever possible. It is often possible to run existing scikit-learn code with minimal modifications in crandas. That being said, there are still some differences, so it is a good idea to go through this guide once.

Setup#

Before delving into the main guide, we need to import the necessary modules:

```python
import crandas as cd
from crandas.crlearn.neighbors import KNeighborsRegressor
from crandas.crlearn.model_selection import train_test_split
```

We start by uploading the data, which is stored in a CSV file.

```python
tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")
```

This dataset contains various weather features such as wind speed, pressure and humidity. Our goal is to predict the temperature (`Temperature (C)`) based on these variables.

Note

Currently, only integer columns can be used with k-nearest neighbors.

Splitting into train and test sets#

Next, we define the predictor and target variables and split the dataset into a training set and a test set. Below we use a test size of 3, so three rows are held out for testing.

```python
# Set the predictor variables
X = tab[[
    'Apparent Temperature (C)',
    #'Humidity', # Fixed points are not yet supported
    'Wind Speed (km/h)',
    'Wind Bearing (degrees)',
    'Visibility (km)',
    'Cloud Cover',
    'Pressure (millibars)'
]]

# Set the target variable
y = tab[['Temperature (C)']]

# Split the data into training and test sets
# You can also use the random_state variable to set a fixed seed if desired (e.g. random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3)
```
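As a rough illustration of what an integer test size and a fixed seed mean, the following plain-Python sketch splits an ordinary list of rows. It assumes scikit-learn-style semantics (shuffle, then hold out `test_size` rows) and says nothing about how crandas performs the split on shared data; the helper `split_rows` is hypothetical.

```python
import random


def split_rows(rows, test_size, random_state=None):
    """Shuffle row indices and hold out `test_size` of them as the test set."""
    rng = random.Random(random_state)  # a fixed seed makes the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    test_idx = indices[:test_size]
    train_idx = indices[test_size:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 data rows
train, test = split_rows(rows, test_size=3, random_state=42)
print(len(train), len(test))  # 7 3
```

Calling `split_rows` again with the same `random_state` returns the same split, which is what makes results repeatable between runs.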

Creating the model#

K-nearest neighbors regression is implemented in the class `KNeighborsRegressor`.

```python
# Create a model with k=5 and default settings
neigh = KNeighborsRegressor(n_neighbors=5)

# Create a model with k=10 and p=1, where p denotes the power for the Minkowski metric
# p=1 corresponds to the Manhattan distance
neigh_10 = KNeighborsRegressor(n_neighbors=10, p=1)

# It is also possible to set weights for the columns,
# indicating by what factor the differences should be multiplied
weights = cd.DataFrame(
    {
        "Apparent Temperature (C)": [0],
        "Humidity": [1],
        "Wind Speed (km/h)": [3],
        "Wind Bearing (degrees)": [1],
        "Visibility (km)": [10],
        "Cloud Cover": [1],
        "Pressure (millibars)": [1],
    },
    auto_bounds=True,
)
neigh_weights = KNeighborsRegressor(n_neighbors=10, p=1, metric_weights=weights)
```

Note

The parameter `metric_weights` takes a `CDataFrame`, so the weights must be uploaded to the VDL first, as is done above with `cd.DataFrame`.
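The effect of `p` and of per-column weights on the distance can be sketched in plain Python. This sketch assumes that each coordinate difference is multiplied by its weight before the power `p` is applied, which is one reading of "by what factor the differences should be multiplied"; the exact formula used by crandas is not confirmed here, and the function `weighted_minkowski` is hypothetical.

```python
def weighted_minkowski(a, b, weights, p):
    """Minkowski distance of order p, with each difference scaled by its column weight."""
    return sum(abs(w * (u - v)) ** p for u, v, w in zip(a, b, weights)) ** (1 / p)


# Two made-up data points with three integer features
a = [20, 15, 180]
b = [18, 10, 170]

# p=1 (Manhattan distance), all weights 1: 2 + 5 + 10
print(weighted_minkowski(a, b, [1, 1, 1], p=1))  # 17.0

# A weight of 0 ignores the first column, a weight of 3 triples the second: 0 + 15 + 10
print(weighted_minkowski(a, b, [0, 3, 1], p=1))  # 25.0

# p=2 gives the Euclidean distance
print(weighted_minkowski([0, 0], [3, 4], [1, 1], p=2))  # 5.0
```

Under this reading, a larger weight makes differences in that column count more heavily when choosing neighbors, and a weight of 0 removes the column from the distance altogether.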

Fitting the model#

The model can now be fit to the training set that we created earlier:

```python
neigh = neigh.fit(X_train, y_train)
```

Predicting#

Once the model has been fitted, it can be used to make predictions for the target variable. By calling the `predict_value` method on each data point in the test set (`X_test`), we can predict the values of our `Temperature (C)` variable.

```python
# Create predictions for y-values (temperature) based on X-values (predictor variables)
y_test_pred = []
y_test_pred.append(neigh.predict_value(X_test[0:1]).open())
y_test_pred.append(neigh.predict_value(X_test[1:2]).open())
y_test_pred.append(neigh.predict_value(X_test[2:3]).open())
```
```python
>>> y_test_pred
[19, 23, 17]
```

Note

Currently, the method `predict_value` only works on single values, so you need to call the method once per row of the test set.
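Because of this one-row-at-a-time restriction, a small helper can save repetition on larger test sets. The sketch below uses only the calls shown in this guide (one-row slicing, `predict_value` and `open`); the helper name `predict_rows` is hypothetical, and the number of rows is passed in explicitly rather than derived from the table.

```python
def predict_rows(model, X, n_rows):
    """Call model.predict_value once per row of X and collect the opened results."""
    predictions = []
    for i in range(n_rows):
        single_row = X[i:i + 1]  # one-row slice, as in the example above
        predictions.append(model.predict_value(single_row).open())
    return predictions
```

With the model and test set from this guide, `predict_rows(neigh, X_test, n_rows=3)` performs the same three calls as the example above.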