# K-nearest neighbors

New in version 1.9.

## Introduction

K-nearest neighbors is a prediction algorithm that bases its predictions on the
closest data points in the training data. It can be used for both classification
and regression, but currently only regression is supported. For a new data point,
it finds the *k* nearest neighbors according to a distance function and predicts
the target variable as the average of that variable over those neighbors. For more
information about k-nearest neighbors, see Wikipedia.
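
To make the idea concrete, here is a minimal plaintext sketch of k-nearest neighbors regression in pure Python. It is not the crandas implementation: the function name `knn_regress` and the toy data are made up for illustration, and details such as tie-breaking or rounding of the averaged result may differ in crandas.

```python
def knn_regress(train_X, train_y, query, k=3, p=2):
    """Predict the target for `query` as the mean target of its k nearest rows.

    Illustrative helper (hypothetical, not part of crandas).
    """
    # Minkowski distance of order p between the query and every training row
    distances = [
        (sum(abs(a - b) ** p for a, b in zip(row, query)) ** (1 / p), target)
        for row, target in zip(train_X, train_y)
    ]
    # Keep the k closest rows; sorted() is stable, so ties keep input order
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The prediction is the plain average of the neighbors' target values
    return sum(target for _, target in nearest) / k


# Toy data (made up): three points near the origin and one far away
train_X = [[0, 0], [1, 0], [0, 1], [10, 10]]
train_y = [10, 12, 14, 100]

prediction = knn_regress(train_X, train_y, query=[0, 0], k=3)
print(prediction)  # 12.0, the mean of 10, 12 and 14
```

The far-away point does not influence the prediction because it is not among the three nearest neighbors.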

This guide explains all of the steps needed to set up a prediction model based on a shared dataset, along with their options and considerations. This is done through an example with the weather dataset, the same one used in Linear regression.

The k-nearest neighbors API in crandas follows the scikit-learn API wherever possible. It is often possible to run existing scikit-learn code with minimal modifications in crandas. That being said, there are still some differences, so it is a good idea to go through this guide once.

## Setup

Before delving into the main guide, we need to import the necessary modules:

```
import crandas as cd
from crandas.crlearn.neighbors import KNeighborsRegressor
from crandas.crlearn.model_selection import train_test_split
```

## Reading the data

We start by uploading the data, which is stored in a CSV file.

```
tab = cd.read_csv("tutorials/data/weather_data/dummy_weather_data.csv")
```

This dataset contains various weather features such as wind speed, pressure and
humidity. Our goal is to predict the temperature (`Temperature (C)`) based on
these variables.

Note

Currently, only *integer* columns can be used with k-nearest neighbors.

## Splitting into train and test sets

Now we need to define the predictor and target variables, and then split the dataset into a training set and a test set. Below we use a test size of 3.

```
# Set the predictor variables
X = tab[[
    'Apparent Temperature (C)',
    # 'Humidity',  # Fixed points are not yet supported
    'Wind Speed (km/h)',
    'Wind Bearing (degrees)',
    'Visibility (km)',
    'Cloud Cover',
    'Pressure (millibars)',
]]
# Set the target variable
y = tab[['Temperature (C)']]
# Split the data into training and test sets.
# You can also pass random_state to set a fixed seed if desired (e.g. random_state=42).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=3)
```
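
For intuition, the following plaintext sketch shows what a split with an integer test size and a fixed seed amounts to. It assumes that, as in scikit-learn, an integer `test_size` is the absolute number of rows held out; the helper `split_rows` is hypothetical and says nothing about how crandas shuffles data internally.

```python
import random


def split_rows(rows, test_size, random_state=None):
    """Shuffle row indices and hold out `test_size` rows (hypothetical helper)."""
    rng = random.Random(random_state)  # a fixed seed makes the split reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    test_idx = indices[:test_size]
    train_idx = indices[test_size:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for ten data rows
train, test = split_rows(rows, test_size=3, random_state=42)
print(len(train), len(test))  # 7 3
```

Calling `split_rows` again with the same `random_state` returns the same partition, which is the reason to set a seed when results need to be repeatable.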

## Creating the model

K-nearest neighbors regression is implemented in the class `KNeighborsRegressor`.

```
# Create a model with k=5 and default settings
neigh = KNeighborsRegressor(n_neighbors=5)
# Create a model with k=10 and p=1, where p denotes the power for the Minkowski metric
# p=1 corresponds to the Manhattan distance
neigh_10 = KNeighborsRegressor(n_neighbors=10, p=1)
# It is also possible to set weights for the columns,
# indicating by what factor the differences should be multiplied
weights = cd.DataFrame(
    {
        "Apparent Temperature (C)": [0],
        "Humidity": [1],
        "Wind Speed (km/h)": [3],
        "Wind Bearing (degrees)": [1],
        "Visibility (km)": [10],
        "Cloud Cover": [1],
        "Pressure (millibars)": [1],
    },
    auto_bounds=True,
)
neigh_weights = KNeighborsRegressor(n_neighbors=10, p=1, metric_weights=weights)
```

Note

The parameter `metric_weights` takes a `CDataFrame`, so it must be uploaded to the VDL first.
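
The effect of `p` and of column weights can be illustrated with a small plaintext sketch. It assumes that each weight multiplies the corresponding column difference before the power `p` is applied, which is one reading of "by what factor the differences should be multiplied"; the exact formula used by crandas is not stated here, and the function `weighted_minkowski` is hypothetical.

```python
def weighted_minkowski(x, y, weights, p=2):
    """Order-p Minkowski distance with per-column weights (assumed formula)."""
    # Scale each column difference by its weight, then combine with power p
    return sum((w * abs(a - b)) ** p for a, b, w in zip(x, y, weights)) ** (1 / p)


# Two made-up rows with three columns
a = [20, 5, 180]
b = [18, 9, 170]

# p=1 is the Manhattan distance; a weight of 0 removes the first column entirely
d_manhattan = weighted_minkowski(a, b, weights=[0, 3, 1], p=1)
print(d_manhattan)  # 0*2 + 3*4 + 1*10 = 22.0

# p=2 with unit weights is the ordinary Euclidean distance
d_euclidean = weighted_minkowski([3, 4], [0, 0], weights=[1, 1], p=2)
print(d_euclidean)  # 5.0
```

Under this assumption, a weight of 0 makes a column irrelevant to the choice of neighbors, while a large weight makes differences in that column dominate.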

## Fitting the model

The model can now be fit to the training set that we created earlier:

```
neigh = neigh.fit(X_train, y_train)
```

## Predicting

Once the model has been fitted, it can be used to make predictions on the target
variable. Using the `predict_value` method on each of the data points in the test
set (`X_test`), we can predict the values for our `Temperature (C)` variable.

```
# Create predictions for y-values (temperature) based on X-values (predictor variables)
# The test set has 3 rows (test_size=3), and each row is predicted separately
y_test_pred = []
for i in range(3):
    y_test_pred.append(neigh.predict_value(X_test[i:i + 1]).open())
```

```
>>> y_test_pred
[19, 23, 17]
```

Note

Currently, the method `predict_value` only works on single values, so you need to call the method once per row of the test set.