Linear regression is a statistical tool used to model the linear relationship between a set of variables and a corresponding group of target values.
This guide contains the instructions to create, train, and score a multiple linear regression model. We follow the notation and structure of sklearn, a popular Python machine learning package. Default parameters are also consistent with those in sklearn.
Beyond the “standard” approach to linear regression, OLS or ordinary least squares, we have implemented Ridge regression, a different estimation technique that is especially useful whenever the independent variables are highly correlated. This property, known as multicollinearity, causes difficulties in estimating separate or unique effects of individual features. As crandas works with private data from multiple sources, it might be hard to know whether variables are related, making this method especially useful.
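To make the multicollinearity point concrete, here is a small sketch in plain NumPy (an illustration only, not the crandas implementation): with two nearly identical columns, the OLS normal equations become ill-conditioned, while ridge adds a penalty term `alpha * I` to the matrix being inverted, which stabilises the estimate. All data below is synthetic.

```python
import numpy as np

# Synthetic data: x2 is almost a copy of x1 (strong multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-3, size=200)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

def fit(X, y, alpha=0.0):
    # Closed form: beta = (X^T X + alpha * I)^(-1) X^T y
    # alpha = 0 gives ordinary least squares, alpha > 0 gives ridge
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

beta_ols = fit(X, y)              # individual coefficients may be unstable
beta_ridge = fit(X, y, alpha=1.0) # coefficients shrink toward a shared value
```

With near-duplicate columns, ridge splits the effect roughly evenly between them, whereas OLS may assign them large coefficients of opposite sign that happen to cancel.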
To demonstrate the linear regression functionality we will work with a weather dataset, finding the relationship between temperature and features like humidity and wind speed, among others.
Before delving into the main guide, we need to import the necessary modules:
import crandas as cd
from crandas.crlearn.linear_model import LinearRegression, Ridge
from crandas.crlearn.model_selection import train_test_split
from crandas.crlearn.metrics import score_r2
Reading the data
We start by uploading the data, which is stored in a CSV file:
tab = cd.read_csv("../data/weather_data/dummy_weather_data.csv")
This dataset contains various weather features like wind speed, pressure and humidity. Our goal is to model the relationship between those variables and temperature (the Temperature (C) column).
Splitting into train and test sets
Now we need to define the predictor and target variables, and then split the dataset into a training set and a test set. Below we use a test size of 0.3 (so 70% of the data for training and 30% for testing):
# Set the predictor variables
X = tab[['Apparent Temperature (C)', 'Humidity', 'Wind Speed (km/h)',
         'Wind Bearing (degrees)', 'Visibility (km)', 'Loud Cover',
         'Pressure (millibars)']]

# Set the target variable
y = tab[['Temperature (C)']]

# Split the data into training and test sets - you can also use the
# random_state parameter to set a fixed seed if desired (e.g. random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
This step ensures that we have a separate set of data to evaluate the performance of our model after training.
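Conceptually, a 70/30 split amounts to shuffling the row indices and slicing them at the 70% mark. The sketch below shows this logic in plain NumPy; it is an illustration of the idea, not how crandas implements train_test_split on secret-shared data.

```python
import numpy as np

def split_indices(n_rows, test_size=0.3, random_state=None):
    """Shuffle row indices and slice them into train and test parts."""
    rng = np.random.default_rng(random_state)
    idx = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return idx[n_test:], idx[:n_test]  # train indices, test indices

train_idx, test_idx = split_indices(100, test_size=0.3, random_state=42)
```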
Creating the model
model = LinearRegression()

# solver is an optional parameter; if not given it defaults to 'cholesky'
modelR = Ridge(solver='cholesky')

# If no alpha parameter is given, the Ridge regression uses alpha = 1
modelRalpha = Ridge(solver='cholesky', alpha=0.5)
The solver argument specifies the method used in the computation; if not specified, it defaults to cholesky. Currently, crandas only supports the cholesky solver, but we plan to add new solvers soon.
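To show what a Cholesky-based solver does, here is a sketch in plain NumPy (an illustration of the technique, not the crandas implementation): the normal equations (X^T X) beta = X^T y have a symmetric positive-definite matrix, so it factors as X^T X = L L^T with L lower triangular, and beta follows from two triangular solves.

```python
import numpy as np

# Synthetic, noise-free data so the exact coefficients are recoverable
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true

A = X.T @ X                    # symmetric positive definite
b = X.T @ y
L = np.linalg.cholesky(A)      # A = L @ L.T
z = np.linalg.solve(L, b)      # forward solve:  L z = b
beta = np.linalg.solve(L.T, z) # backward solve: L^T beta = z
```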
Fitting the model
The model can now be fit to the training set that we created earlier (70% of the data):
model = model.fit(X_train, y_train)
Obtain the model coefficients
Following the fitting of the model on the training data, we can obtain the model coefficients or beta coefficients. These represent the influence of each feature on the target variable. This can be done as follows:
beta = model.get_beta()
>>> beta.open()
   intercept  Apparent Temperature (C)  Humidity  Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  Loud Cover  Pressure (millibars)
0  15.154187                  0.713798 -3.539027           0.023273                0.001657        -0.067151         0.0             -0.008243
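To see what these coefficients mean, here is a small sketch in plain NumPy (not a crandas call): a prediction is the intercept plus the dot product of a feature row with the betas. The coefficient values are the ones opened above; the feature row is invented purely for illustration.

```python
import numpy as np

# Coefficients from the fitted model above
intercept = 15.154187
coefs = np.array([0.713798, -3.539027, 0.023273, 0.001657,
                  -0.067151, 0.0, -0.008243])

# Hypothetical feature row: apparent temperature, humidity, wind speed,
# wind bearing, visibility, loud cover, pressure (values made up)
row = np.array([7.5, 0.85, 14.0, 250.0, 11.0, 0.0, 1015.0])

prediction = intercept + row @ coefs
```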
Once the model has been fitted, it can also be used to make predictions on the target variable. Using the predict method on the test set (X_test), we can predict the values of our Temperature (C) variable.
# Create predictions for y-values (temperature) based on X-values (predictor variables)
y_test_pred = model.predict(X_test)
>>> y_test_pred.open().head()
   predictions
0     7.046411
1     7.420447
2     7.927161
3     7.165771
4     8.037554
Assessing prediction quality
Assessing the quality of the model is a critical step after fitting. We can score the model by computing the R-squared coefficient using the score_r2 function:
>>> score_r2(y_test, y_test_pred).open()
0.6524295806884766
If we are not interested in the predictions themselves, only in the R-squared score, we can skip a step by using the score method. Here we use it with our Ridge model:
# We need to fit the model first
modelR = modelR.fit(X_train, y_train)
scoreR = modelR.score(X_test, y_test)
>>> scoreR.open()
0.6711492538452148
Both methods give us the R-squared score. Note that the two scores differ: the two models have different parameters, so they produce different predictions for the same data.
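The R-squared score itself is straightforward to state: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain NumPy on a tiny made-up example (an illustration of the metric, not the crandas implementation, which operates on secret-shared data).

```python
import numpy as np

def r2(y_true, y_pred):
    """R-squared: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
score = r2(y_true, y_pred)
```

A score of 1 means the predictions match the targets exactly; a score near 0 means the model does no better than always predicting the mean.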
As of v1.6, R-squared is the only metric for linear regression, but more will be implemented shortly.