I currently have a dataset with variables and observations. I want to predict a variable (demand), which is continuous, so I need a regression model. I tried linear regression and evaluated it with the R² metric, which was around 0.85. I then wanted to compare it against other models, one of which was a neural network. I believe neural networks are better suited to other tasks, like classification; nevertheless, I wanted to give them a try.
I decided to use scikit-learn, mainly because it offers both models (LinearRegression and MLPRegressor). The problem is that the resulting R² score was far worse than linear regression's, so I concluded that I am missing many important configurations. Below you can see my code and what the data looks like.
My data has the following columns; only demand (which is my label), population, gdp, day, and year are numerical and continuous, while the rest are categorical:
['demand','holy','gdp','population', 'day','year', 'f0', 'f1', 'f2', 'f3', 'f4','f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'g0', 'g1', 'g2', 'g3', 'g4', 'g5', 'g6', 'g7', 'g8', 'g9', 'g10', 'g11']
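For completeness, this is roughly how I turn a categorical column into numeric dummies before fitting (a sketch with `pd.get_dummies` on a hypothetical toy frame; the values below are stand-ins, not my real data):

```python
import pandas as pd

# Hypothetical toy frame standing in for my real data
df = pd.DataFrame({
    "demand": [10.0, 12.5, 9.8],
    "holy": ["yes", "no", "no"],      # categorical column
    "population": [1.2e6, 1.3e6, 1.1e6],
})

# One-hot encode the categorical column; numeric columns pass through untouched
encoded = pd.get_dummies(df, columns=["holy"])
print(encoded.columns.tolist())
```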
This is what I actually do (I removed some outputs):
import pandas as pd
import numpy as np
import math as math
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score
# shuffle the rows, then take the first 80% for training and the rest for validation
training_data, validation_data = np.split(data.sample(frac=1), [int(.8 * len(data))])
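The split above is my own shuffle-and-slice; the same 80/20 split can also be done with scikit-learn's `train_test_split` (a sketch on a hypothetical frame, with a fixed seed so it is reproducible):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for my real data
data = pd.DataFrame({"demand": np.arange(100.0), "gdp": np.arange(100.0)})

# 80/20 shuffled split, reproducible via random_state
training_data, validation_data = train_test_split(data, test_size=0.2, random_state=42)
print(len(training_data), len(validation_data))
```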
# fit the linear model on every column except the label; passing the label
# as a Series (not a one-column DataFrame) keeps the predictions 1-D
linear_model = LinearRegression().fit(training_data[[c for c in data.columns if c != "demand"]], training_data["demand"])
validation_data_predictions = linear_model.predict(validation_data[[c for c in training_data.columns if c != "demand"]])
validation_predictions_pd = pd.DataFrame(data=validation_data_predictions,
                                         index=validation_data.index.values,
                                         columns=["prediction"])
# join the predictions back onto the validation frame
result_df = validation_data.join(validation_predictions_pd, how="inner")
r2_error = r2_score(y_true=result_df[["demand"]], y_pred=result_df[["prediction"]], multioutput="uniform_average")
print(r2_error) # outputs 0.85
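As a sanity check, `r2_score` computed this way should match the estimator's own `.score` on the same data, since `.score` is R² by definition for scikit-learn regressors (a sketch on hypothetical toy data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy regression problem
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)
# .score computes R^2, so the two numbers should coincide
print(np.isclose(model.score(X, y), r2_score(y, model.predict(X))))
```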
# NN section
# default solver (adam); note the inputs are not scaled here
clf = MLPRegressor(hidden_layer_sizes=(10,), max_iter=100000)
neural_model = clf.fit(training_data[[c for c in training_data.columns if c != "demand"]], training_data["demand"])
validation_data_predictions = neural_model.predict(validation_data[[c for c in training_data.columns if c != "demand"]])
validation_predictions_pd = pd.DataFrame(data=validation_data_predictions,
                                         index=validation_data.index.values,
                                         columns=["prediction"])
result_df = validation_data.join(validation_predictions_pd, how="inner")
r2_error = r2_score(y_true=result_df[["demand"]], y_pred=result_df[["prediction"]], multioutput="uniform_average")
print(r2_error) # outputs 0.23
So, as you can see, the NN's performance is very poor, and I think it can be improved. Any hints?
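One configuration I suspect matters is feature scaling: unlike linear regression, the MLP is sensitive to the scale of its inputs, and population and gdp are orders of magnitude larger than the dummy columns. Here is a sketch of what I plan to try, wrapping StandardScaler and MLPRegressor in a pipeline (toy data, not my real columns):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor

# Hypothetical toy data with wildly different feature scales
rng = np.random.default_rng(0)
X = pd.DataFrame({"population": rng.normal(1e6, 1e5, 200),
                  "gdp": rng.normal(5e4, 1e4, 200)})
# A simple target that depends linearly on the standardized features
y = (X["population"] - 1e6) / 1e5 + (X["gdp"] - 5e4) / 1e4

# Standardize the inputs before the MLP so gradient descent behaves
model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(10,),
                                   max_iter=2000, random_state=0))
model.fit(X, y)
print(model.score(X, y))
```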