When I use scikit-learn's LinearRegression, the model doesn't perform well. However, when I use SciPy's linear regression, it works perfectly. The datasets are very simple. Is there a flaw in my logic or in the code?

I tried several self-generated linear datasets, each consisting of one feature column and one label column.

importing libraries

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy import stats

generating data

X=[]
Y=[]
for i in range (100):
    X.append(2*i+3)
    Y.append(1.8*X[i]+32)
X=np.array(X,dtype=float)
Y=np.array(Y,dtype=float)

creating a model and splitting the data into train and test sets

reg = LinearRegression()
X_train, Y_train, X_test, Y_test = train_test_split(X, Y, test_size=0.5, random_state=0)

reshaping the train and test features, since there is only a single feature column

X_train,X_test=(X_train.reshape(-1,1),X_test.reshape(-1,1))

fitting the training data and scoring it

reg.fit(X_train,Y_train)
reg.score(X_test,Y_test)

The score I get varies with the dataset size, but it was never good and was mostly negative.

However, when I use the SciPy model

slope, intercept, r_value, p_value, std_err = stats.linregress(X, Y)

it works perfectly and always finds a slope of 1.8 and an intercept of 32.

  • When you use stats.linregress you use the entire dataset. Why not use the same entire dataset for your sklearn model?
    – Mr_U4913
    Commented Sep 3, 2019 at 23:29
  • Even with 50% of the data, I get slope=1.8 and intercept=31.99
    Commented Sep 3, 2019 at 23:33
  • @Mr_U4913 I did the same before posting the question, both with many different dataset splits and with the entire dataset used for training; the result was the same.
    – Serilena
    Commented Sep 4, 2019 at 12:28
  • My problem was solved by jjurado's answer.
    – Serilena
    Commented Sep 4, 2019 at 12:35

1 Answer

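train_test_split returns the splits in the same order as the arrays you pass in: first both parts of X (train, then test), then both parts of Y (train, then test). You mixed up the X and Y outputs, so your Y_train actually held the test part of X and your X_test held the training part of Y.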

Your problem will be solved if you do this:

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.5,random_state=0)
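To see the return order concretely, here is a minimal sketch on a toy dataset (the arrays below are made up for illustration and are not the data from the question). The labels are offset by 100 so features and labels are easy to tell apart:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): features are 0..9, labels are 100..109
X = np.arange(10, dtype=float)
Y = X + 100

# For two input arrays the output order is:
# X_train, X_test, Y_train, Y_test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0
)

print("X_train:", X_train)  # all values below 100, so these are features
print("Y_train:", Y_train)  # all values at or above 100, so these are labels

# Each label still lines up with its own feature after the shuffle
print(np.array_equal(Y_train, X_train + 100))
```

If the names are unpacked in the order X_train, Y_train, X_test, Y_test instead, the second variable silently receives feature values, and no error is raised because the shapes still match.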

SciPy works because you passed it the whole dataset directly, with X and Y in the correct positions.
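As a sanity check, the sketch below rebuilds the data from the question with a vectorized expression (equivalent to the loop), applies the corrected unpacking, and fits both libraries on the same training half. The variable names `score` and `result` are my own; with noise-free linear data, both fits should recover a slope close to 1.8 and an intercept close to 32.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same data as in the question: X = 2*i + 3, Y = 1.8*X + 32
X = 2.0 * np.arange(100) + 3.0
Y = 1.8 * X + 32.0

# Corrected unpacking order
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=0
)

# scikit-learn expects a 2-D feature matrix, hence the reshape
reg = LinearRegression()
reg.fit(X_train.reshape(-1, 1), Y_train)
score = reg.score(X_test.reshape(-1, 1), Y_test)  # R^2 on the held-out half

# SciPy fitted on the same training half for a like-for-like comparison
result = stats.linregress(X_train, Y_train)

print("sklearn:", reg.coef_[0], reg.intercept_, score)
print("scipy:  ", result.slope, result.intercept)
```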
