Linear regression using Sklearn prediction not working. data not fit properly

Question

I am trying to perform a linear regression on the following data.

X = [[ 1 26]
 [ 2 26]
 [ 3 26]
 [ 4 26]
 [ 5 26]
 [ 6 26]
 [ 7 26]
 [ 8 26]
 [ 9 26]
 [10 26]
 [11 26]
 [12 26]
 [13 26]
 [14 26]
 [15 26]
 [16 26]
 [17 26]
 [18 26]
 [19 26]
 [20 26]
 [21 26]
 [22 26]
 [23 26]
 [24 26]
 [25 26]
 [26 26]
 [27 26]
 [28 26]
 [29 26]
 [30 26]
 [31 26]
 [32 26]
 [33 26]
 [34 26]
 [35 26]
 [36 26]
 [37 26]
 [38 26]
 [39 26]
 [40 26]
 [41 26]
 [42 26]
 [43 26]
 [44 26]
 [45 26]
 [46 26]
 [47 26]
 [48 26]
 [49 26]
 [50 26]
 [51 26]
 [52 26]
 [53 26]
 [54 26]
 [55 26]
 [56 26]
 [57 26]
 [58 26]
 [59 26]
 [60 26]
 [61 26]
 [62 26]
 [63 26]
 [64 26]
 [65 26]
 [66 26]
 [67 26]
 [68 26]
 [69 26]]

Y = [  192770 14817993  1393537   437541   514014   412468   509393   172715
   329806   425876   404031   524371   362817   692020   585431   446286
   744061   458805   330027   495654   459060   734793   701697   663319
   750496   525311  1045502   250641   500360   507594   456444   478666
   431382   495689   458200   349161   538770   355879   535924   549858
   611428   517146   239513   354071   342354   698360   467248   500903
   625170   404462  1057368   564703   700988  1352634   727453   782708
   1023673  1046348  1175588   698072   605187   684739   884551  1067267
   728643   790098   580151   340890   299185]

I am trying to plot the result to see the regression line using

regr = linear_model.LinearRegression()

regr.fit(X, Y)

plt.scatter(X[:,0], Y,  color='black')
plt.plot(X[:,0], regr.predict(X), color='blue',
     linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

The graph I get is

Coefficients: array([-34296.90306122, 0. ])
Residual sum of squares: 1414631501323.43
Variance score: -17.94

I am trying to predict

pred = regr.predict([[49, 26]])

print(pred)

a point that is already in the training data, and the result is [-19155.16326531], whereas the actual value is 625170.

What am I doing wrong?

Please note that the value 26 comes from a larger array; I have sliced that data down to a small portion so as to train and predict on 26. Similarly, the values in X[:,0] might not be continuous, since they also come from a larger array. By array I mean a numpy array.

What is X exactly? Is that a numpy array? Also is everything plotting where it should be? I'm guessing no because the result of 'pred' is so messed up — SAMO, Commented Jul 28, 2016 at 16:05
This would be a better question if you edited it to be reproducible. For example, all the commas in your data are missing and you use both X and x interchangeably. Those are just the ones I noticed off hand. — Jeff, Commented Jul 28, 2016 at 16:10

Nick Becker · Accepted Answer · 2016-07-28 16:11:18Z

2

As SAMO said in his comment, it's not clear what your data structures are. Assuming you have two features in X and a target Y, if you convert X and Y to numpy arrays your code works as expected.

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

x1 = range(1, 70)
x2 = [26]*69

X = np.column_stack([x1, x2])

y = '''  192770 14817993  1393537   437541   514014   412468   509393   172715
   329806   425876   404031   524371   362817   692020   585431   446286
   744061   458805   330027   495654   459060   734793   701697   663319
   750496   525311  1045502   250641   500360   507594   456444   478666
   431382   495689   458200   349161   538770   355879   535924   549858
   611428   517146   239513   354071   342354   698360   467248   500903
   625170   404462  1057368   564703   700988  1352634   727453   782708
   1023673  1046348  1175588   698072   605187   684739   884551  1067267
   728643   790098   580151   340890   299185'''

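# Parse the whitespace-separated values into an integer array
# (a list is needed because map() is lazy in Python 3).
Y = np.array([int(v) for v in y.split()])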
regr = linear_model.LinearRegression()

regr.fit(X, Y)

plt.scatter(X[:,0], Y,  color='black')
plt.plot(X[:,0], regr.predict(X), color='blue',
     linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

print(regr.predict([[49, 26]]))
# 611830.33589088


The answer seems to be useful, but my issue is that my data is not continuous as it is in your answer.
– Jibin Mathew
Commented Jul 28, 2016 at 16:17
@JibinMathew what do you mean by continuous? If your data is all integers then barring some condition they are continuous. Even if there is some floor or ceiling that would be reflected in the data itself.
– SAMO
Commented Jul 28, 2016 at 16:20
I mean the way Nick populated x1, varying continuously from 1 to 70. To put it more simply, the data I have included comes from a larger set of data.
– Jibin Mathew
Commented Jul 28, 2016 at 16:25
@JibinMathew Well, that isn't a critical part of the solution; you just need to comma-separate the two values in X.
– SAMO
Commented Jul 28, 2016 at 17:07
@SAMO can you verify the result if you remove the last 20 elements from X and Y?
– Jibin Mathew
Commented Jul 28, 2016 at 17:12


Rahul Murmuria · Accepted Answer · 2016-07-28 16:29:18Z

You are probably messing with the input arrays before the plot. Given the information in your question, the regression does return a result close to your expected answer of 625170.

from sklearn import linear_model

# your input arrays
x = [[a, 26] for a in range(1, 70, 1)]
y = [192770, 14817993,1393537, 437541, 514014, 412468, 509393, 172715, 329806, 425876, 404031, 524371, 362817, 692020, 585431, 446286, 744061, 458805, 330027, 495654, 459060, 734793, 701697, 663319, 750496, 525311,1045502, 250641, 500360, 507594, 456444, 478666, 431382, 495689, 458200, 349161, 538770, 355879, 535924, 549858, 611428, 517146, 239513, 354071, 342354, 698360, 467248, 500903, 625170, 404462,1057368, 564703, 700988,1352634, 727453, 782708, 1023673,1046348,1175588, 698072, 605187, 684739, 884551,1067267, 728643, 790098, 580151, 340890, 299185]

# your code for regression
regr = linear_model.LinearRegression()
regr.fit(x, y)

# the correct coef is different from your findings
print(regr.coef_)

This returns a result: array([-13139.72031421, 0. ])

When trying prediction: regr.predict([[49, 26]]) returns array([ 611830.33589088]), which is close to the answer you expected.
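
The zero second coefficient is expected: the second column of x is always 26, so it carries no information that the intercept does not already capture. Below is a minimal sketch of that point. The toy data and the names X_two, X_one, model_two and model_one are assumptions made for illustration; they are not the question's values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (an assumption for illustration, not the question's values).
x1 = np.arange(1, 11)
noise = np.array([0.5, -0.3, 0.1, 0.0, -0.2, 0.4, -0.1, 0.2, -0.4, 0.3])
y = 3.0 * x1 + 5.0 + noise

X_two = np.column_stack([x1, np.full(x1.shape, 26)])  # second column is constant
X_one = x1.reshape(-1, 1)                             # first column only

model_two = LinearRegression().fit(X_two, y)
model_one = LinearRegression().fit(X_one, y)

# The constant column receives a coefficient of (numerically) zero ...
print(model_two.coef_)

# ... and both models give the same prediction for the same x1.
pred_two = model_two.predict([[49, 26]])
pred_one = model_one.predict([[49]])
print(pred_two, pred_one)
```

If this holds for your real data as well, the constant column can be dropped before fitting without changing the predictions.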

Leonardo Alves Machado · Accepted Answer · 2019-04-28 22:29:37Z

0

print(regression.predict(np.array([[60]])))

edited Apr 28, 2019 at 22:29

Leonardo Alves Machado


answered Apr 28, 2019 at 18:17

Hasan Gazi Karasahin


If you post code as an answer, you should add an explanation that shows how it solves the OP's problem.
– Mark Benningfield
Commented Apr 28, 2019 at 18:49


Studocwho · Accepted Answer · 2019-08-28 19:01:20Z

0

If you want a prediction for a single value (a float), passing that value directly to predict may not work. At first I tried the code below, but it didn't work:

lin_reg.predict(6.5)

The solution I found was:

lin_reg.predict([[6.5]])

Try it out if that works for you too.
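
The reason is that predict expects a 2-D array of shape (n_samples, n_features), so a bare scalar is rejected. Here is a minimal sketch on made-up data. The name lin_reg follows the snippet above, while the training values and the other variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data lying exactly on y = 2x + 1 (illustrative assumption).
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
lin_reg = LinearRegression().fit(X_train, y_train)

# A bare float has no sample/feature axes, so scikit-learn raises ValueError.
try:
    lin_reg.predict(6.5)
    scalar_rejected = False
except ValueError:
    scalar_rejected = True

# One sample with one feature: shape (1, 1).
pred = lin_reg.predict([[6.5]])

# Equivalent: reshape a scalar into a single row.
pred_reshaped = lin_reg.predict(np.array(6.5).reshape(1, -1))

print(scalar_rejected, pred, pred_reshaped)
```

The same rule explains the question's call regr.predict([[49, 26]]): one row (a single sample) with two columns (the two features).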

edited Aug 28, 2019 at 19:01

Studocwho


answered Aug 28, 2019 at 17:59

Ishwor Bhusal


