2

I am trying to perform a linear regression on the following data.

X = [[ 1 26]
 [ 2 26]
 [ 3 26]
 [ 4 26]
 [ 5 26]
 [ 6 26]
 [ 7 26]
 [ 8 26]
 [ 9 26]
 [10 26]
 [11 26]
 [12 26]
 [13 26]
 [14 26]
 [15 26]
 [16 26]
 [17 26]
 [18 26]
 [19 26]
 [20 26]
 [21 26]
 [22 26]
 [23 26]
 [24 26]
 [25 26]
 [26 26]
 [27 26]
 [28 26]
 [29 26]
 [30 26]
 [31 26]
 [32 26]
 [33 26]
 [34 26]
 [35 26]
 [36 26]
 [37 26]
 [38 26]
 [39 26]
 [40 26]
 [41 26]
 [42 26]
 [43 26]
 [44 26]
 [45 26]
 [46 26]
 [47 26]
 [48 26]
 [49 26]
 [50 26]
 [51 26]
 [52 26]
 [53 26]
 [54 26]
 [55 26]
 [56 26]
 [57 26]
 [58 26]
 [59 26]
 [60 26]
 [61 26]
 [62 26]
 [63 26]
 [64 26]
 [65 26]
 [66 26]
 [67 26]
 [68 26]
 [69 26]]

Y = [  192770 14817993  1393537   437541   514014   412468   509393   172715
   329806   425876   404031   524371   362817   692020   585431   446286
   744061   458805   330027   495654   459060   734793   701697   663319
   750496   525311  1045502   250641   500360   507594   456444   478666
   431382   495689   458200   349161   538770   355879   535924   549858
   611428   517146   239513   354071   342354   698360   467248   500903
   625170   404462  1057368   564703   700988  1352634   727453   782708
   1023673  1046348  1175588   698072   605187   684739   884551  1067267
   728643   790098   580151   340890   299185]

I am trying to plot the result to see the regression line using

regr = linear_model.LinearRegression()

regr.fit(X, Y)

plt.scatter(X[:,0], Y,  color='black')
plt.plot(X[:,0], regr.predict(X), color='blue',
     linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

The graph I get is: [image not included]

('Coefficients: \n', array([-34296.90306122, 0. ]))
Residual sum of squares: 1414631501323.43
Variance score: -17.94

I am trying to predict a point that is already in the training data:

pred = regr.predict([[49, 26]])

print(pred)

The result is [-19155.16326531], whereas the actual value is 625170.

What am I doing wrong?

Please note that the value 26 comes from a larger array; I have sliced that data down to a small portion so as to train and predict on 26. Similarly, the values in X[:,0] might not be continuous, since they also come from a larger array. By array I mean a numpy array.

2
  • What is X exactly? Is that a numpy array? Also is everything plotting where it should be? I'm guessing no because the result of 'pred' is so messed up
    – SAMO
    Commented Jul 28, 2016 at 16:05
  • 5
    This would be a better question if you edited it to be reproducible. For example, all the commas in your data are missing and you use both X and x interchangeably. Those are just the ones I noticed offhand.
    – Jeff
    Commented Jul 28, 2016 at 16:10

4 Answers

2

As SAMO said in his comment, it's not clear what your data structures are. Assuming you have two features in X and a target Y, if you convert X and Y to numpy arrays your code works as expected.

import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt

x1 = range(1, 70)
x2 = [26]*69

X = np.column_stack([x1, x2])

y = '''  192770 14817993  1393537   437541   514014   412468   509393   172715
   329806   425876   404031   524371   362817   692020   585431   446286
   744061   458805   330027   495654   459060   734793   701697   663319
   750496   525311  1045502   250641   500360   507594   456444   478666
   431382   495689   458200   349161   538770   355879   535924   549858
   611428   517146   239513   354071   342354   698360   467248   500903
   625170   404462  1057368   564703   700988  1352634   727453   782708
   1023673  1046348  1175588   698072   605187   684739   884551  1067267
   728643   790098   580151   340890   299185'''

# list() is needed on Python 3, where map returns an iterator
Y = np.array(list(map(int, y.split())))
regr = linear_model.LinearRegression()

regr.fit(X, Y)

plt.scatter(X[:,0], Y,  color='black')
plt.plot(X[:,0], regr.predict(X), color='blue',
     linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()

print(regr.predict([[49, 26]]))
# 611830.33589088
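
Because the data in the question is pasted as printed numpy output (no commas), it cannot be copied straight into Python. The sketch below shows one possible way to turn such text back into arrays. It assumes the text contains only integers and brackets; `parse_printed` is a hypothetical helper name, not a numpy function, and only the first few rows of the question's data are reproduced.

```python
import numpy as np

# Shortened copies of the printed arrays from the question (first rows only).
printed_X = """[[ 1 26]
 [ 2 26]
 [ 3 26]]"""
printed_Y = "[  192770 14817993  1393537]"


def parse_printed(text, n_cols=None):
    """Hypothetical helper: turn printed integer-array text back into an array."""
    # Drop the brackets, then split on whitespace to get the raw number tokens.
    tokens = text.replace("[", " ").replace("]", " ").split()
    values = np.array([int(t) for t in tokens])
    # Reshape into rows of n_cols values when a 2D array is wanted.
    return values if n_cols is None else values.reshape(-1, n_cols)


X = parse_printed(printed_X, n_cols=2)
Y = parse_printed(printed_Y)
print(X.shape, Y.shape)
```

The resulting `X` and `Y` can then be passed to `regr.fit(X, Y)` in the same way as the arrays built above.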
8
  • The answer seems useful, but my issue is that my data is not continuous as in your answer. Commented Jul 28, 2016 at 16:17
  • @JibinMathew what do you mean by continuous? If your data is all integers then barring some condition they are continuous. Even if there is some floor or ceiling that would be reflected in the data itself.
    – SAMO
    Commented Jul 28, 2016 at 16:20
  • I mean the way Nick populated x1, varying continuously from 1 to 70. To put it more simply, the data I have included comes from a larger set of data. Commented Jul 28, 2016 at 16:25
  • @JibinMathew Well that isn't a critical part of the solution, you just need to comma separate the two values in X.
    – SAMO
    Commented Jul 28, 2016 at 17:07
  • @SAMO can you verify the result if you remove the last 20 elements from X and Y? Commented Jul 28, 2016 at 17:12
1

You are probably modifying the input arrays somewhere before plotting. Given the information in your question, the regression does return a result close to your expected answer of 625170.

from sklearn import linear_model

# your input arrays
x = [[a, 26] for a in range(1, 70, 1)]
y = [192770, 14817993,1393537, 437541, 514014, 412468, 509393, 172715, 329806, 425876, 404031, 524371, 362817, 692020, 585431, 446286, 744061, 458805, 330027, 495654, 459060, 734793, 701697, 663319, 750496, 525311,1045502, 250641, 500360, 507594, 456444, 478666, 431382, 495689, 458200, 349161, 538770, 355879, 535924, 549858, 611428, 517146, 239513, 354071, 342354, 698360, 467248, 500903, 625170, 404462,1057368, 564703, 700988,1352634, 727453, 782708, 1023673,1046348,1175588, 698072, 605187, 684739, 884551,1067267, 728643, 790098, 580151, 340890, 299185]

# your code for regression
regr = linear_model.LinearRegression()
regr.fit(x, y)

# the correct coef is different from your findings
print(regr.coef_)

This returns a result: array([-13139.72031421, 0. ])

When trying prediction: regr.predict([[49, 26]]) returns array([ 611830.33589088]), which is close to the answer you expected.
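
The 0 in the second coefficient most likely reflects that the second column is constant (always 26), so it adds nothing beyond the intercept. Here is a minimal sketch of that idea on made-up toy data (not the question's data), assuming a reasonably recent scikit-learn. The names `X_two`, `X_one`, `two` and `one` are chosen here only for illustration.

```python
import numpy as np
from sklearn import linear_model

# Toy data (an assumption for illustration): y depends only on the first feature.
x1 = np.arange(1, 11)
y = 3.0 * x1 + 5.0

# Two-column input with a constant second column, like the 26 column in the question.
X_two = np.column_stack([x1, np.full(10, 26)])
# The same data with the constant column dropped.
X_one = x1.reshape(-1, 1)

two = linear_model.LinearRegression().fit(X_two, y)
one = linear_model.LinearRegression().fit(X_one, y)

# The coefficient of the constant column should come out as (numerically) zero.
print(two.coef_)
# Both models should give the same prediction for the same first-feature value.
print(two.predict([[4, 26]]), one.predict([[4]]))
```

If both models agree like this on your real data, the constant column can be dropped without changing the predictions.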

0
print(regression.predict(np.array([[60]])))
1
  • 3
    If you post code as an answer, you should add an explanation that shows how it solves the OP's problem. Commented Apr 28, 2019 at 18:49
0

If you want to predict for a single value (a float), passing it directly to predict may not work. I first tried the code below, but it didn't work:

lin_reg.predict(6.5)

The solution I found was to wrap the value in a nested list:

lin_reg.predict([[6.5]])

Try it and see if it works for you too.
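
As far as I understand, the nested list works because predict expects a 2D array-like of shape (n_samples, n_features), so a single value has to be presented as one row with one column. A small sketch with made-up training data (the numbers and every name other than `lin_reg` are assumptions for illustration):

```python
import numpy as np
from sklearn import linear_model

# Made-up training data with a single feature: y = 2 * x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
lin_reg = linear_model.LinearRegression().fit(X, y)

value = 6.5
# Option 1: nested list, i.e. one sample with one feature.
as_nested_list = lin_reg.predict([[value]])
# Option 2: reshape a 1D array into shape (1, 1), which means the same thing.
as_reshaped = lin_reg.predict(np.array([value]).reshape(1, -1))

print(as_nested_list, as_reshaped)
```

Either form returns a 1D array with one prediction per input row.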
