scikit-learn random state in splitting dataset

Question

Can anyone tell me why we set random_state to zero when splitting the data into train and test sets?

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=0)

I have seen situations like this where random state is set to 1!

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.30, random_state=1)

What is the consequence of this random state in cross validation as well?

@Scott Hunter it comes from sklearn.cross_validation. But what is the effect of random state being zero or one on the train and test split? — Shelly, Commented Feb 13, 2017 at 0:44
It's just to make sure that you obtain the same split every time you run your script. Read up a bit on pseudo-random number generators. (A number like 32525352 would have the same effect as 0 or 1; it's just a constant which is mapped to some internal state.) If you don't do this, the generator is seeded from an unpredictable source (OS entropy, or the clock as a fallback), giving different results in most of your runs. — sascha, Commented Feb 13, 2017 at 5:19
Possible duplicate of Random state (Pseudo-random number) in Scikit learn — Ricky Geng, Commented Apr 28, 2019 at 4:36

Vivek Kumar · Accepted Answer · 2017-02-13 06:00:34Z

68

It doesn't matter whether random_state is 0 or 1 or any other integer. What matters is that it is set to the same value every time if you want to validate your processing over multiple runs of the code. By the way, I have seen random_state=42 used in many official scikit-learn examples as well as elsewhere.

random_state, as the name suggests, is used for initializing the internal random number generator, which in your case decides how the data is split into train and test indices. In the documentation, it is stated that:

If random_state is None or np.random, then a randomly-initialized RandomState object is returned.

If random_state is an integer, then it is used to seed a new RandomState object.

If random_state is a RandomState object, then it is passed through.

This is to check and validate the data when running the code multiple times. Setting random_state to a fixed value guarantees that the same sequence of random numbers is generated each time you run the code. Unless some other source of randomness is present in the process, the results will then be the same on every run. This helps in verifying the output.
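The three cases quoted from the documentation can be tried directly. The following is a minimal sketch, assuming the quoted text describes the helper sklearn.utils.check_random_state; the variable names are only illustrative.

```python
import numpy as np
from sklearn.utils import check_random_state

# An integer seeds a fresh RandomState, so two generators built from the
# same integer produce the same sequence of numbers.
rng_a = check_random_state(0)
rng_b = check_random_state(0)
same_sequence = np.array_equal(rng_a.permutation(10), rng_b.permutation(10))

# An existing RandomState instance is passed through unchanged.
existing = np.random.RandomState(1)
passed_through = check_random_state(existing) is existing

# None returns NumPy's global RandomState, which this script never seeds.
global_rng = check_random_state(None)
is_random_state = isinstance(global_rng, np.random.RandomState)

print(same_sequence, passed_through, is_random_state)
```

Because the integer case always builds the generator from scratch, the sequence it produces does not depend on anything that happened earlier in the program.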


It's strange: every time I rerun my classification metrics (specificity, sensitivity, etc.) I get a variation in my scores, even though I have set a seed. Any idea why that might be? Is there anywhere else I need to set a seed aside from train_test_split, for example in .fit(), .score() or .predict()? I don't believe I have any other sources of randomness anywhere.
– bernando_vialli
Commented May 4, 2018 at 14:49
@mathlover I observed the same randomness in my output as well. All I found is that once you set random_state to some value, the output (mean_absolute_error in my case) becomes fixed, i.e. it is the same every time I run it.
– 0xPrateek
Commented Aug 10, 2019 at 13:34
When the value itself doesn't matter, why isn't it just a boolean?
– Ben
Commented Sep 20, 2019 at 13:27
@Ben Because internally the value supplied in random_state acts as the seed for the pseudo-random number generator used in NumPy. When it's not set, the generator is seeded from an unpredictable source instead (OS entropy, or the system clock as a fallback). So a boolean would not be appropriate.
– Vivek Kumar
Commented Sep 20, 2019 at 13:42
The random seed is often set to 42 because "The Answer to the Ultimate Question of Life, the Universe, and Everything is 42" in The Hitchhiker's Guide to the Galaxy. I think most people know this, but just in case it is of interest, see en.wikipedia.org/wiki/…
– Joey
Commented Jul 26, 2020 at 5:01


San · Accepted Answer · 2019-12-06 10:52:21Z

When random_state is set to an integer, train_test_split returns the same result on every execution.

When random_state is set to None, train_test_split returns a different result on each execution.

See the example below:

from sklearn.model_selection import train_test_split

X_data = range(10)
y_data = range(10)

# Fixed seed (zero or any other integer): the same split on every iteration.
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=0)
    print(y_test)

print("*" * 30)

# No seed: the split changes from one iteration to the next.
for i in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X_data, y_data, test_size=0.3, random_state=None)
    print(y_test)

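Output (the five lines after the row of asterisks vary from run to run):

[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
[2, 8, 4]
******************************
[4, 7, 6]
[4, 3, 7]
[8, 1, 4]
[9, 5, 8]
[6, 4, 5]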

LOrD_ARaGOrN · Accepted Answer · 2018-12-13 03:58:18Z

15

If you don't specify random_state in the code, then every time you execute it a new random value is generated, and the train and test datasets will contain different values each time.

However, if you use a particular value for random_state (random_state=1 or any other value), the result will be the same every time, i.e. the same values end up in the train and test datasets.
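To make the 0-versus-1 part of the question concrete, here is a small sketch that compares the held-out part of the split for two seeds. The helper name held_out and the toy data are made up for this illustration.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

def held_out(seed):
    # Return only the test portion of the split produced with this seed.
    _, test = train_test_split(data, test_size=0.3, random_state=seed)
    return test

# Each seed is reproducible on its own.
print(held_out(0) == held_out(0))
print(held_out(1) == held_out(1))

# The two seeds are free to select different rows; neither is "better".
print(held_out(0), held_out(1))
```

The particular integer only decides which split you get. Picking 0 or 1 is a matter of habit, as long as the same value is used on every run you want to compare.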


Ganesh · Accepted Answer · 2018-12-12 20:31:53Z

random_state makes the split random, but with a twist: for a particular value of random_state, the resulting order of the data is always the same. Note that it does not take a boolean. You can pass any integer from 0 upwards, and each one fixes its own order. For example, the order you get with random_state=0 stays the same. If you then run with random_state=5 and come back to random_state=0, you get the same order as before, and the same holds for every other integer. With random_state=None, however, the data is split differently each time.

If you still have doubts, watch this

Farzana Khan · Accepted Answer · 2020-01-22 12:06:16Z

5

If you don't specify random_state in your code, then every time you run your code a new random value is generated, and the train and test datasets will contain different values each time.

However, if a fixed value is assigned, like random_state = 0 or 1 or 42, then no matter how many times you execute your code the result will be the same, i.e. the same values in the train and test datasets.


user13140964 · Accepted Answer · 2020-03-28 13:56:16Z

5

random_state is None by default, which means that every time you run your program you will get different output, because the split between train and test varies from run to run.

Setting random_state to any int value means that every time you run your program you will get the same output, because the split between train and test does not vary.
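The question also asks about cross-validation. As a hedged sketch of the same behaviour there, the snippet below uses KFold, where random_state only takes effect when shuffle=True. The helper name fold_test_indices and the toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)

def fold_test_indices(seed):
    # shuffle=True is needed for random_state to have any effect in KFold.
    splitter = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in splitter.split(X)]

# The same seed gives the same folds on every call.
print(fold_test_indices(0) == fold_test_indices(0))
print(fold_test_indices(0))
```

With a fixed seed, every model or parameter setting you compare is evaluated on the same folds, so score differences are not caused by a different partition of the data.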


Babrit Behera · Accepted Answer · 2020-07-11 22:22:20Z

random_state is an integer that determines which of the possible train/test combinations you get. With test_size set to 1/4 and a dataset of four items there are four possible splits, and you can think of each fixed random_state as always selecting the same one of them. Suppose you have the dataset [1, 2, 3, 4]. The state numbers below are only illustrative: the integer seeds a pseudo-random shuffle, it is not an index into a list of splits.

Train     | Test | State
[1, 2, 3] | [4]  | 0
[1, 3, 4] | [2]  | 1
[4, 2, 3] | [1]  | 2
[2, 4, 1] | [3]  | 3

We need this because during parameter tuning of a model the same split is then used again and again, so a different split cannot interfere with the accuracy comparison.

Random forest also has a random_state parameter, but there it plays a different role: it controls the randomness inside the model itself (the bootstrap samples and the features considered at each split), not the train/test split.
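A minimal sketch of that last point, assuming a synthetic dataset from make_classification and an arbitrary small forest: the split and the model each take their own random_state, and both have to be fixed for the score to be repeatable. The helper name fit_and_score is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, generated with a fixed seed so the example is repeatable.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Seed for the split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def fit_and_score(model_seed):
    # Seed for the model: bootstrap samples and per-split feature choices.
    model = RandomForestClassifier(n_estimators=20, random_state=model_seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Same split and same model seed give the same score.
print(fit_and_score(0) == fit_and_score(0))
```

If scores still change between runs after seeding train_test_split, an unseeded estimator like this one is a likely place to look.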

Debasish Bhol · Accepted Answer · 2019-02-12 15:13:14Z

2

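The random_state parameter is used to make shuffling reproducible. In train_test_split it fixes the shuffle applied before the data is split; in estimators that reshuffle the training data after each epoch (for example SGD-based ones), it fixes that shuffling as well.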


Smart Manoj · Accepted Answer · 2020-09-05 06:57:41Z

1

Across multiple executions of our model, random_state makes sure that the training and testing datasets contain the same data values. It fixes the order of the data for train_test_split.

answered Jan 12, 2020 at 9:40 by hari, edited Sep 5, 2020 at 6:57 by Smart Manoj

How does it make sure the split is the same? I mean, can you explain the algorithm?
– RafiO
Commented Jul 9, 2023 at 22:36
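On the question of the algorithm, the idea can be sketched in a few lines. This is a simplified illustration, not scikit-learn's actual code: sketch_split is a hypothetical helper, and it assumes the split is made by shuffling the row indices with a seeded generator and cutting the shuffled order into two parts.

```python
import numpy as np

def sketch_split(n_samples, n_test, seed):
    # A generator seeded with the same integer always starts in the same state,
    # so the permutation it produces is the same on every call.
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    # Cut the shuffled index order: first n_test indices for test, rest for train.
    test_idx = order[:n_test].tolist()
    train_idx = order[n_test:].tolist()
    return train_idx, test_idx

train_idx, test_idx = sketch_split(10, 3, seed=0)
print(train_idx, test_idx)
print(sketch_split(10, 3, seed=0) == (train_idx, test_idx))
```

Nothing is stored between runs. The seed fully determines the shuffle, and the shuffle fully determines which rows land in each part.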


srihitha · Accepted Answer · 2021-07-11 08:06:05Z

0

Let's say our dataset has one feature and 10 data points, X = [0,1,2,3,4,5,6,7,8,9], and 0.3 (30%) is specified as the test fraction. Then there are 10C3 = 120 different ways to choose the test set. [Refer to the picture in the link for a tabular explanation]: https://i.sstatic.net/FZm4a.png

Based on the random_state you specify, the system picks one of these combinations and assigns the train and test data accordingly.
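The count can be checked with a short sketch. It computes 10C3 and then collects the test sets produced by a handful of seeds; the range of 20 seeds is an arbitrary choice for the example.

```python
from math import comb
from sklearn.model_selection import train_test_split

data = list(range(10))

# Number of ways to choose 3 test points out of 10.
n_possible = comb(10, 3)

# Each seed deterministically selects one of those combinations.
seen = set()
for seed in range(20):
    _, test = train_test_split(data, test_size=0.3, random_state=seed)
    seen.add(frozenset(test))

print(n_possible)
print(len(seen), "distinct test sets from 20 seeds")
```

The seed is not an index into the 120 combinations. Two seeds may happen to select the same test set, which is why the second number can be smaller than 20.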

