Using np.random.seed(number) has been a best practice when using NumPy to create reproducible work. Setting the random seed means your work is reproducible to others who use your code. But now, when you look at the docs for np.random.seed, the description reads as follows:
This is a convenience, legacy function. The best practice is to not reseed a BitGenerator, rather to recreate a new one. This method is here for legacy reasons only.
So what has changed? I am going to explain the old method and the problems with it. Next, I will demonstrate the new best practice and its benefits.
Stop Using NumPy’s Global Random Seed – Here’s Why
Using np.random.seed(number) defines what NumPy calls the global random seed, which affects all uses of the np.random.* module. Some imported packages or other scripts may reset the global random seed to another random seed with np.random.seed(another_number), which may cause unwanted changes to your output and make your results unreproducible.
Legacy best practice
If you look at tutorials that use np.random, you will see many of them using np.random.seed to set the foundation for reproducible work. Let's see how it works:
>>> import numpy as np
>>> np.random.rand(4)
array([0.96176779, 0.7088082 , 0.06416725, 0.82679036])
>>> np.random.rand(4)
array([0.15051909, 0.77788803, 0.67073372, 0.32134285])
As you can see, two calls to the function give two completely different results. If you want someone to be able to reproduce your projects, you can set the seed with the following code snippet:
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
As you can see, the results are the same. If you need to prove it to yourself, you can run the code above in your own Python console.
Setting the seed does more than make the next random call the same; it fixes the whole random number sequence, so that any code that produces or uses random numbers (with NumPy) will now produce the same sequence. For example, look at the following:
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.rand(4)
array([0.99724328, 0.12816238, 0.17899311, 0.75292543])
>>> np.random.rand(4)
array([0.66216051, 0.78431013, 0.0968944 , 0.05857129])
>>> np.random.rand(4)
array([0.96239599, 0.61655744, 0.08662996, 0.56127236])
>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.rand(4)
array([0.99724328, 0.12816238, 0.17899311, 0.75292543])
>>> np.random.rand(4)
array([0.66216051, 0.78431013, 0.0968944 , 0.05857129])
>>> np.random.rand(4)
array([0.96239599, 0.61655744, 0.08662996, 0.56127236])
The problem with NumPy’s global random seed
You might be looking at the example above and thinking, “so what’s the problem?” You can create repeatable calls, which means that all random numbers generated after setting the seed will be the same on any machine. For the most part, that’s true; and for many projects, you may not need to worry about it.
The problem comes from larger projects, or projects with imports that might also set the seed. Using np.random.seed(number) sets what NumPy calls the global random seed, which affects all uses of the np.random.* module. Some imported packages or other scripts could reset the global random seed to another seed with np.random.seed(another_number), which can cause unwanted changes to your output and make your results non-reproducible. In most cases, you only need the same random numbers in specific parts of your code (like tests or functions).
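As a minimal sketch of the problem (helper_that_reseeds is a made-up stand-in for code you do not control):

import numpy as np

def helper_that_reseeds():
    # Imagine this lives inside an imported package: it quietly resets the global seed
    np.random.seed(0)

np.random.seed(2021)
helper_that_reseeds()
# These numbers now come from seed 0, not seed 2021, so your "reproducible" output changed
print(np.random.rand(4))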
The solution and the new method
This is one of the reasons why NumPy decided to advise users to create a random number generator for specific tasks (or even pass it around when you need parts to be repeatable).
“The preferred best practice for getting repeatable pseudo-random numbers is to instantiate a generator object with a seed and pass it around.” - Robert Kern, NEP 19
Using this new best practice looks like this:
>>> import numpy as np
>>> rng = np.random.default_rng(2021)
>>> rng.random(4)
array([0.75694783, 0.94138187, 0.59246304, 0.31884171])
As you can see, these numbers are different from the previous example because NumPy has changed the default pseudo-random number generator. However, you can replicate the old results with RandomState, which provides the legacy generator and methods:
>>> rng = np.random.RandomState(2021)
>>> rng.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
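The new generator is just as repeatable: re-creating it with the same seed reproduces the same sequence shown above.

>>> rng = np.random.default_rng(2021)
>>> rng.random(4)
array([0.75694783, 0.94138187, 0.59246304, 0.31884171])
>>> rng = np.random.default_rng(2021)
>>> rng.random(4)
array([0.75694783, 0.94138187, 0.59246304, 0.31884171])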
Benefits
You can pass random number generators between functions and classes, which means each individual function or class can have its own random state without resetting the global seed. Additionally, each script can pass a random number generator to the functions that need to be repeatable. The advantage is that you know exactly which random number generator is used in each part of your project.
import numpy as np

def f(x, rng):
    return rng.random(1)

# Initialise a random number generator
rng = np.random.default_rng(2021)
# Pass the rng to the functions in which you would like to use it
random_number = f(0, rng)  # 0 is just a placeholder argument for x
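The same idea works for classes. Here is a small illustrative sketch (the Sampler class below is hypothetical, not part of NumPy):

import numpy as np

class Sampler:
    # Holds its own generator, so it never touches the global random state
    def __init__(self, seed):
        self.rng = np.random.default_rng(seed)

    def sample(self, n):
        return self.rng.random(n)

sampler = Sampler(2021)
print(sampler.sample(4))  # reproducible and independent of np.random.*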
Other advantages arise from parallel processing, as Albert Thomas shows us.
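One common pattern for this, sketched below with an assumed seed of 2021 and four tasks, is to spawn independent child seeds with SeedSequence and give each task its own generator:

import numpy as np

# Spawn one independent child seed per parallel task
child_seeds = np.random.SeedSequence(2021).spawn(4)
rngs = [np.random.default_rng(s) for s in child_seeds]

# Each worker uses its own generator, so the random streams do not interfere
samples = [rng.random(2) for rng in rngs]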
Using independent random number generators can help improve the reproducibility of your results, because you no longer rely on the global random state (which can be reset or used without you knowing). Passing a random number generator around means you can keep track of when and how it was used, and make sure your results stay the same.