Lessons learned implementing seedable PRNGs in stdlib #28

kgryte opened this issue Feb 27, 2025 · 6 comments

@kgryte

kgryte commented Feb 27, 2025

Hello 👋

stdlib developer here. Just wanted to share some lessons learned and provide feedback based on our experience having developed a library of seedable PRNGs (see https://github.com/stdlib-js/stdlib/tree/develop/lib/node_modules/%40stdlib/random/base).

  1. I suggest disambiguating seed and state. These are two different concepts: state is the more general concept in PRNGs. For example, in Mersenne Twister, one can seed with an integer, but the state is a ~2.5kB array. The distinction between seed and state is not something only we enforce; other libraries, such as NumPy, make it as well (see https://numpy.org/doc/stable/reference/random/bit_generators/mt19937.html). Having an immutable seed allows you to replay an entire sequence. While you could cache an initial state instead, depending on the algorithm, you could be forced to hold onto significantly more memory. And lastly, PRNG state may comprise more than just the "seed"; e.g., in Mersenne Twister, you need to track the index of where you are in the underlying state array. (See the sketch after this list.)

  2. In general, I am skeptical of mandating a particular algorithm or providing any guarantees that all JS runtimes produce exactly the same results. a) What constitutes a good PRNG today does not necessarily constitute a good PRNG tomorrow. This field is constantly evolving, and being beholden to a particular implementation due to backward compatibility concerns is enough to give me pause. b) Having a single seedable PRNG, while useful for reproducible testing, is not optimal more generally. When running simulations, e.g., you want to use different PRNG algorithms (not just different seeds!) to ensure that the results of your simulations are not an artifact of any one specific PRNG.

  3. I suggest having a seedable PRNG return an integer value over its range. Different PRNGs have different value ranges, and having access to those raw integer values is generally useful. Normalizing values to a range from 0 to 1 is fairly straightforward once you know the range of possible values. Furthermore, you generally have a broader range of potential pseudorandom integers than of normalized values due to the limits of finite precision. For users wanting normalized values, we provide a separate normalized() method and further explicitly expose the PRNG's minimum and maximum values (see the sketch after this list).

  4. I don't know whether this has been covered or considered, but another aspect which is worth considering is allowing shared state among two or more PRNGs. In stdlib, we support this by allowing a user to explicitly specify whether to copy provided state during instantiation. The reason for shared state is that there are times when you want two or more PRNGs to both share the same seed, but you do not want them to produce the exact same sequence of values.

  5. Another aspect which is not covered, as far as I can tell, is the idea of supporting jumps, where you want the ability to fast-forward PRNG state without having to explicitly generate and discard a desired number of values. Not all PRNGs support jumps, but this can be useful during simulations when you want to ensure that PRNGs are spread far enough apart that the underlying algorithm is highly unlikely to generate the same values, even if seeded the same.
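
To make items 1 and 3 concrete, here is a minimal sketch (a toy 32-bit LCG, not stdlib's actual implementation; all names are illustrative) of the seed/state split, a raw integer return value, and a normalized() method derived from the exposed MIN/MAX:

```js
// Minimal sketch: a toy 32-bit LCG illustrating seed vs. state,
// raw integer output, and a derived normalized() method.
function lcg32(seed) {
  let state = seed >>> 0; // the seed initializes the state; the state then evolves on its own

  function next() {
    // Numerical Recipes LCG constants; `state` is the entire PRNG state here
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state; // raw integer on [next.MIN, next.MAX]
  }

  next.seed = seed >>> 0;  // immutable: replaying from the seed reproduces the whole sequence
  next.MIN = 0;
  next.MAX = 0xffffffff;
  next.normalized = () => next() / (next.MAX + 1); // [0, 1), derived from the raw range
  Object.defineProperty(next, 'state', {
    get: () => state,                // current state; diverges from the seed after the first draw
    set: (s) => { state = s >>> 0; } // restoring a cached state resumes mid-sequence
  });

  return next;
}

const rand = lcg32(1234);
rand();            // => raw integer
rand.normalized(); // => value on [0, 1)
rand.state;        // no longer equal to rand.seed
```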

In any event, those are some initial thoughts and observations.

@bakkot

bakkot commented Feb 27, 2025

Thanks for the feedback!

For 1, I agree, see #25 (comment) and #18 (comment).

For 2, it would give up most of the value of this proposal if the behavior was different on different runtimes, so I personally think we should not give that up. Also, with a good PRNG there's no need to run simulations with different PRNGs because the output of the PRNG will be effectively indistinguishable from noise. I think there is basically no risk of ChaCha being discovered to be "bad" such that it makes sense to replace it or use an alternative. (Go has also hardcoded ChaCha as their PRNG and their backwards compat guarantees mean they cannot change it in the future, though they could introduce a new version of their rand package.)

For 3, I think it makes sense to match Math.random by default, since that's what everyone is already using, and only add randInt etc when that's also available elsewhere.

@tabatkins

  1. I'm gonna send this to a separate issue, it's worth discussing separately from the other issues. See Seed vs state #29.
  2. Strong agree with Bakkot. The whole point of this is to provide reproducible randomness; that is defeated if everyone's not using the same algo. We can add more chooseable algos in the future if necessary. And if you do need multiple algos to cross-check, having different UAs default to different algos isn't the way to achieve that (there's no guarantee they'll differ in the first place).
  3. I address this in the FAQ in the README - I'm matching Math.random() intentionally for now, with the assumption that https://github.com/tc39-transfer/proposal-random-functions will also advance; the combo of this and that will provide a seeded random int.
  4. Can you elaborate on this? Apologies, I'm not sure I understand how/why two prngs with the same seed would have different states.
  5. I've opened Support .jump()? #30 for this.

@kgryte

kgryte commented Feb 28, 2025

> reproducible randomness

There are two forms of reproducibility here: strong and weak. Weak reproducibility means that, for a given runtime and runtime version, you can seed two or more PRNGs with the same seed and get the same sequence of values. Strong reproducibility means that the sequence should be the same across all environments, regardless of runtime. The latter requires much stronger backward compatibility guarantees, and, having spent a fair amount of time performing simulations and writing PRNGs, I'd be cautious about hitching my wagon to any one PRNG. Opinions may differ.

> provide a seeded random int

There is a distinction between returning a discrete uniform pseudorandom variate and a random integer which spans the range of the PRNG. For the former, you may need to compress or expand the range between an arbitrary min and max. That is different than a random integer on the range [0|1, max_prng_int). The latter is more fundamental and a property of the PRNG itself.
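
To illustrate, a sketch (names hypothetical): the raw integer and its [MIN, MAX] range are properties of the PRNG itself, while a discrete uniform variate on an arbitrary [min, max] is a layer derived on top of it.

```js
// Sketch: `rand()` returns a raw integer on [rand.MIN, rand.MAX].
// A discrete uniform variate on [min, max] compresses/expands that range
// (naively here -- ignoring modulo/rounding bias for brevity).
function discreteUniform(rand, min, max) {
  const u = (rand() - rand.MIN) / (rand.MAX - rand.MIN + 1); // [0, 1)
  return min + Math.floor(u * (max - min + 1));
}
```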

> I'm not sure I understand how/why two prngs with the same seed would have different states.

A seed is what you use to initialize PRNG state. That state evolves with each generated value such that it no longer corresponds to the state initialized by the originally provided seed. When two or more PRNGs share underlying state, each generation of a pseudorandom variate mutates the state of the other PRNGs such that the next generated values no longer correspond to the sequence of values generated in the scenario where a single PRNG owns the underlying state.
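
Roughly, as a sketch (illustrative names; stdlib exposes this through an option controlling whether provided state is copied at instantiation):

```js
// Sketch: two generators draw from one shared state object. Each draw from
// either generator advances the shared state, so neither produces the
// sequence it would have produced had it owned the state alone.
function makeGenerator(shared) {
  return function next() {
    shared.value = (Math.imul(shared.value, 1664525) + 1013904223) >>> 0;
    return shared.value;
  };
}

const shared = { value: 1234 >>> 0 }; // both generators start from the same seed
const a = makeGenerator(shared);
const b = makeGenerator(shared);

a(); // advances the shared state...
b(); // ...so b continues from where a left off rather than repeating a's value
```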

@bakkot

bakkot commented Feb 28, 2025

Weak reproducibility is much less useful. Also, on the web, people inevitably end up depending on theoretically runtime-specific details, so everyone would eventually be forced to stick to one algorithm anyway.

@kgryte

kgryte commented Feb 28, 2025

> with a good PRNG there's no need to run simulations with different PRNGs because the output of the PRNG will be effectively indistinguishable from noise.

Minor quibble: in my experience, this isn't true. PRNGs have different periodicities, correlations, and statistical weaknesses. Best practice is to run simulations across different PRNG algorithms, as doing so helps detect sensitivity to the choice of PRNG. Consistent results across multiple PRNGs helps ensure that results are robust.

That said, most individuals won't be running scientific experiments in JS runtimes, so, in many cases, having a single seedable PRNG is good enough.

@tabatkins

Yeah, compat pressure at web scale will almost certainly require browsers to converge on a de facto algorithm anyway. Better to shortcut that compat pain (and unpredictable race-to-bottom) and just specify the desired algorithm up front.

> That is different than a random integer on the range [0|1, max_prng_int). The latter is more fundamental and a property of the PRNG itself.

At the time random ints show up in the API, we can provide a Math.MAX_PRNG_INTEGER constant so you can get that precise range if desired.

I'm sure there are good uses for a precisely-matching int, but for the common web use it'll definitely be a minority use-case, so I don't want to jump the gun and expose it ahead of any design work on randomInt().

> A seed is what you use to initialize PRNG state. That state evolves with each generated value such that it no longer corresponds to the state initialized by the originally provided seed. When two or more PRNGs share underlying state, each generation of a pseudorandom variate mutates the state of the other PRNGs such that the next generated values no longer correspond to the sequence of values generated in the scenario where a single PRNG owns the underlying state.

Ah, ok, I didn't understand the process you were referring to. With the current API, you could do this by hand by copying around the state after each generation.
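
Something like this, as a sketch (random/getState/setState are hypothetical stand-ins for whatever the API ends up exposing):

```js
// Sketch: emulate shared state by re-syncing the second generator after every
// draw from the first (and vice versa).
function drawAndSync(from, to) {
  const value = from.random();  // draw, which advances `from`'s state
  to.setState(from.getState()); // copy the evolved state into the other PRNG
  return value;
}
```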

Could you elaborate on the use-cases for this? It sounds like it just injects a little more entropy into the sequences, but if you initialized the same set of prngs with the same seeds and pulled from them in the same sequence, you'd still get the same set of values; it's only different from a single prng because there's presumably some non-state data held by each separate prng that can differ between them and cause them to evolve the state differently. What's the benefit of this over just generating seeds with better entropy?
