Description

Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm

`Hash::hash_slice` has a bunch of text clarifying that `h.hash_slice(&[a, b]); h.hash_slice(&[c]);` is not guaranteed to be the same as `h.hash_slice(&[a]); h.hash_slice(&[b, c]);`.

However, `Hasher::write` is unclear about whether that same rule applies to it. It's very clear that `.write(&[a])` is not the same as `.write_u8(a)`, but not whether the same sequence of bytes passed to `write` is supposed to hash the same way even if they arrive in different groupings, like `h.write(&[a, b]); h.write(&[c]);` vs `h.write(&[a]); h.write(&[b, c]);`.
This is important for the same kind of things as the `VecDeque` example mentioned on `hash_slice`. If I have a circular byte buffer, is it legal for its `Hash` implementation to just `.write` the two parts? Or does it need to `write_u8` all the individual bytes, since two circular buffers should compare equal regardless of where the split happens to be?
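For concreteness, the circular-buffer case can be sketched like this (the `RingBytes` type and its fields are hypothetical, made up purely for illustration):

```rust
use std::hash::{Hash, Hasher};

// Hypothetical circular byte buffer: logically `head` followed by `tail`,
// stored as two separate contiguous parts.
struct RingBytes<'a> {
    head: &'a [u8],
    tail: &'a [u8],
}

impl Hash for RingBytes<'_> {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Option A: one `write` per part. Only correct if the hasher treats
        // this the same as a single `write` of the concatenation -- exactly
        // the guarantee this issue is asking about.
        state.write(self.head);
        state.write(self.tail);
        // Option B (chunking-proof but much slower): one write_u8 per byte:
        //   for &b in self.head.iter().chain(self.tail) { state.write_u8(b); }
    }
}

fn main() {
    use std::collections::hash_map::DefaultHasher;
    // Same logical contents, different split points.
    let a = RingBytes { head: &[1, 2], tail: &[3] };
    let b = RingBytes { head: &[1], tail: &[2, 3] };
    let (mut ha, mut hb) = (DefaultHasher::new(), DefaultHasher::new());
    a.hash(&mut ha);
    b.hash(&mut hb);
    // Holds today because std's SipHasher buffers partial words, but
    // that is observed behaviour, not a documented guarantee.
    assert_eq!(ha.finish(), hb.finish());
}
```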
Given that `Hash for str` and `Hash for [T]` are doing prefix-freedom already, it feels to me like `write` should not be doing it again.
Also, our `SipHasher` implementation is going out of its way to maintain "different chunkings of `write`s are fine":

rust/library/core/src/hash/sip.rs
Lines 264 to 308 in 6bf3008

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize `SipHasher` to be faster.
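That behaviour is directly observable through `DefaultHasher` (which wraps `SipHasher`) in current std, though, as noted above, it is observed behaviour rather than a documented guarantee:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

fn main() {
    // Feed the same byte sequence with two different chunkings.
    let mut h1 = DefaultHasher::new();
    h1.write(&[0xaa, 0xbb]);
    h1.write(&[0xcc]);

    let mut h2 = DefaultHasher::new();
    h2.write(&[0xaa]);
    h2.write(&[0xbb, 0xcc]);

    // SipHasher buffers partial words internally, so only the overall
    // byte stream matters, not the `write` boundaries.
    assert_eq!(h1.finish(), h2.finish());
}
```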
cc #80303, which led to this text in `hash_slice`.
Activity
tczajka commented on Feb 15, 2022
While we are at it, it would be good to also clarify what prefix-freedom means in the presence of a mix of `write` calls, `write_u64` calls, etc. when the type is also `Eq`.

I propose the following rules for `Hash::hash`:

- Compress the sequence of `Hasher` method calls by combining consecutive `write` calls, concatenating the byte slices. Calls to other `Hasher` methods are not combined.
- If `x == y`, the compressed sequences of calls must be identical.
- If `x != y`, the compressed sequences must differ, e.g. by a call to some `Hasher` method with different arguments. For `write`, the two (compressed) byte sequence arguments must be such that neither is a prefix of the other.

There should also be analogous rules for `Hash::hash_slice`, but I'm not sure what they should be, because currently the `Hash` documentation states that it's not OK to hash a `VecDeque` by using two calls to `hash_slice`. Is there a reason for this? I think it should be OK, i.e. the `Hash` contract should say that hashing two slices is equivalent to hashing a concatenated slice.

Either way, I think for `Hash::hash_slice` a decision should be made one way or the other. Either:

- treat `hash_slice(&[a]); hash_slice(&[b]);` as equivalent to `hash_slice(&[a, b])`, in which case the documentation of `Hash` should state that as a requirement; or
- `hash_slice` should change, because it currently causes a collision on such calls.

Amanieu commented on Feb 16, 2022
I strongly disagree with the notion that `h.write(&[a, b]); h.write(&[c]);` and `h.write(&[a]); h.write(&[b, c]);` are required to result in the same hash value. This would prevent a lot of optimizations on hashers, in particular the use of unaligned memory accesses.

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and the last 4 bytes (with 1 byte of overlap) and hashing those two values. This is much more efficient than hashing each byte individually or buffering the bytes until they form a full word.
We should absolutely optimize `SipHasher` to be faster.
tczajka commented on Feb 16, 2022
Wouldn't that mean that the strings "abcdefg" and "abcddefg" always hash to the same value? If so, that would make a poor `Hasher`.

Mark-Simulacrum commented on Feb 16, 2022
I think more generally, beyond the optimization for unaligned memory access, it seems easy to assume a simple hasher -- e.g., FxHash from rustc -- is not going to keep a buffer around to hold 'partial' writes prior to inserting them into the hash function. If you write a partial slice, it'll still end up hashing a full `usize` -- so writing a series of slices rather than one large one can have a large impact.

(Effectively, this is a form of zero-padding the input buffer to fit an 8-byte block.)
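A toy hasher along those lines (deliberately simplified; this is not FxHash, just an illustration of how zero-padding each `write` makes the chunking leak into the result):

```rust
use std::hash::Hasher;

// Toy word-at-a-time hasher with no internal buffer: every `write` is
// split into u64 blocks, the final partial block is zero-padded, and each
// block is folded in immediately. NOT a real hash function.
struct ToyBlockHasher(u64);

impl Hasher for ToyBlockHasher {
    fn write(&mut self, bytes: &[u8]) {
        for chunk in bytes.chunks(8) {
            let mut block = [0u8; 8]; // zero padding for a partial chunk
            block[..chunk.len()].copy_from_slice(chunk);
            self.0 = (self.0 ^ u64::from_le_bytes(block)).wrapping_mul(0x100_0000_01B3);
        }
    }
    fn finish(&self) -> u64 {
        self.0
    }
}

fn main() {
    // One `write` of 3 bytes folds in a single padded block...
    let mut one = ToyBlockHasher(0);
    one.write(&[1, 2, 3]);
    // ...but the same bytes split across two `write`s fold in two blocks,
    // so the grouping of the writes changes the result.
    let mut two = ToyBlockHasher(0);
    two.write(&[1, 2]);
    two.write(&[3]);
    assert_ne!(one.finish(), two.finish());
}
```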
tczajka commented on Feb 16, 2022
If multiple calls to `write` are not concatenated, the section about "prefix collisions" in the `Hash` documentation really needs rewriting, because it becomes very unclear what it means for one sequence of calls to be a prefix of another sequence of calls.

`impl Hash for str` currently assumes that it is the same. It calls `write_u8(0xff)` as the end marker rather than `write(&[0xff])`. If it's not the same thing, it's a bug.

RalfJung commented on Feb 20, 2022
Why that? As long as `write_u8(0xff)` always hashes the same way, the `impl Hash for str` seems fine.

(Incidentally, I just wondered what that `0xff` is for anyway. There is no comment explaining it, so the reader has to resort to guessing.)

bjorn3 commented on Feb 20, 2022
`0xff` can never exist in a valid UTF-8 string. The only bit patterns used in the UTF-8 encoding of any codepoint are `0xxxxxxx`, `10xxxxxx`, `110xxxxx`, `1110xxxx` and `11110xxx`; `11111111` is not valid.

RalfJung commented on Feb 20, 2022
That doesn't explain why `write_u8` is called at all. My guess is that it serves to give a byte slice and a `str` with the same data different hashes, but (a) it's just a guess, and (b) that still doesn't explain why one would want those hashes to differ in the first place.

tczajka commented on Feb 20, 2022
It is required for the property that unequal values write different sequences to the `Hasher` (at least for standard types).

For instance, suppose that a hasher (say, `SipHasher`) were to treat `write_u8(0xff)` the same way as it treats `write(&[0x41])`. Then this would cause a guaranteed collision between `("AAA", "AAA")` and `("AA", "AAAA")` regardless of the random seed inside `SipHasher`, destroying its DoS protection.

bjorn3 commented on Feb 20, 2022
To ensure that hashing `abc` and then `def` gives a different hash from first hashing `abcd` and then `ef`.

scottmcm commented on Feb 20, 2022
Hmm, if hashers don't merge `write`s that's a shame, since the nice `\xFF` trick for `str` ends up not really being any better: the hasher will do a whole block for that one byte anyway, so it's no improvement over just hashing the length.

(And the `AHash` approach of length-prehashing on every `write` makes the `\xFF` pointless, so I feel like it's wrong for `AHash` to do that regardless.)

RalfJung commented on Feb 20, 2022
That sounds like a job for the slice hash function (that `str` calls), not something `str` should do. And indeed that function hashes the length, so the `write_u8` is unnecessary to achieve the goal you state.

No, it wouldn't, since the lengths of the strings are also hashed.
tczajka commented on Feb 20, 2022

This is not true. The `str` `Hash` implementation doesn't call the slice `Hash` implementation; it calls the `Hasher` `write` and `write_u8` methods directly.
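What each impl actually writes can be checked with a small recording `Hasher` (a throwaway diagnostic type, not anything in std) that logs the raw bytes it receives:

```rust
use std::hash::{Hash, Hasher};

// Throwaway "hasher" that records the raw byte stream it receives, so we
// can see exactly what a Hash impl writes. (write_u8, write_usize, etc.
// fall through to their default impls, which forward to `write`.)
#[derive(Default)]
struct Recorder(Vec<u8>);

impl Hasher for Recorder {
    fn write(&mut self, bytes: &[u8]) {
        self.0.extend_from_slice(bytes);
    }
    fn finish(&self) -> u64 {
        0
    }
}

fn main() {
    // str: its bytes, then the 0xff terminator -- no length anywhere.
    let mut r = Recorder::default();
    "ab".hash(&mut r);
    assert_eq!(r.0, [b'a', b'b', 0xff]);

    // [u8]: a usize length prefix, then the bytes -- a different scheme.
    let mut r = Recorder::default();
    b"ab"[..].hash(&mut r);
    assert_eq!(r.0.len(), std::mem::size_of::<usize>() + 2);
    assert!(r.0.ends_with(b"ab"));
}
```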