Hasher::write should clarify its "whole unit" behaviour #94026

Closed

Description

scottmcm (Member)

Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm

Hash::hash_slice has a bunch of text clarifying that h.hash_slice(&[a, b]); h.hash_slice(&[c]); is not guaranteed to be the same as h.hash_slice(&[a]); h.hash_slice(&[b, c]);.

However, the documentation for Hasher::write doesn't say whether that same rule applies to it. It's very clear that .write(&[a]) is not the same as .write_u8(a), but not whether writing the same sequence of bytes is supposed to produce the same result when the bytes are split into different groupings, like h.write(&[a, b]); h.write(&[c]); vs h.write(&[a]); h.write(&[b, c]);.

This is important for the same kind of things as the VecDeque example mentioned on hash_slice. If I have a circular byte buffer, is it legal for its Hash to just .write the two parts? Or does it need to write_u8 all the individual bytes since two circular buffers should compare equal regardless of where the split happens to be?
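For concreteness, here's a hypothetical ring-buffer type (the name and fields are made up for illustration) whose Hash impl writes its two contiguous halves; whether that's correct hinges exactly on the question above:

use std::hash::{Hash, Hasher};

// Hypothetical circular byte buffer: the logical contents are `head`
// followed by `tail`, but the split point is an implementation detail
// that equal buffers need not share.
struct RingBuf {
    head: Vec<u8>,
    tail: Vec<u8>,
}

impl Hash for RingBuf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Only sound if `write(a); write(b)` is guaranteed to hash the same
        // as a single `write` of the concatenation; otherwise equal buffers
        // with different split points could hash differently.
        state.write(&self.head);
        state.write(&self.tail);
    }
}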

Given that Hash for str and Hash for [T] are doing prefix-freedom already, it feels to me like write should not be doing it again.
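(For reference, the slice impl looks roughly like this at the moment; this is a paraphrase of the standard library, not a verbatim copy:)

impl<T: Hash> Hash for [T] {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.len().hash(state);        // length prefix gives prefix-freedom
        Hash::hash_slice(self, state); // then the elements themselves
    }
}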

Also, our SipHasher implementation is going out of its way to maintain the "different chunking of writes is fine" behaviour:

fn write(&mut self, msg: &[u8]) {
    let length = msg.len();
    self.length += length;

    let mut needed = 0;

    if self.ntail != 0 {
        needed = 8 - self.ntail;
        // SAFETY: `cmp::min(length, needed)` is guaranteed to not be over `length`
        self.tail |= unsafe { u8to64_le(msg, 0, cmp::min(length, needed)) } << (8 * self.ntail);
        if length < needed {
            self.ntail += length;
            return;
        } else {
            self.state.v3 ^= self.tail;
            S::c_rounds(&mut self.state);
            self.state.v0 ^= self.tail;
            self.ntail = 0;
        }
    }

    // Buffered tail is now flushed, process new input.
    let len = length - needed;
    let left = len & 0x7; // len % 8

    let mut i = needed;
    while i < len - left {
        // SAFETY: because `len - left` is the biggest multiple of 8 under
        // `len`, and because `i` starts at `needed` where `len` is `length - needed`,
        // `i + 8` is guaranteed to be less than or equal to `length`.
        let mi = unsafe { load_int_le!(msg, i, u64) };

        self.state.v3 ^= mi;
        S::c_rounds(&mut self.state);
        self.state.v0 ^= mi;

        i += 8;
    }

    // SAFETY: `i` is now `needed + len.div_euclid(8) * 8`,
    // so `i + left` = `needed + len` = `length`, which is by
    // definition equal to `msg.len()`.
    self.tail = unsafe { u8to64_le(msg, i, left) };
    self.ntail = left;
}

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

cc #80303, which led to this text in hash_slice.

Activity

Labels T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue) and I-libs-api-nominated (Nominated for discussion during a libs-api team meeting) added on Feb 15, 2022
tczajka commented on Feb 15, 2022

While we are at it, it would be good to also clarify what prefix-freedom means in the presence of a mix of write calls, write_u64 calls, etc., when the type is also Eq.

I propose the following rules for Hash::hash:

  • Logically compress the sequence of Hasher method calls by combining consecutive write calls, concatenating the byte slices. Calls to other Hasher methods are not combined.
  • If x == y, the compressed sequences of calls must be identical.
  • If x != y:
    • The compressed sequences of calls must (or just "should"?) be different.
    • Additionally, after the initial sequence of identical calls, the next call must be to the same Hasher method with different arguments.
    • Additionally, if the first different call is to write, the two (compressed) byte sequence arguments must be such that neither is a prefix of the other.
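As a worked illustration of the compression rule in the first bullet (the values are arbitrary):

// These two call sequences compress to the same thing:
//
//   h.write(&[1, 2]); h.write(&[3]); h.write_u32(4);
//   h.write(&[1]); h.write(&[2, 3]); h.write_u32(4);
//
// both become:  write(&[1, 2, 3]); write_u32(4)
//
// so under the rules above they must produce the same hash; the write_u32
// call is a separate unit and is never merged with the byte writes.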

There should also be analogous rules for Hash::hash_slice, but I'm not sure what they should be, because currently the Hash documentation states that it's not OK to hash VecDeque by using two calls to hash_slice. Is there a reason for this? I think it should be OK, i.e. the Hash contract should say that hashing two slices is equivalent to hashing a concatenated slice.

Either way, I think for Hash::hash_slice a decision should be made one way or the other. Either:

  • We treat hash_slice(&[a]); hash_slice(&[b]); as equivalent to hash_slice(&[a, b]), in which case the documentation of Hash should state that as a requirement. Or:
  • We treat them as different, in which case the default implementation of hash_slice should change, because it currently causes a collision on such calls.
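For reference, the default hash_slice is (roughly) just a per-element loop, which is why the two call patterns in the first bullet currently issue identical Hasher calls:

fn hash_slice<H: Hasher>(data: &[Self], state: &mut H)
where
    Self: Sized,
{
    for piece in data {
        piece.hash(state)
    }
}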
Amanieu (Member) commented on Feb 16, 2022

I strongly disagree with the notion that h.write(&[a, b]); h.write(&[c]); and h.write(&[a]); h.write(&[b, c]); are required to result in the same hash value. This would prevent a lot of optimizations on hashers, in particular with the use of unaligned memory access.

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values. This is much more efficient than hashing each byte individually or buffering the bytes until they form a full word.
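A minimal sketch of that fast path (the function name and return shape are made up for illustration; only the read pattern matters here):

use std::convert::TryInto;

// Hash a slice of 4..=8 bytes by reading its first and last four bytes
// (overlapping when len < 8) instead of looping over individual bytes.
fn hash_4_to_8(bytes: &[u8]) -> (u32, u32) {
    debug_assert!((4..=8).contains(&bytes.len()));
    let head = u32::from_le_bytes(bytes[..4].try_into().unwrap());
    let tail = u32::from_le_bytes(bytes[bytes.len() - 4..].try_into().unwrap());
    (head, tail) // these two words would then be mixed into the hasher state
}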

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

We should absolutely optimize SipHasher to be faster.

tczajka commented on Feb 16, 2022

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values.

Wouldn't that mean that strings "abcdefg" and "abcddefg" always hash to the same value? If so, that would make a poor Hasher.
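Using the hypothetical hash_4_to_8 sketch above, and assuming the length is not separately mixed into the state:

// Both inputs read the same word pair ("abcd", "defg"), so they collide.
assert_eq!(hash_4_to_8(b"abcdefg"), hash_4_to_8(b"abcddefg"));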

Mark-Simulacrum (Member) commented on Feb 16, 2022

I think more generally, beyond the optimization for unaligned memory access, it seems easy to assume that a simple hasher -- e.g., FxHash from rustc -- is not going to keep a buffer around for 'partial' writes before feeding them into the hash function. If you write a partial slice, it'll still end up hashing a full usize -- so writing a series of slices rather than one large one can have a large impact.

(Effectively, this is a form of zero-padding the input buffer to fit an 8-byte block.)
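A minimal sketch of such a hasher (not rustc's actual FxHash code; the constant and mixing function are illustrative), showing why the grouping of writes changes the result:

use std::convert::TryInto;

#[derive(Default)]
struct TinyFx(u64);

impl TinyFx {
    fn add(&mut self, word: u64) {
        self.0 = (self.0.rotate_left(5) ^ word).wrapping_mul(0x51_7c_c1_b7_27_22_0a_95);
    }

    // Each call folds whole words into the state with no cross-call buffering.
    fn write(&mut self, mut bytes: &[u8]) {
        while bytes.len() >= 8 {
            self.add(u64::from_le_bytes(bytes[..8].try_into().unwrap()));
            bytes = &bytes[8..];
        }
        if !bytes.is_empty() {
            // Zero-pad the final partial chunk up to a full word.
            let mut buf = [0u8; 8];
            buf[..bytes.len()].copy_from_slice(bytes);
            self.add(u64::from_le_bytes(buf));
        }
    }
}

// `write(&[1, 2]); write(&[3])` folds two zero-padded words,
// while `write(&[1, 2, 3])` folds only one, so the results differ.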

tczajka commented on Feb 16, 2022

If multiple calls to write are not concatenated, the section about "prefix collisions" in the Hash documentation really needs rewriting, because it becomes very unclear what it means for one sequence of calls to be a prefix of another sequence of calls.

It's very clear that .write(&[a]) is not the same as .write_u8(a)

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.
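(The standard library's impl is roughly:)

impl Hash for str {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
        state.write_u8(0xff); // end marker; see the discussion below
    }
}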

RalfJung (Member) commented on Feb 20, 2022

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

Why that? As long as write_u8(0xff) always hashes the same way, the impl Hash for str seems fine.
(Incidentally, I just wondered what that 0xff is for anyway. There is no comment explaining it so the reader has to resort to guessing.)

bjorn3 (Member) commented on Feb 20, 2022

0xff can never exist in a valid UTF-8 string. The only bit patterns used in the UTF-8 encoding of any codepoint are 0xxxxxxx, 10xxxxxx, 110xxxxx, 1110xxxx and 11110xxx. 11111111 is not valid.

RalfJung (Member) commented on Feb 20, 2022

That doesn't explain why write_u8 is called at all. My guess is that it serves to give a byte slice and str with the same data different hashes, but (a) it's just a guess, and (b) that still doesn't explain why one would want those hashes to differ in the first place.

tczajka commented on Feb 20, 2022

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

Why that? As long as write_u8(0xff) always hashes the same way, the impl Hash for str seems fine.

It is required for the property that unequal values write different sequences to the Hasher (at least for standard types).

For instance, suppose that a hasher (say, SipHasher) were to treat write_u8(0xff) the same way as it treats write(&[0x41]).

Then this would cause a guaranteed collision between ("AAA", "AAA") and ("AA", "AAAA") regardless of the random seed inside SipHasher, destroying its DoS protection.
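Spelled out, the write sequences issued for the two tuples would be:

// ("AAA", "AAA"): write(b"AAA"); write_u8(0xff); write(b"AAA");  write_u8(0xff);
// ("AA", "AAAA"): write(b"AA");  write_u8(0xff); write(b"AAAA"); write_u8(0xff);
//
// If write_u8(0xff) contributed the same bytes as write(&[0x41]) (i.e. "A"),
// both sequences would reduce to the eight bytes 41 41 41 41 41 41 41 41,
// so the two tuples would always collide.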

bjorn3 (Member) commented on Feb 20, 2022

That doesn't explain why write_u8 is called at all.

To ensure that hashing abc and then def gives a different hash from first hashing abcd and then ef.

scottmcm (Member, Author) commented on Feb 20, 2022

Hmm, if hashers don't merge writes that's a shame, since the nice \xFF trick for str ends up not really being any better: it'll do a whole block for the one byte anyway, and thus doesn't matter compared to using the length.

(And the AHash approach of length-prehashing on every write makes the \xFF pointless, so I feel like it's wrong for AHash to do that regardless.)

RalfJung (Member) commented on Feb 20, 2022

To ensure that hashing abc and then def gives a different hash from first hashing abcd and then ef.

That sounds like a job for the slice hash function (that str calls), not something str should do. And indeed that function hashes the length, so the write_u8 is unnecessary to achieve the goal you state.

Then this would cause a guaranteed collision between ("AAA", "AAA") and ("AA", "AAAA") regardless of the random seed inside SipHasher, destroying its DoS protection.

No, it wouldn't, since the lengths of the strings are also hashed.

tczajka commented on Feb 20, 2022

That sounds like a job for the slice hash function (that str calls)

This is not true. The str Hash implementation doesn't call the slice Hash implementation; it calls the Hasher write and write_u8 methods directly.

22 remaining items
