Hasher::write should clarify its "whole unit" behaviour #94026

Closed

Description

scottmcm (Member)

Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm

Hash::hash_slice has a bunch of text clarifying that h.hash_slice(&[a, b]); h.hash_slice(&[c]); is not guaranteed to be the same as h.hash_slice(&[a]); h.hash_slice(&[b, c]);.

However, the documentation for Hasher::write doesn't say whether that same rule applies to it. It's very clear that .write(&[a]) is not the same as .write_u8(a), but not whether writing the same sequence of bytes is supposed to produce the same result when the bytes are split into different groupings, like h.write(&[a, b]); h.write(&[c]); vs h.write(&[a]); h.write(&[b, c]);.

This is important for the same kind of things as the VecDeque example mentioned on hash_slice. If I have a circular byte buffer, is it legal for its Hash to just .write the two parts? Or does it need to write_u8 all the individual bytes since two circular buffers should compare equal regardless of where the split happens to be?
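For concreteness, here's a hypothetical ring-buffer type (the name and fields are made up for illustration) whose Hash impl writes its two contiguous halves; whether that's correct hinges exactly on the question above:

use std::hash::{Hash, Hasher};

// Hypothetical circular byte buffer: the logical contents are `head`
// followed by `tail`, but the split point is an implementation detail
// that equal buffers need not share.
struct RingBuf {
    head: Vec<u8>,
    tail: Vec<u8>,
}

impl Hash for RingBuf {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Only sound if `write(a); write(b)` is guaranteed to hash the same
        // as a single `write` of the concatenation; otherwise equal buffers
        // with different split points could hash differently.
        state.write(&self.head);
        state.write(&self.tail);
    }
}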

Given that Hash for str and Hash for [T] are doing prefix-freedom already, it feels to me like write should not be doing it again.
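(For reference, the slice impl looks roughly like this at the moment; this is a paraphrase of the standard library, not a verbatim copy:)

impl<T: Hash> Hash for [T] {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.len().hash(state);        // length prefix gives prefix-freedom
        Hash::hash_slice(self, state); // then the elements themselves
    }
}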

Also, our SipHasher implementation is going out of its way to maintain the "different chunking of writes is fine" behaviour:

fn write(&mut self, msg: &[u8]) {
    let length = msg.len();
    self.length += length;

    let mut needed = 0;

    if self.ntail != 0 {
        needed = 8 - self.ntail;
        // SAFETY: `cmp::min(length, needed)` is guaranteed to not be over `length`
        self.tail |= unsafe { u8to64_le(msg, 0, cmp::min(length, needed)) } << (8 * self.ntail);
        if length < needed {
            self.ntail += length;
            return;
        } else {
            self.state.v3 ^= self.tail;
            S::c_rounds(&mut self.state);
            self.state.v0 ^= self.tail;
            self.ntail = 0;
        }
    }

    // Buffered tail is now flushed, process new input.
    let len = length - needed;
    let left = len & 0x7; // len % 8

    let mut i = needed;
    while i < len - left {
        // SAFETY: because `len - left` is the biggest multiple of 8 under
        // `len`, and because `i` starts at `needed` where `len` is `length - needed`,
        // `i + 8` is guaranteed to be less than or equal to `length`.
        let mi = unsafe { load_int_le!(msg, i, u64) };

        self.state.v3 ^= mi;
        S::c_rounds(&mut self.state);
        self.state.v0 ^= mi;

        i += 8;
    }

    // SAFETY: `i` is now `needed + len.div_euclid(8) * 8`,
    // so `i + left` = `needed + len` = `length`, which is by
    // definition equal to `msg.len()`.
    self.tail = unsafe { u8to64_le(msg, i, left) };
    self.ntail = left;
}

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

cc #80303, which led to this text in hash_slice.

Activity

Labels T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue) and I-libs-api-nominated (Nominated for discussion during a libs-api team meeting) added on Feb 15, 2022
tczajka commented on Feb 15, 2022

While we are at it, it would be good to also clarify what prefix-freedom means in the presence of a mix of write calls, write_u64 calls, etc., when the type is also Eq.

I propose the following rules for Hash::hash:

  • Logically compress the sequence of Hasher method calls by combining consecutive write calls, concatenating the byte slices. Calls to other Hasher methods are not combined.
  • If x == y, the compressed sequences of calls must be identical.
  • If x != y:
    • The compressed sequences of calls must (or just "should"?) be different.
    • Additionally, after the initial sequence of identical calls, the next call must be to the same Hasher method with different arguments.
    • Additionally, if the first different call is to write, the two (compressed) byte sequence arguments must be such that neither is a prefix of the other.
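As a worked illustration of the compression rule in the first bullet (the values are arbitrary):

// These two call sequences compress to the same thing:
//
//   h.write(&[1, 2]); h.write(&[3]); h.write_u32(4);
//   h.write(&[1]); h.write(&[2, 3]); h.write_u32(4);
//
// both become:  write(&[1, 2, 3]); write_u32(4)
//
// so under the rules above they must produce the same hash; the write_u32
// call is a separate unit and is never merged with the byte writes.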

There should also be analogous rules for Hash::hash_slice, but I'm not sure what they should be, because currently the Hash documentation states that it's not OK to hash VecDeque by using two calls to hash_slice. Is there a reason for this? I think it should be OK, i.e. the Hash contract should say that hashing two slices is equivalent to hashing a concatenated slice.

Either way, I think for Hash::hash_slice a decision should be made one way or the other. Either:

  • We treat hash_slice(&[a]); hash_slice(&[b]); as equivalent to hash_slice(&[a, b]), in which case the documentation of Hash should state that as a requirement. Or:
  • We treat them as different, in which case the default implementation of hash_slice should change, because it currently causes a collision on such calls.
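For reference, the default hash_slice is (roughly) just a per-element loop, which is why the two call patterns in the first bullet currently issue identical Hasher calls:

fn hash_slice<H: Hasher>(data: &[Self], state: &mut H)
where
    Self: Sized,
{
    for piece in data {
        piece.hash(state)
    }
}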
Amanieu (Member) commented on Feb 16, 2022

I strongly disagree with the notion that h.write(&[a, b]); h.write(&[c]); and h.write(&[a]); h.write(&[b, c]); are required to result in the same hash value. This would prevent a lot of optimizations on hashers, in particular with the use of unaligned memory access.

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values. This is much more efficient than hashing each byte individually or buffering the bytes until they form a full word.
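A minimal sketch of that fast path (the function name and return shape are made up for illustration; only the read pattern matters here):

use std::convert::TryInto;

// Hash a slice of 4..=8 bytes by reading its first and last four bytes
// (overlapping when len < 8) instead of looping over individual bytes.
fn hash_4_to_8(bytes: &[u8]) -> (u32, u32) {
    debug_assert!((4..=8).contains(&bytes.len()));
    let head = u32::from_le_bytes(bytes[..4].try_into().unwrap());
    let tail = u32::from_le_bytes(bytes[bytes.len() - 4..].try_into().unwrap());
    (head, tail) // these two words would then be mixed into the hasher state
}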

So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.

We should absolutely optimize SipHasher to be faster.

tczajka commented on Feb 16, 2022

Consider the case of a 7-byte slice: this can be hashed by reading the first 4 bytes and last 4 bytes (with 1 byte of overlap) and hashing those two values.

Wouldn't that mean that strings "abcdefg" and "abcddefg" always hash to the same value? If so, that would make a poor Hasher.
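Using the hypothetical hash_4_to_8 sketch above, and assuming the length is not separately mixed into the state:

// Both inputs read the same word pair ("abcd", "defg"), so they collide.
assert_eq!(hash_4_to_8(b"abcdefg"), hash_4_to_8(b"abcddefg"));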

Mark-Simulacrum (Member) commented on Feb 16, 2022

I think more generally, beyond the optimization for unaligned memory access, it seems easy to assume that a simple hasher -- e.g., FxHash from rustc -- is not going to keep a buffer around for 'partial' writes before feeding them into the hash function. If you write a partial slice, it'll still end up hashing a full usize -- so writing a series of slices rather than one large one can have a large impact.

(Effectively, this is a form of zero-padding the input buffer to fit an 8-byte block.)
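A minimal sketch of such a hasher (not rustc's actual FxHash code; the constant and mixing function are illustrative), showing why the grouping of writes changes the result:

use std::convert::TryInto;

#[derive(Default)]
struct TinyFx(u64);

impl TinyFx {
    fn add(&mut self, word: u64) {
        self.0 = (self.0.rotate_left(5) ^ word).wrapping_mul(0x51_7c_c1_b7_27_22_0a_95);
    }

    // Each call folds whole words into the state with no cross-call buffering.
    fn write(&mut self, mut bytes: &[u8]) {
        while bytes.len() >= 8 {
            self.add(u64::from_le_bytes(bytes[..8].try_into().unwrap()));
            bytes = &bytes[8..];
        }
        if !bytes.is_empty() {
            // Zero-pad the final partial chunk up to a full word.
            let mut buf = [0u8; 8];
            buf[..bytes.len()].copy_from_slice(bytes);
            self.add(u64::from_le_bytes(buf));
        }
    }
}

// `write(&[1, 2]); write(&[3])` folds two zero-padded words,
// while `write(&[1, 2, 3])` folds only one, so the results differ.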

tczajka commented on Feb 16, 2022

If multiple calls to write are not concatenated, the section about "prefix collisions" in the Hash documentation really needs rewriting, because it becomes very unclear what it means for one sequence of calls to be a prefix of another sequence of calls.

It's very clear that .write(&[a]) is not the same as .write_u8(a)

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.
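(The standard library's impl is roughly:)

impl Hash for str {
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write(self.as_bytes());
        state.write_u8(0xff); // end marker; see the discussion below
    }
}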

RalfJung (Member) commented on Feb 20, 2022

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

Why that? As long as write_u8(0xff) always hashes the same way, the impl Hash for str seems fine.
(Incidentally, I just wondered what that 0xff is for anyway. There is no comment explaining it so the reader has to resort to guessing.)

bjorn3 (Member) commented on Feb 20, 2022

0xff can never exist in a valid UTF-8 string. The only bit patterns used in the UTF-8 encoding of any codepoint are 0xxxxxxx, 10xxxxxx, 110xxxxx, 1110xxxx and 11110xxx. 11111111 is not valid.

RalfJung (Member) commented on Feb 20, 2022

That doesn't explain why write_u8 is called at all. My guess is that it serves to give a byte slice and str with the same data different hashes, but (a) it's just a guess, and (b) that still doesn't explain why one would want those hashes to differ in the first place.

tczajka commented on Feb 20, 2022

impl Hash for str currently assumes that it is the same. It calls write_u8(0xff) as the end marker rather than write(&[0xff]). If it's not the same thing, it's a bug.

Why that? As long as write_u8(0xff) always hashes the same way, the impl Hash for str seems fine.

It is required for the property that unequal values write different sequences to the Hasher (at least for standard types).

For instance, suppose that a hasher (say, SipHasher) were to treat write_u8(0xff) the same way as it treats write(&[0x41]).

Then this would cause a guaranteed collision between ("AAA", "AAA") and ("AA", "AAAA") regardless of the random seed inside SipHasher, destroying its DoS protection.
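Spelled out, the write sequences issued for the two tuples would be:

// ("AAA", "AAA"): write(b"AAA"); write_u8(0xff); write(b"AAA");  write_u8(0xff);
// ("AA", "AAAA"): write(b"AA");  write_u8(0xff); write(b"AAAA"); write_u8(0xff);
//
// If write_u8(0xff) contributed the same bytes as write(&[0x41]) (i.e. "A"),
// both sequences would reduce to the eight bytes 41 41 41 41 41 41 41 41,
// so the two tuples would always collide.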

bjorn3 (Member) commented on Feb 20, 2022

That doesn't explain why write_u8 is called at all.

To ensure that hashing abc and then def gives a different hash from first hashing abcd and then ef.

scottmcm (Member, Author) commented on Feb 20, 2022

Hmm, if hashers don't merge writes that's a shame, since the nice \xFF trick for str ends up not really being any better: it'll do a whole block for the one byte anyway, and thus doesn't matter compared to using the length.

(And the AHash approach of length-prehashing on every write makes the \xFF pointless, so I feel like it's wrong for AHash to do that regardless.)

RalfJung (Member) commented on Feb 20, 2022

To ensure that hashing abc and then def gives a different hash from first hashing abcd and then ef.

That sounds like a job for the slice hash function (that str calls), not something str should do. And indeed that function hashes the length, so the write_u8 is unnecessary to achieve the goal you state.

Then this would cause a guaranteed collision between ("AAA", "AAA") and ("AA", "AAAA") regardless of the random seed inside SipHasher, destroying its DoS protection.

No, it wouldn't, since the lengths of the strings are also hashed.

tczajka commented on Feb 20, 2022

That sounds like a job for the slice hash function (that str calls)

This is not true. The str Hash implementation doesn't call the slice Hash implementation; it calls the Hasher write and write_u8 methods directly.

22 remaining items
