Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce HeaderValue type to replace use of String #27

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

yskopets
Copy link
Contributor

@yskopets yskopets commented Aug 10, 2020

@lizan
Copy link
Member

lizan commented Aug 10, 2020

@PiotrSikora PTAL. While it's not encouraged but we do have certain traffic contains non UTF-8 header value. Since spec allows it historically, we shouldn't block that.

Historically, HTTP has allowed field content with text in the
ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
through use of [RFC2047] encoding. In practice, most HTTP header
field values use only a subset of the US-ASCII charset [USASCII].
Newly defined header fields SHOULD limit their field values to
US-ASCII octets. A recipient SHOULD treat other octets in field
content (obs-text) as opaque data.

@PiotrSikora
Copy link
Member

@PiotrSikora PTAL. While it's not encouraged but we do have certain traffic contains non UTF-8 header value. Since spec allows it historically, we shouldn't block that.

This is actually very nuanced issue, since it exposes users to various attacks... but I agree that we should allow opaque data, and I had hostcalls returning Bytes until the week of open sourcing this SDK, so I need to figure out what I was thinking at the time.

@yskopets the HeaderValue looks nice, thanks! But where exactly do you want to use it in the SDK? hostcalls or traits? My thinking right now is that hostcalls should always return Bytes and then we could have "string" and "raw" accessors to the headers in traits. Reason being that 99%+ of the header values are valid UTF-8 strings, and I don't want to return the non-UTF-8 values to users not expecting/requesting them explicitly. Does this make sense?

(Note: I need to wrap a few things over the next 2 days, so I'll get back to this later in the week, since I don't want to make this breaking change in rush.)

@yskopets
Copy link
Contributor Author

I was thinking about returning HeaderValue from the hostcalls.

A typical developer journey starts from

for (name, value) in &request_headers() {
    println!("{}={}", name, value);
} 

Returning a type like HeaderValue makes this code both correct and convenient.

For comparison, request_headers_string() would loose correctness, and request_headers_raw() would loose convenience.

And if you decide to return Vec<(String, HeaderValue)> from get_map, then it seems natural to do the same for get_map_value.

Regarding set_map/set_map_value, I think they should use generics to allow any input: HeaderValue, String, &str, Vec<u8>, &[u8].

BTW, have you looked into https://crates.io/crates/http ? I started from there, but since it doesn't allow for pseudo headers like :path and :status I ended up creating a bare minimum version with only HeaderValue. Still, it would be nice to have HeaderMap and RequestBuilder as well.

@yskopets
Copy link
Contributor Author

@PiotrSikora After looking into other use cases, I think, it makes sense to rename HeaderValue into ByteString and use it everywhere in the API instead of String.

In particular, Vec<(String, String)> could be replaced with Vec<(ByteString, ByteString)>, Bytes (currently, alias for Vec<u8>) could be replaced with ByteString.

ByteString, in this design, is responsible for:

  • equality comparison against &str, String, &[u8]
  • Index* operations
  • Display and Debug implementations that cover non-UTF-8 strings

Users who need operations such as starts_with, ends_with, replace, etc could additionally use bstr crate (ByteSlice and ByteVec traits).

Another reason to have ByteString is to move Vec<u8> out of API into implementation details. Keeping in mind iovec types from proxy-wasm/spec#1, Vec<u8> could eventually be replaced with Bytes from bytes crate (to avoid allocating a separate Vec for each header name and value).

@codefromthecrypt
Copy link

wondering if this is a blocking change to move off our fork? considering all libraries discussed are not api stable (even http is 0.2.4)

@lizan, there's an implication that there is customer traffic that not only exists, but is specifically using rust-wasm-sdk. If that's so, can you have them comment what non-utf8 headers they are using? I mean directly because some times engineers make mistakes, and libraries change to suit them, then they stop using that feature. For example, many core libraries lack support of non-string and instead deal on decoding side (ex substituting with ? when bytes are messed up). I would argue this isn't a blocking change if there is no current user of this, especially in such a young library ecosystem

@codefromthecrypt
Copy link

here's a suggested way out.

  1. identify the data that caused the panic, assuming it was customer related and not a bytes generator
  2. add a test that proves this data doesn't panic (rather replaces with ? or something if it is invalidly encoded)
  3. migrate off the fork to the api as it stands, and continue a discussion of whether or not optimizing the api for binary headers is a good idea

@codefromthecrypt
Copy link

PS on the point of the actual change here, I would be really cautious about any binary api especially looking at the broader picture and impact of change. I've had some experience making fast parsers and binary is a double-edged sword. In worst case, people end up converting to strings in tight loops and you can't cache that anymore because they only have a ref to the binary form. Even if there is a utility to cache the string, it can be awkward to have someone do non-allocating comparisons with pre-existing strings leading to the same.

Bottom line is probably there are many reasons why nearly all http apis in languages choose to not allow arbitrary binary representations at the http logical abstraction. While primarily it would be about usability, performance would likely come quick second especially in code that runs in a proxy.

Even if the customer were able to convince this particular codebase to have a api that harms others, they won't be able to change the rest of the ecosystem other languages etc. IOTW I would backfill any missing UTF-8 test cases, but stop there. If there is code here that converts binary headers into strings, just backfill tests that prove they don't panic on malicious or poorly chosen header values.

@codefromthecrypt
Copy link

last note mainly to brain dump.. If it is really important non-interference, I would classify that differently than an api. For example if there are headers in as bytes and they need to stay exactly the same bytes, malicious or not, we could have a "normal" api, which is a lazy decoded view of them. At that point, only when reading back a weird value do you have to make a decision, and even if you decode into a string with ? substitution, the raw bytes stay as they are unless overwritten. This limits impact only to the ability to intentionally add poorly chosen binary values, effectively doing this intentionally is not supported by api, but pre-existing malicious or poorly chosen values remain supported.

@Spoonbender
Copy link

Just pointing out we were dealing with a similar issue in deno, see here: denoland/deno#7208
Notably, the Content-Disposition header is known for holding non-UTF data in the wild.

This concerns me both from a

  • compatibility perspective: we need envoy filters to be able to handle all real traffic, downstream and upstream, without malforming it (can't replace non-UTF8 chars with ?)
  • security perspective: concerned about panics, which unwind at wasm32-unknown-unknown and abort on wasm32-wasi, and might be exploitable for a DoS attack perhaps

If there is a concern about breaking the proxy-wasm-rust-sdk API, thus breaking existing filters, I propose we provide a backward-compatible build option (keep old unsafe API), thus providing ability to compile the existing unsafe filters which expects strings, using cargo features + rust conditional compilation abilities. This would exist side-by-side for some grace period before the unsafe and incorrect API is being decommissioned.

@Swaagie
Copy link
Contributor

Swaagie commented Feb 18, 2022

We've actually just run into this as well with Cookie having non utf-8 characters and https://github.com/proxy-wasm/proxy-wasm-rust-sdk/blob/master/src/hostcalls.rs#L241 panicking. We'll likely work around this by using the _bytes variants of the calls and manually parse so we can capture and handle the errors appropriately.

Having a HeaderValue or ByteString available seems like a good improvement. In the interim to enable the transition should each trait call the _bytes function variants internally? Moving utf-8 parsing to the traits and perform error propagation to allow consumers to act on it. And at the same time deprecate the non byte hostcalls functions. Thoughts?

PiotrSikora pushed a commit to PiotrSikora/proxy-wasm-rust-sdk that referenced this pull request Jul 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants