Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ability to set http headers without allocations #2119

Open
rabbbit opened this issue Jan 22, 2022 · 7 comments
Open

Ability to set http headers without allocations #2119

rabbbit opened this issue Jan 22, 2022 · 7 comments

Comments

@rabbbit
Copy link

rabbbit commented Jan 22, 2022

Hello,

It seems not possible to add headers without allocations. Adding a header seems to be costing 6 (or 4) allocations.

Ideas

My initial thought was that the AddHeaders API is inefficient, but @jstroem below benchmarked this not be a problem.

#### First thought - No "AddHeader" API.
type ResponseWriter interface {
	io.Writer
        AddHeaders(Headers)

however, transport.Headers always allocated two maps

type Headers struct {
	items map[string]string
	originalItems map[string]string
}

yarpc canonical headers are lowercase, while http canonical headers are capitalized.

This leads to a very paradoxical situation where:

if I add lower-case http header

	resw.AddHeaders(transport.NewHeadersWithCapacity(1).With(
		"abc", "",
	))

I get:

BenchmarkRespWriter-12           2514664               462.6 ns/op           672 B/op          4 allocs/op

This means two allocations for the maps in transport.Headers, 1 allocation for addToMetadata, and possibly 1 https://pkg.go.dev/net/http#CanonicalHeaderKey - not sure if the last one is included in my benchmark though.

But if I do correct HTTP canonical headers, which should be better

	resw.AddHeaders(transport.NewHeadersWithCapacity(1).With(
		"Abc", "",
	))

I get

BenchmarkRespWriter-12           2390634               446.2 ns/op           678 B/op          6 allocs/op

so it's worse - yarpc will convert my correct header to lower-case first (extra alloc) and one another extra alloc that I don't understand.

Possible solutions

There are a few options - which one would you recommend?

#### Just add `AddHeader`

It would be relatively easy to add ResponseWriter.AddHeader(key, value string) - this would save me two allocations, leaving the remaining two.

AddRawHeader, or make AddSystemHeader public

In order to avoid Go triggering https://pkg.go.dev/net/http#CanonicalHeaderKey for me later on anyway, I would rather pass in HTTP canonical headers (Abc and not abc).

This means I would need something like ResponseWriter.AddRawHeader(key, value string) that wouldn't do transport.CanonicalizeHeaderKey for me. But I'm not sure about the other implications of doing this.

Anything else?

Please advise.

@jstroem
Copy link

jstroem commented Feb 16, 2022

To start the discussion on this I made a commit with benchmarks and a naive implementation of AddHeader for each protocol. The commit can be found here: fc06510.

Internally in yarpc the responsewriters might have a header they build up, so I decided to only benchmark the add-step and not the entire response.

Benchmark as follows:

cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
pkg: go.uber.org/yarpc/transport/grpc
Benchmark_ResponseWriter_AddHeaders
Benchmark_ResponseWriter_AddHeaders/lowercase
Benchmark_ResponseWriter_AddHeaders/lowercase-12                 7628671               156.5 ns/op            32 B/op          1 allocs/op
Benchmark_ResponseWriter_AddHeaders/titlecase
Benchmark_ResponseWriter_AddHeaders/titlecase-12                 6875102               172.7 ns/op            35 B/op          2 allocs/op
Benchmark_ResponseWriter_AddHeader
Benchmark_ResponseWriter_AddHeader/lowercase
Benchmark_ResponseWriter_AddHeader/lowercase-12                 21641186                55.02 ns/op           32 B/op          1 allocs/op
Benchmark_ResponseWriter_AddHeader/titlecase
Benchmark_ResponseWriter_AddHeader/titlecase-12                 15583413                77.54 ns/op           35 B/op          2 allocs/op
--------------------------------------------------------
pkg: go.uber.org/yarpc/transport/http
Benchmark_ResponseWriter_AddHeaders
Benchmark_ResponseWriter_AddHeaders/lowercase
Benchmark_ResponseWriter_AddHeaders/lowercase-12                 4067061               280.8 ns/op           115 B/op          2 allocs/op
Benchmark_ResponseWriter_AddHeaders/titlecase
Benchmark_ResponseWriter_AddHeaders/titlecase-12                 4016554               291.7 ns/op           119 B/op          3 allocs/op
Benchmark_ResponseWriter_AddHeader
Benchmark_ResponseWriter_AddHeader/lowercase
Benchmark_ResponseWriter_AddHeader/lowercase-12                  6927727               193.8 ns/op           127 B/op          2 allocs/op
Benchmark_ResponseWriter_AddHeader/titlecase
Benchmark_ResponseWriter_AddHeader/titlecase-12                  5692008               200.9 ns/op           128 B/op          3 allocs/op
--------------------------------------------------------
pkg: go.uber.org/yarpc/transport/tchannel
Benchmark_ResponseWriter_AddHeaders
Benchmark_ResponseWriter_AddHeaders/lowercase
Benchmark_ResponseWriter_AddHeaders/lowercase-12                 9356582               126.7 ns/op             0 B/op          0 allocs/op
Benchmark_ResponseWriter_AddHeaders/titlecase
Benchmark_ResponseWriter_AddHeaders/titlecase-12                 5833646               204.2 ns/op             9 B/op          3 allocs/op
Benchmark_ResponseWriter_AddHeader
Benchmark_ResponseWriter_AddHeader/lowercase
Benchmark_ResponseWriter_AddHeader/lowercase-12                 28751949                42.84 ns/op            0 B/op          0 allocs/op
Benchmark_ResponseWriter_AddHeader/titlecase
Benchmark_ResponseWriter_AddHeader/titlecase-12                 11325139                88.67 ns/op            6 B/op          2 allocs/op

This shows that adding a AddHeader won't help on allocations.

@rabbbit
Copy link
Author

rabbbit commented Feb 22, 2022

Hey,

Thanks for looking into this!

There seems to be a lot to unpack here:

  • even if the benchmarks are not reducing the allocations, they saw 60% performance improvement in grpc/tchannel, and 30% reduction in http..
  • we do see allocation reduction in tchannel

So this would show a clear performance win. However, after a brief look at the code, I'm not sure if the benchmarks are valid - I think we just keep getting errors since we're adding duplicate headers (?).

Ultimately though - it should be possible to add headers with 0 allocations, right?

@rabbbit
Copy link
Author

rabbbit commented Feb 28, 2022

I looked at this a bit more, and tried to reproduce the "header creation doesn't allocate" situation.

And it appears:

  • the compiler is clever enough to inline map creation. I did not know it does this.
  • we still see 2-3x performance cost of doing AddHeaders as opposed to a single header - which is in line with your results.

The code below is somewhat convoluted - tried to reproduce what AddHeaders does with an intermediary struct containing two maps.

func newStore1(capacity int) *store1 {
	s := store1{
		store1: make(map[string]string, 1),
	}
	return &s
}

func (s *store1) with(k, v string) *store1 {
	s.store1[k] = v
	return s
}

func do1(out map[string]string, k, v string) {
	s := newStore1(1).with(k, v)
	out[k] = s.store1[k]
}

type store2 struct {
	store1 map[string]string
	store2 map[string]string
}

func newStore2(capacity int) *store2 {
	s := store2{
		store1: make(map[string]string, 1),
		store2: make(map[string]string, 1),
	}
	return &s
}

func (s *store2) with(k, v string) *store2 {
	s.store1[k] = v
	s.store2[k] = v
	return s
}

func do2(out map[string]string, k, v string) {
	s := newStore2(1).with(k, v)
	out[k] = s.store1[k]
}

func BenchmarkBaseline(b *testing.B) {
	out := make(map[string]string, 1)
	for i := 0; i < b.N; i++ {
		out["me"] = "hello"
	}
	assert.NotEqual(b, "", out["me"])
}

func BenchmarkStatic1(b *testing.B) {
	out := make(map[string]string, 1)
	for i := 0; i < b.N; i++ {
		do1(out, "me", "hello")
	}
	assert.NotEqual(b, "", out["me"])
}

func BenchmarkStatic2(b *testing.B) {
	out := make(map[string]string, 1)
	for i := 0; i < b.N; i++ {
		do2(out, "me", "hello")
	}
	assert.NotEqual(b, "", out["me"])
}

Results in:

BenchmarkBaseline
BenchmarkBaseline-12            127339767                9.543 ns/op           0 B/op          0 allocs/op
BenchmarkStatic1
BenchmarkStatic1-12             45420019                28.76 ns/op            0 B/op          0 allocs/op
BenchmarkStatic2
BenchmarkStatic2-12             25021603                47.36 ns/op            0 B/op          0 allocs/op

So significantly slower, and the cost grows for each internal map, as expected.

Adding //go:noinline to newStore{1,2} produced

BenchmarkBaseline
BenchmarkBaseline-12            126278928                9.278 ns/op           0 B/op          0 allocs/op
BenchmarkStatic1
BenchmarkStatic1-12              7363964               166.1 ns/op           344 B/op          3 allocs/op
BenchmarkStatic2
BenchmarkStatic2-12              3971288               289.1 ns/op           688 B/op          5 allocs/op

So, in summary:

  • AddHeader will not reduce allocations, but will speed things up
  • we still have a few unnecessary allocations we could likely get rid of - perhaps with headerKey inlining?

@jstroem
Copy link

jstroem commented Feb 28, 2022

I found one allocation to save in transport/grpc by switching from multierr.Combine to multierr.Append also a few others by being smart on what strings to use:

These are the optimizations:

name                                                         old time/op    new time/op    delta
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                           1.14µs ± 2%    1.12µs ±10%      ~     (p=0.251 n=9+10)
_ResponseWriter_AddHeaders/lowercase_no-value                   165ns ± 2%     103ns ±11%   -37.37%  (p=0.000 n=9+9)
_ResponseWriter_AddHeaders/titlecase                           1.21µs ± 3%    1.18µs ±11%      ~     (p=0.280 n=10+10)
_ResponseWriter_AddHeaders/titlecase_no-value                   192ns ± 4%     149ns ±15%   -22.46%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                            345ns ±14%     394ns ±10%   +14.29%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase                            357ns ± 9%     326ns ±11%    -8.69%  (p=0.021 n=9+10)
_ResponseWriter_AddHeaders/titlecase_with_rpc-header_prefix     413ns ± 4%     312ns ±15%   -24.46%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                            124ns ±10%     131ns ± 3%    +5.37%  (p=0.021 n=10+9)
_ResponseWriter_AddHeaders/titlecase                            209ns ± 9%     187ns ± 9%   -10.54%  (p=0.000 n=10+10)

name                                                         old alloc/op   new alloc/op   delta
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             264B ± 0%      232B ± 0%   -12.12%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/lowercase_no-value                   32.0B ± 0%      0.0B       -100.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase                             279B ± 0%      247B ± 0%   -11.47%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase_no-value                   35.0B ± 0%      3.0B ± 0%   -91.43%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             117B ±10%      122B ± 7%      ~     (p=0.090 n=10+10)
_ResponseWriter_AddHeaders/titlecase                             119B ± 6%      107B ± 7%   -10.17%  (p=0.000 n=9+10)
_ResponseWriter_AddHeaders/titlecase_with_rpc-header_prefix      172B ± 8%      102B ± 8%   -40.82%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                            0.00B          0.00B           ~     (all equal)
_ResponseWriter_AddHeaders/titlecase                            9.00B ± 0%     3.00B ± 0%   -66.67%  (p=0.000 n=10+10)

name                                                         old allocs/op  new allocs/op  delta
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             4.00 ± 0%      3.00 ± 0%   -25.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/lowercase_no-value                    1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase                             5.00 ± 0%      4.00 ± 0%   -20.00%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase_no-value                    2.00 ± 0%      1.00 ± 0%   -50.00%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             2.00 ± 0%      2.00 ± 0%      ~     (all equal)
_ResponseWriter_AddHeaders/titlecase                             3.00 ± 0%      2.00 ± 0%   -33.33%  (p=0.000 n=10+10)
_ResponseWriter_AddHeaders/titlecase_with_rpc-header_prefix      3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.000 n=10+10)
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeaders/lowercase                             0.00           0.00           ~     (all equal)
_ResponseWriter_AddHeaders/titlecase                             3.00 ± 0%      1.00 ± 0%   -66.67%  (p=0.000 n=10+10)

And for the addHeader:

name                                                         time/op
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                          1.04µs ± 8%
_ResponseWriter_AddHeader/lowercase_no-value                 17.9ns ± 7%
_ResponseWriter_AddHeader/titlecase                          1.05µs ±10%
_ResponseWriter_AddHeader/titlecase_no-value                 39.3ns ±11%
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                           259ns ± 5%
_ResponseWriter_AddHeader/titlecase                           161ns ± 7%
_ResponseWriter_AddHeader/titlecase_with_rpc-header_prefix    114ns ±10%
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                          47.6ns ±11%
_ResponseWriter_AddHeader/titlecase                           105ns ± 5%

name                                                         alloc/op
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            232B ± 0%
_ResponseWriter_AddHeader/lowercase_no-value                  0.00B     
_ResponseWriter_AddHeader/titlecase                            247B ± 0%
_ResponseWriter_AddHeader/titlecase_no-value                  3.00B ± 0%
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            121B ± 8%
_ResponseWriter_AddHeader/titlecase                            106B ± 8%
_ResponseWriter_AddHeader/titlecase_with_rpc-header_prefix    91.0B ±12%
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                           0.00B     
_ResponseWriter_AddHeader/titlecase                           6.00B ± 0%

name                                                         allocs/op
pkg:go.uber.org/yarpc/transport/grpc goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            3.00 ± 0%
_ResponseWriter_AddHeader/lowercase_no-value                   0.00     
_ResponseWriter_AddHeader/titlecase                            4.00 ± 0%
_ResponseWriter_AddHeader/titlecase_no-value                   1.00 ± 0%
pkg:go.uber.org/yarpc/transport/http goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            2.00 ± 0%
_ResponseWriter_AddHeader/titlecase                            1.00 ± 0%
_ResponseWriter_AddHeader/titlecase_with_rpc-header_prefix     0.00     
pkg:go.uber.org/yarpc/transport/tchannel goos:darwin goarch:amd64
_ResponseWriter_AddHeader/lowercase                            0.00     
_ResponseWriter_AddHeader/titlecase                            2.00 ± 0%

So as you say, AddHeader is faster and AddHeaders also becomes a bit faster best-case because of fewer allocations.

@jstroem
Copy link

jstroem commented Feb 28, 2022

I've created a PR with the current optimizations.

@biosvs biosvs closed this as completed Dec 12, 2024
@rabbbit
Copy link
Author

rabbbit commented Feb 21, 2025

@biosvs I don't think this one was actually completed - should we re-open? I still think this would be a significant improvement overall.

@biosvs biosvs reopened this Feb 25, 2025
@biosvs
Copy link
Collaborator

biosvs commented Feb 25, 2025

Yes, let's re-open this, as performance question was raised recently. I'll take a look into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants