@arjan-bal arjan-bal commented Oct 22, 2025

This PR removes two buffer copies when writing data frames to the underlying net.Conn: one within gRPC and the other in the framer. Care is taken to avoid extra heap allocations, which can hurt performance for smaller payloads.

A CL is out for review which allows using the framer to write frame headers. This PR duplicates the header writing code as a temporary workaround. This PR will be merged only after the CL is merged.

Results

Small payloads

Performance for small payloads improves slightly because a deferred statement was removed.

$ go run benchmark/benchmain/main.go -benchtime=60s -workloads=unary \
   -compression=off -maxConcurrentCalls=120 -trace=off \
   -reqSizeBytes=100 -respSizeBytes=100 -networkMode=Local -resultFile="${RUN_NAME}"

$ go run benchmark/benchresult/main.go unary-before unary-after
               Title       Before        After Percentage
            TotalOps      7600878      7653522     0.69%
             SendOps            0            0      NaN%
             RecvOps            0            0      NaN%
            Bytes/op     10007.07     10000.89    -0.07%
           Allocs/op       146.93       146.91     0.00%
             ReqT/op 101345040.00 102046960.00     0.69%
            RespT/op 101345040.00 102046960.00     0.69%
            50th-Lat    833.724µs    830.041µs    -0.44%
            90th-Lat   1.281969ms   1.275336ms    -0.52%
            99th-Lat   2.403961ms   2.360606ms    -1.80%
             Avg-Lat    946.123µs    939.734µs    -0.68%
           GoVersion     go1.24.8     go1.24.8
         GrpcVersion   1.77.0-dev   1.77.0-dev

Large payloads

Local benchmarks show a ~5-10% regression with 1 MB payloads on my dev machine. The profiles show increased time spent in the copy operation inside the buffered writer. Counterintuitively, copying the gRPC header and message data into a larger buffer improved performance by 4% (compared to master).

To check whether an extra copy really improves performance, I ran the k8s benchmark with 1 MB payloads and 100 concurrent streams. Across multiple runs, it showed a ~5% increase in QPS without the copies, and adding a copy back reduced performance.

Load test config file: loadtest.yaml

# 30 core client and server
Before
QPS: 498.284 (16.6095/server core)
Latencies (50/90/95/99/99.9%-ile): 233256/275972/281250/291803/298533 us
Server system time: 93.0164
Server user time:   142.533
Client system time: 97.2688
Client user time:   144.542

After
QPS: 526.776 (17.5592/server core)
Latencies (50/90/95/99/99.9%-ile): 211010/263189/270969/280656/288828 us
Server system time: 96.5959
Server user time:   147.668
Client system time: 101.973
Client user time:   150.234

# 8 core client and server
Before
QPS: 291.049 (36.3811/server core)
Latencies (50/90/95/99/99.9%-ile): 294552/685822/903554/1.48399e+06/1.50757e+06 us
Server system time: 49.0355
Server user time:   87.1783
Client system time: 60.1945
Client user time:   103.633

After
QPS: 334.119 (41.7649/server core)
Latencies (50/90/95/99/99.9%-ile): 279395/518849/706327/1.09273e+06/1.11629e+06 us
Server system time: 69.3136
Server user time:   102.549
Client system time: 80.9804
Client user time:   107.103

RELEASE NOTES:

  • transport: Avoid two buffer copies when writing data.

@arjan-bal arjan-bal added this to the 1.77 Release milestone Oct 22, 2025
@arjan-bal arjan-bal added the Type: Performance and Area: Transport labels Oct 22, 2025
codecov bot commented Oct 22, 2025

Codecov Report

❌ Patch coverage is 82.35294% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.07%. Comparing base (f448a97) to head (ca29c67).
⚠️ Report is 2 commits behind head on master.

Files with missing lines            Patch %   Lines
internal/transport/http_util.go     73.33%    5 Missing and 3 partials ⚠️
internal/transport/controlbuf.go    84.61%    0 Missing and 2 partials ⚠️
mem/buffer_slice.go                 92.00%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8667      +/-   ##
==========================================
+ Coverage   81.97%   82.07%   +0.09%     
==========================================
  Files         417      417              
  Lines       40788    40851      +63     
==========================================
+ Hits        33435    33527      +92     
+ Misses       5991     5945      -46     
- Partials     1362     1379      +17     
Files with missing lines            Coverage Δ
internal/transport/controlbuf.go    89.54% <84.61%> (-0.34%) ⬇️
mem/buffer_slice.go                 95.75% <92.00%> (-0.68%) ⬇️
internal/transport/http_util.go     92.19% <73.33%> (-2.25%) ⬇️

... and 31 files with indirect coverage changes


}

// Reader exposes a BufferSlice's data as an io.Reader, allowing it to interface
// with other parts systems. It also provides an additional convenience method

While you are here, maybe you can remove mentions of this one additional convenience method. Looks like there are going to be multiple convenience methods going forward.

Comment on lines +146 to +147
// Next appends results to the provided res slice and returns the updated
// slice.

Nit: This line about Next seems out of place. Maybe delete?

Comment on lines +143 to +144
// Peek returns up to the next n bytes from the reader's current position as
// a slice of byte slices.

Does it make sense to return the actual number of bytes returned in res? Most APIs to read data do that.

// slice.
// The returned subslices are views into the underlying buffers and are only
// valid until the reader is advanced past the corresponding buffer.
Peek(n int, res [][]byte) [][]byte

It is not clear to me as to why the result is both a parameter and a return value in this method.

}
res := c.Peek(1, nil)
if len(res) != 0 {
t.Errorf("Peek() got %v, want empty", res)

Nit: Peek() got %v slices, want empty?

tests := []struct {
name string
buffers [][]byte
operations func(t *testing.T, c mem.Reader)

It does feel like many of the checks in many of the operations should result in calls to t.Fatal instead of t.Error since it doesn't make sense to carry on (for example, when you start with an empty buffer and Remaining doesn't return 0).

And if we start calling t.Fatal at different places, we also probably need to defer the call to the Close method of the reader in the main test logic.

}{
{
name: "empty",
operations: func(t *testing.T, c mem.Reader) {

Curious: why is the reader called c here and in the main test logic?

fr *http2.Framer
writer *bufWriter
fr *http2.Framer
headerBuf []byte // cached slice for framer headers to reduce heap allocs.

Can this be an array instead? [9]byte?

Comment on lines +440 to +449
f.headerBuf = append(f.headerBuf[:0],
byte(length>>16),
byte(length>>8),
byte(length),
byte(http2.FrameData),
byte(flags),
byte(streamID>>24),
byte(streamID>>16),
byte(streamID>>8),
byte(streamID))

And if we make the headerBuf an array, does it make sense to directly set entries in the array (using indices) instead of using append?

Comment on lines +457 to +460
_, err := f.writer.Write(d)
if err != nil {
return err
}

Nit: assignment and conditional on the same line?

@easwars easwars assigned arjan-bal and unassigned easwars and dfawley Oct 28, 2025