Clarification on gRPC-Java KeepAliveManager shutdown trigger (PING timeout vs general read inactivity) #12487

@Trajanv

Description

We’re investigating a recurring TCP RST observed ~2.5 seconds after a gRPC client sends application data (PSH, ACK) on a bidirectional stream, and we’re trying to confirm whether this behavior is expected or a side-effect of the keepalive configuration.

Environment

gRPC-Java version: 1.64.0
Transport: Netty
Channel configured with:
.keepAliveTime(3, TimeUnit.MINUTES)
.keepAliveTimeout(2, TimeUnit.SECONDS)
.keepAliveWithoutCalls(true)
Server side allows keepalive and does not appear to terminate connections.

Observed behavior

The client sends an HTTP/2 DATA frame (visible as TCP PSH, ACK).
No further packets are received from the server.
Approximately 2.5 seconds later, the client issues a TCP RST.
This occurs consistently when the server does not reply or acknowledge within that interval.

Relevant source: https://github.com/grpc/grpc-java/blob/master/core/src/main/java/io/grpc/internal/KeepAliveManager.java

However, we do not see a ping explicitly sent at the time the RST occurs.
It appears that a timeout due to lack of any inbound data (not necessarily a PING-ACK) may trigger shutdown().

Questions

  1. Does KeepAliveManager consider only unacknowledged PINGs when starting the keepalive timeout, or any period of read inactivity (including outstanding DATA frames)?
  2. If no PING was sent yet (because keepAliveTime >> 2 s), can the timeout still trigger a shutdown purely due to read inactivity?
  3. Could the RST behavior stem from the Netty transport closing the channel immediately when shutdown() fires (e.g., via Channel.close() with SO_LINGER=0)?
  4. Are there known differences between gRPC-Java and gRPC-C/C++ regarding this shutdown trigger?
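
To make questions 1 and 2 concrete, here is a toy model of the two interpretations we are trying to tell apart. It is not taken from KeepAliveManager; the class and method names are hypothetical, and the only inputs are our own keepAliveTime and keepAliveTimeout values. It computes the earliest point after the last inbound byte at which the client would close the connection under each interpretation.

```java
import java.util.concurrent.TimeUnit;

/** Toy model only; hypothetical names, not gRPC-Java code. */
public final class KeepAliveModel {
    private KeepAliveModel() {}

    /**
     * Interpretation A (assumed): the timeout runs only while a PING is
     * outstanding. A PING is sent keepAliveTime after the last read, so the
     * earliest close is keepAliveTime + keepAliveTimeout after that read.
     */
    public static long pingOnlyCloseNanos(long keepAliveTimeNanos, long keepAliveTimeoutNanos) {
        return keepAliveTimeNanos + keepAliveTimeoutNanos;
    }

    /**
     * Interpretation B (assumed): any read inactivity longer than
     * keepAliveTimeout closes the connection, PING or not.
     */
    public static long readInactivityCloseNanos(long keepAliveTimeoutNanos) {
        return keepAliveTimeoutNanos;
    }

    public static void main(String[] args) {
        long keepAliveTime = TimeUnit.MINUTES.toNanos(3);
        long keepAliveTimeout = TimeUnit.SECONDS.toNanos(2);

        long a = pingOnlyCloseNanos(keepAliveTime, keepAliveTimeout);
        long b = readInactivityCloseNanos(keepAliveTimeout);

        // prints "A (PING only): 182 s"
        System.out.println("A (PING only): " + TimeUnit.NANOSECONDS.toSeconds(a) + " s");
        // prints "B (read inactivity): 2 s"
        System.out.println("B (read inactivity): " + TimeUnit.NANOSECONDS.toSeconds(b) + " s");
    }
}
```

With our settings, interpretation A would not close the connection earlier than 182 s after the last inbound data, while interpretation B would close it after 2 s. The ~2.5 s we observe is much closer to B, which is why we are asking.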

Additional context

We’re analyzing this in the context of a long-lived bidirectional streaming RPC.
tcpdump shows the client’s last sent frame is application DATA, not a PING.

We suspect the combination of .keepAliveTimeout(2s) and .keepAliveTime(3min) may result in a “false positive” closure if the server doesn’t respond quickly enough after the last DATA frame.

channel =
    NettyChannelBuilder.forAddress(serverHost, serverPort)
        .keepAliveTime(180, TimeUnit.SECONDS)
        .keepAliveTimeout(2, TimeUnit.SECONDS)
        .keepAliveWithoutCalls(true)
        .build();

We have since changed keepAliveTimeout to the default (20 seconds), and that does seem to affect when the TCP retransmissions occur and when the RST is sent.

We’d appreciate clarification or a reference to where in the codebase this distinction (PING ACK vs generic read inactivity) is definitively made.

Thanks for your time and for maintaining gRPC-Java.
