Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Failing gRPC xDS Fallback - Multiple xDS servers #8057

Closed
axkum10 opened this issue Jan 30, 2025 · 8 comments
Closed

Failing gRPC xDS Fallback - Multiple xDS servers #8057

axkum10 opened this issue Jan 30, 2025 · 8 comments
Assignees
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug

Comments

@axkum10
Copy link

axkum10 commented Jan 30, 2025

I have setup gRPC xDS server with client and server working fine. I am trying to introduce multiple gRPC xDS server for high availability.
As per A71 xDS fallback introducing multiple xDS server detail in client bootstrap should have enabled client to look pick a good xDS server, which is not happening. Anyone tried multiple gRPC xDS server and tested client fallback.

Here is client bootstrap file:

{
  "xds_servers": [
    {
      "server_uri": "xds.domain.com:8001",
      "channel_creds": [
        {
          "type": "insecure"
        }
      ],
      "server_features": ["xds_v3"]
    },
    {
      "server_uri": "xds.domain.com:8003",
      "channel_creds": [
        {
          "type": "insecure"
        }
      ],
      "server_features": ["xds_v3"]
    }
  ],
  "node": {
    "id": "test123",
    "metadata": {
      "md": "123456789012"
    }
  }
}

Setup: I have just one xDS server running on port 8003.
Steps:

Started xDS server using port 8003 - good
exported client bootstarp file - good
Starting client - error (client startup is not looking at 2nd xDS server detail) Error:

2025/01/29 09:19:14 could not greet: rpc error: code = Unavailable desc = xds: error received from xDS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
exit status 1

I have not touched "GRPC_EXPERIMENTAL_XDS_FALLBACK", as the comment says, failover is enabled by default.

Software:

go 1.23.0
grpc v1.67.1
host machine: x86_64 GNU/Linux
@purnesh42H
Copy link
Contributor

@axkum10 the grpc version you have mentioned in the issue is 1.67.1 but the fallback support was added in 1.69.0 and it was enabled by default in 1.70.0. See the release notes https://github.com/grpc/grpc-go/releases.

Could you try upgrading your grpc library version? The bootstrap config looks correct to me. You can also refer to these internal e2e tests to get more idea https://github.com/grpc/grpc-go/blob/master/xds/internal/xdsclient/tests/fallback_test.go.

@easwars for providing more guidance here.

@purnesh42H purnesh42H added Type: Question Area: xDS Includes everything xDS related, including LB policies used with xDS. Status: Requires Reporter Clarification labels Feb 2, 2025
@axkum10
Copy link
Author

axkum10 commented Feb 3, 2025

Thanks @purnesh42H, upgraded version, still failing to connect to 2nd xDS server. I do not see client making a call with new port of xDS server. It does pick, however I do not see a connect call with different port going out in debug.

Test/Scenario: Only one xDS server running and mentioned as 2nd server in the client bootstrap file. Trying to find, if gRPC client can connect to 2nd xDS server if 1st is down.

Setup refresh:
2 xDS servers mentioned in bootstrap with port 8001 & 8003. xDS at port 8001 is down (not running) and xDS at port 8003 is active/running.
Client with bootstrap file with 2 xDS server mentioned (refer original/1st comment/question above)

Client works fine if I put one xDS server in the bootstrap file (tried both 8001 & 8003)

Software version:

go 1.23.0
google.golang.org/grpc v1.70.0

Please refer below 2 debug logs with fallback flag false & true:

  1. DEBUG, when GRPC_EXPERIMENTAL_XDS_FALLBACK = false
    The Client picks just one xDS server.
[app]$ export GRPC_EXPERIMENTAL_XDS_FALLBACK=false
[app]$ go run src/client.go --host xds:///upstream.host.foo
2025/02/03 09:36:04 INFO: [core] original dial target is: "xds:///upstream.host.foo"
2025/02/03 09:36:04 INFO: [core] [Channel #1]Channel created
2025/02/03 09:36:04 INFO: [core] [Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"xds", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/upstream.host.foo", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2025/02/03 09:36:04 INFO: [core] [Channel #1]Channel authority set to "upstream.host.foo"
2025/02/03 09:36:04 INFO: [xds] [xds-resolver 0xc00027f7a0] Creating resolver for target: xds:///upstream.host.foo
2025/02/03 09:36:04 INFO: [xds] [xds-bootstrap] Using bootstrap file with name "/home/dc-user/grpc/example/employee/go/dynamic/xds/multiple_xDS_client_xds_bootstrap.json" from GRPC_XDS_BOOTSTRAP environment variable
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] Created client with name "xds:///upstream.host.foo" and bootstrap configuration:
 {
  "xds_servers": [
   {
    "server_uri": "xds.domain.com:8001",
    "channel_creds": [
     {
      "type": "insecure"
     }
    ],
    "server_features": [
     "xds_v3"
    ]
   }
  ],
  "client_default_listener_resource_name_template": "%s",
  "node": {
   "id": "test_123",
   "locality": {},
   "metadata": {
    "R_GCP_PROJECT_NUMBER": "756692"
   }
  }
 }
2025/02/03 09:36:04 INFO: [xds] xDS node ID: test_service_echo
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] [authority ""] New watch for type "ListenerResource", resource name "upstream.host.foo"
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] Received request for a reference to an xdsChannel for server config "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] Creating a new xdsChannel for server config "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:04 INFO: [core] original dial target is: "xds.domain.com:8001"
2025/02/03 09:36:04 INFO: [core] [Channel #2]Channel created
2025/02/03 09:36:04 INFO: [core] [Channel #2]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/xds.domain.com:8001", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2025/02/03 09:36:04 INFO: [core] [Channel #2]Channel authority set to "xds.domain.com:8001"
2025/02/03 09:36:04 INFO: [core] [Channel #2]Channel exiting idle mode
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] Adding authority "" to the set of interested authorities for channel [0xc0002f2180]
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] [authority ""] First watch for type "ListenerResource", resource name "upstream.host.foo"
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] [xds-channel 0xc0002f2180] [ads-stream 0xc00021ae80] Subscribing to resource "upstream.host.foo" of type "ListenerResource"
2025/02/03 09:36:04 INFO: [core] [Channel #1]Channel exiting idle mode
2025/02/03 09:36:04 INFO: [core] [Channel #2]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "127.0.0.1:8001",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "127.0.0.1:8001",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2025/02/03 09:36:04 INFO: [core] [Channel #2]Channel switches to new LB policy "pick_first"
2025/02/03 09:36:04 INFO: [pick-first-lb] [pick-first-lb 0xc000312510] Received new config {
  "shuffleAddressList": false
}, resolver state {
  "Addresses": [
    {
      "Addr": "127.0.0.1:8001",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "127.0.0.1:8001",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
}
2025/02/03 09:36:04 INFO: [core] [Channel #2 SubChannel #3]Subchannel created
2025/02/03 09:36:04 INFO: [core] [Channel #2]Channel Connectivity change to CONNECTING
2025/02/03 09:36:04 INFO: [core] [Channel #2 SubChannel #3]Subchannel Connectivity change to CONNECTING
2025/02/03 09:36:04 INFO: [core] [Channel #2 SubChannel #3]Subchannel picks a new address "127.0.0.1:8001" to connect
2025/02/03 09:36:04 INFO: [pick-first-lb] [pick-first-lb 0xc000312510] Received SubConn state update: 0xc00031c190, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025/02/03 09:36:04 INFO: [core] Creating new client transport to "{Addr: \"127.0.0.1:8001\", ServerName: \"xds.domain.com:8001\", }": connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 WARNING: [core] [Channel #2 SubChannel #3]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:8001", ServerName: "xds.domain.com:8001", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 INFO: [core] [Channel #2 SubChannel #3]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 INFO: [pick-first-lb] [pick-first-lb 0xc000312510] Received SubConn state update: 0xc00031c190, {ConnectivityState:TRANSIENT_FAILURE ConnectionError:connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused" connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025/02/03 09:36:04 INFO: [core] [Channel #2]Channel Connectivity change to TRANSIENT_FAILURE
2025/02/03 09:36:04 WARNING: [xds] [xds-client 0xc00029f260] [xds-channel 0xc0002f2180] [ads-stream 0xc00021ae80] Failed to create a new ADS streaming RPC: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] [authority ""] Connection to server xds.domain.com:8001-insecure-xds_v3 failed with error: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] [authority ""] Attempting to initiate fallback after failure from server "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:04 INFO: [xds] [xds-client 0xc00029f260] [authority ""] No more servers to fallback to
2025/02/03 09:36:04 INFO: [xds] [xds-resolver 0xc00027f7a0] Received error for Listener resource "upstream.host.foo": xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 WARNING: [core] [Channel #1]ccResolverWrapper: reporting error to cc: xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:04 could not greet: rpc error: code = Unavailable desc = xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
exit status 1

  1. DEBUG, when GRPC_EXPERIMENTAL_XDS_FALLBACK = true
    Client looks at both xDS server detail from bootstrap. Last 9-10 lines (9th from bottom) mentions "xds.domain.com:8003" that's all.
[app]$ export GRPC_EXPERIMENTAL_XDS_FALLBACK=true
[app]$ go run src/client.go --host xds:///upstream.host.foo
2025/02/03 09:36:27 CHECK FALL BACK
2025/02/03 09:36:27 FALLBACK:true
2025/02/03 09:36:27 INFO: [core] original dial target is: "xds:///upstream.host.foo"
2025/02/03 09:36:27 INFO: [core] [Channel #1]Channel created
2025/02/03 09:36:27 INFO: [core] [Channel #1]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"xds", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/upstream.host.foo", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2025/02/03 09:36:27 INFO: [core] [Channel #1]Channel authority set to "upstream.host.foo"
2025/02/03 09:36:27 INFO: [xds] [xds-resolver 0xc00033a360] Creating resolver for target: xds:///upstream.host.foo
2025/02/03 09:36:27 INFO: [xds] [xds-bootstrap] Using bootstrap file with name "/home/dc-user/grpc/example/employee/go/dynamic/xds/multiple_xDS_client_xds_bootstrap.json" from GRPC_XDS_BOOTSTRAP environment variable
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Created client with name "xds:///upstream.host.foo" and bootstrap configuration:
 {
  "xds_servers": [
   {
    "server_uri": "xds.domain.com:8001",
    "channel_creds": [
     {
      "type": "insecure"
     }
    ],
    "server_features": [
     "xds_v3"
    ]
   },
   {
    "server_uri": "xds.domain.com:8003",
    "channel_creds": [
     {
      "type": "insecure"
     }
    ],
    "server_features": [
     "xds_v3"
    ]
   }
  ],
  "client_default_listener_resource_name_template": "%s",
  "node": {
   "id": "test_123",
   "locality": {},
   "metadata": {
    "R_NUMBER": "756692"
   }
  }
 }
2025/02/03 09:36:27 INFO: [xds] xDS node ID: test_service_echo
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] New watch for type "ListenerResource", resource name "upstream.host.foo"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Received request for a reference to an xdsChannel for server config "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Creating a new xdsChannel for server config "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [core] original dial target is: "xds.domain.com:8001"
2025/02/03 09:36:27 INFO: [core] [Channel #2]Channel created
2025/02/03 09:36:27 INFO: [core] [Channel #2]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/xds.domain.com:8001", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2025/02/03 09:36:27 INFO: [core] [Channel #2]Channel authority set to "xds.domain.com:8001"
2025/02/03 09:36:27 INFO: [core] [Channel #2]Channel exiting idle mode
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Adding authority "" to the set of interested authorities for channel [0xc000362120]
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] First watch for type "ListenerResource", resource name "upstream.host.foo"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [xds-channel 0xc000362120] [ads-stream 0xc000294e80] Subscribing to resource "upstream.host.foo" of type "ListenerResource"
2025/02/03 09:36:27 INFO: [core] [Channel #1]Channel exiting idle mode
2025/02/03 09:36:27 INFO: [core] [Channel #2]Resolver state updated: {
  "Addresses": [
    {
      "Addr": "127.0.0.1:8001",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "127.0.0.1:8001",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2025/02/03 09:36:27 INFO: [core] [Channel #2]Channel switches to new LB policy "pick_first"
2025/02/03 09:36:27 INFO: [pick-first-lb] [pick-first-lb 0xc0003b0330] Received new config {
  "shuffleAddressList": false
}, resolver state {
  "Addresses": [
    {
      "Addr": "127.0.0.1:8001",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Metadata": null
    }
  ],
  "Endpoints": [
    {
      "Addresses": [
        {
          "Addr": "127.0.0.1:8001",
          "ServerName": "",
          "Attributes": null,
          "BalancerAttributes": null,
          "Metadata": null
        }
      ],
      "Attributes": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
}
2025/02/03 09:36:27 INFO: [core] [Channel #2 SubChannel #3]Subchannel created
2025/02/03 09:36:27 INFO: [core] [Channel #2]Channel Connectivity change to CONNECTING
2025/02/03 09:36:27 INFO: [core] [Channel #2 SubChannel #3]Subchannel Connectivity change to CONNECTING
2025/02/03 09:36:27 INFO: [core] [Channel #2 SubChannel #3]Subchannel picks a new address "127.0.0.1:8001" to connect
2025/02/03 09:36:27 INFO: [pick-first-lb] [pick-first-lb 0xc0003b0330] Received SubConn state update: 0xc0003904b0, {ConnectivityState:CONNECTING ConnectionError:<nil> connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025/02/03 09:36:27 INFO: [core] Creating new client transport to "{Addr: \"127.0.0.1:8001\", ServerName: \"xds.domain.com:8001\", }": connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 WARNING: [core] [Channel #2 SubChannel #3]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:8001", ServerName: "xds.domain.com:8001", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 INFO: [core] [Channel #2 SubChannel #3]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 INFO: [pick-first-lb] [pick-first-lb 0xc0003b0330] Received SubConn state update: 0xc0003904b0, {ConnectivityState:TRANSIENT_FAILURE ConnectionError:connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused" connectedAddress:{Addr: ServerName: Attributes:<nil> BalancerAttributes:<nil> Metadata:<nil>}}
2025/02/03 09:36:27 INFO: [core] [Channel #2]Channel Connectivity change to TRANSIENT_FAILURE
2025/02/03 09:36:27 WARNING: [xds] [xds-client 0xc000309260] [xds-channel 0xc000362120] [ads-stream 0xc000294e80] Failed to create a new ADS streaming RPC: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Connection to server xds.domain.com:8001-insecure-xds_v3 failed with error: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Attempting to initiate fallback after failure from server "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Initiating fallback to server xds.domain.com:8003-insecure-xds_v3
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Received request for a reference to an xdsChannel for server config "xds.domain.com:8003-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Creating a new xdsChannel for server config "xds.domain.com:8003-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [core] original dial target is: "xds.domain.com:8003"
2025/02/03 09:36:27 INFO: [core] [Channel #4]Channel created
2025/02/03 09:36:27 INFO: [core] [Channel #4]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/xds.domain.com:8003", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2025/02/03 09:36:27 INFO: [core] [Channel #4]Channel authority set to "xds.domain.com:8003"
2025/02/03 09:36:27 INFO: [xds] [xds-resolver 0xc00033a360] Received error for Listener resource "upstream.host.foo": xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 WARNING: [core] [Channel #1]ccResolverWrapper: reporting error to cc: xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 INFO: [core] [Channel #4]Channel exiting idle mode
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Adding authority "" to the set of interested authorities for channel [0xc0003631a0]
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Resubscribing to resource of type "ListenerResource" and name "upstream.host.foo"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [xds-channel 0xc0003631a0] [ads-stream 0xc000295380] Subscribing to resource "upstream.host.foo" of type "ListenerResource"
2025/02/03 09:36:27 could not greet: rpc error: code = Unavailable desc = xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
exit status 1

This one line (from above debug), shows "8001" port instead of "8003":

2025/02/03 09:36:27 INFO: [xds] [xds-resolver 0xc00033a360] Received error for Listener resource "upstream.host.foo": xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"

Please let me know if I am missing something or more information is required.

@arjan-bal
Copy link
Contributor

@axkum10 from the logs, it seems that grpc did initiate the xDS server fallback process. The fallback begins when the channel reports TRANSIENT_FAILURE. I suspect you're not using grpc.WaitForReady so the RPC is failing. Could you use the WaitForReady call option and see if things work?

_, err := c.SayHello(ctx, &pb.HelloRequest{Name: *name}, grpc.WaitForReady(true))

@easwars
Copy link
Contributor

easwars commented Feb 3, 2025

2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Connection to server xds.domain.com:8001-insecure-xds_v3 failed with error: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"

The above line tells me that the connection to the first server mentioned in the bootstrap has filed.

2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Attempting to initiate fallback after failure from server "xds.domain.com:8001-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Initiating fallback to server xds.domain.com:8003-insecure-xds_v3
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Received request for a reference to an xdsChannel for server config "xds.domain.com:8003-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Creating a new xdsChannel for server config "xds.domain.com:8003-insecure-xds_v3"
2025/02/03 09:36:27 INFO: [core] original dial target is: "xds.domain.com:8003"
2025/02/03 09:36:27 INFO: [core] [Channel #4]Channel created
2025/02/03 09:36:27 INFO: [core] [Channel #4]parsed dial target is: resolver.Target{URL:url.URL{Scheme:"dns", Opaque:"", User:(*url.Userinfo)(nil), Host:"", Path:"/xds.domain.com:8003", RawPath:"", OmitHost:false, ForceQuery:false, RawQuery:"", Fragment:"", RawFragment:""}}
2025/02/03 09:36:27 INFO: [core] [Channel #4]Channel authority set to "xds.domain.com:8003"

The above lines indicate that the xDS client has attempted fallback and has created a channel to the second server mentioned in the bootsttrap.

2025/02/03 09:36:27 INFO: [xds] [xds-resolver 0xc00033a360] Received error for Listener resource "upstream.host.foo": xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"
2025/02/03 09:36:27 WARNING: [core] [Channel #1]ccResolverWrapper: reporting error to cc: xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"

The above lines indicate that the xDS client is still trying to see if it can get a successful connection to the primary server, which is by design. The xDS client will always try to use the highest priority server that is available and therefore will keep retrying the connection to a higher priority server, if one exists.

2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] Adding authority "" to the set of interested authorities for channel [0xc0003631a0]
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [authority ""] Resubscribing to resource of type "ListenerResource" and name "upstream.host.foo"
2025/02/03 09:36:27 INFO: [xds] [xds-client 0xc000309260] [xds-channel 0xc0003631a0] [ads-stream 0xc000295380] Subscribing to resource "upstream.host.foo" of type "ListenerResource"

These lines indicate that xDS client is attempting to subscribe to the same resources from the secondary server.

All of the above are WAI.

I don't see the grpc channel to the secondary server Channel #4 in the your logs, ever transitioning to CONNECTING though. It does seem to exit IDLE. The timestamp on all your logs are from the same second.

Also, @arjan-bal
I don't think wait_for_ready on the client RPC should affect anything here. Since the xDS client has not yet successfully connected and received the resources that it is requesting, the xDS resolver should not have sent any state update on the grpc channel. Therefore client RPCs should still be queued. But if we are seeing that the xDS resolver propagates the error from the connection to the primary server, to the channel and client RPCs are failing, then that indicates a bug either in the xDS client fallback code or in the xDS resolver.

@axkum10 Could you please confirm if wait_for_ready is helping. Thanks

@axkum10
Copy link
Author

axkum10 commented Feb 4, 2025

Allow me to divide conversation in 2 parts, as it may help track an continue discussion on both.

A. @easwars your comment:

I don't see the grpc channel to the secondary server Channel #4 in the your logs, ever transitioning to CONNECTING though. It does seem to exit IDLE. The timestamp on all your logs are from the same second.

I do not see that as well (debug level is enabled) and program exits.
My aim is to have same resource names in both xDS server, so the client can fetch detail from one or the other. That is why you see xDS client asking for same resource. So I have 1st xDS Management (Mgmt) server with resource names set to run on port 8001, 2nd Mgmt server with same (as in 1st Mgmt) resource name running on port 8003. xDS server app (same code, same name) running on port 50001 & 50003. xDS client bootstrap file has both xDS Mgmt server 8001 (not running) & 8003 (running).
To provide HA, expectation is xDS client will go to xDS Mgmt running on port 8003 to get same set of resource as it would have done if xDS Mgmt at port 8001 would have been running.
Please let me know, if the expectation is wrong.

B. Suggestion from @arjan-bal to use grpc.WaitForReady moved the initialization. xDS client was able to get listerner & route information, then terminated after sending cluster info.
This seems wrong, as @easwars correctly pointed out. The 1st xDS Management (Mgmt) server mentioned in bootstrap file is not running, so this should have allowed xDS client to jump to 2nd xDS Mgmt server. Mentioning "WaitForReady" may not be the right choice, if expectation is to fail fast with multiple Mgmt mention in bootstrap file.

---- What I have tested so far ----

  1. If I just configure one xDS Mgmt server in xDS client bootstrap (xDS Mgmt on 8001 or 8003) works. Atleast this shows both xDS Mgmt server 8001 & 8003 are good.
  2. Now 2 xDS Mgmt server mentioned in xDS client bootstrap file. If top entry of xDS Mgmt server (either 8001 or 8003 port) is running, the xDS client is connecting fine.

Am I missing something.

@arjan-bal
Copy link
Contributor

@easwars the following log line lead me to believe the xDS resolver reported an error to the channel before producing a valid ResolverState.

2025/02/03 09:36:27 WARNING: [core] [Channel #1]ccResolverWrapper: reporting error to cc: xds: error received from xDS stream: failed to create an ADS stream: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:8001: connect: connection refused"

IIUC, this would put the channel in TRANSIENT_FAILURE due to the following code in gracefulswitch:

func (gsb *Balancer) ResolverError(err error) {
// The resolver data is only relevant to the most recent LB Policy.
balToUpdate := gsb.latestBalancer()
if balToUpdate == nil {
gsb.cc.UpdateState(balancer.State{
ConnectivityState: connectivity.TransientFailure,
Picker: base.NewErrPicker(err),
})
return
}
// Perform this call without gsb.mu to prevent deadlocks if the child calls
// back into the channel. The latest balancer can never be closed during a
// call from the channel, even without gsb.mu held.
balToUpdate.ResolverError(err)
}

Assigning to you since you've mentioned this may be a bug.

@easwars
Copy link
Contributor

easwars commented Feb 7, 2025

Just merged an update to the gRFC that describes the xDS fallback implementation in grpc. See: grpc/proposal#473

Will send a PR to fix the issue shortly. Thanks for filing this issue @axkum10.

@easwars
Copy link
Contributor

easwars commented Feb 11, 2025

Fixed by #8075

@easwars easwars closed this as completed Feb 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: xDS Includes everything xDS related, including LB policies used with xDS. Type: Bug
Projects
None yet
Development

No branches or pull requests

4 participants