Conversation

Contributor

@karknu karknu commented Dec 11, 2025

Description

This is a series of changes that make the node more robust.

Checklist

Quality

  • Commit sequence makes sense and has useful messages, see ref.
  • New tests are added and existing tests are updated.
  • Self-reviewed the PR.

Maintenance

  • Linked an issue or added the PR to the current sprint of the ouroboros-network project.
  • Added labels.
  • Updated changelog files.
  • The documentation has been properly updated, see ref.

@github-project-automation github-project-automation bot moved this to In Progress in Ouroboros Network Dec 11, 2025
@karknu karknu marked this pull request as ready for review December 11, 2025 12:15
@karknu karknu requested a review from a team as a code owner December 11, 2025 12:15
@karknu karknu added the block-fetch, outbound-governor and chain-sync client labels Dec 11, 2025
Collaborator

@coot coot left a comment

LGTM, just some minor suggestions.

-> Set peeraddr
-- ^ peers with failure
-> (peeraddr -> Bool)
-- ^ do we have to remember the peer?
Collaborator

Suggested change
-- ^ do we have to remember the peer?
-- ^ do we have to remember the fail count for a peer?

could you also add which peers are remembered, e.g. local roots & extra root peers aka bootstrap peers.

unrelated rant

This is another reason why I think bootstrap peers should actually be part of ouroboros-network rather than cardano-diffusion: it's awkward for us to do this part in cardano-diffusion, which would be the proper way in the current split between ouroboros-network and cardano-diffusion.

Contributor Author

I am not committing the suggestion. The purpose of the function is to control whether we can forget the peer or not, not simply whether we need to track its fail count.
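
A minimal sketch of such a predicate, with hypothetical names (mustRememberPeer and its arguments are illustrative assumptions, not the actual ouroboros-network API):

import           Data.Set (Set)
import qualified Data.Set as Set

-- Illustrative only: decide whether a peer may ever be forgotten,
-- which is more than just tracking its fail count.  Local roots and
-- extra (bootstrap) root peers are always remembered.
mustRememberPeer
  :: Ord peeraddr
  => Set peeraddr   -- ^ local root peers
  -> Set peeraddr   -- ^ extra root peers, aka bootstrap peers
  -> peeraddr
  -> Bool
mustRememberPeer localRoots bootstrapPeers peer =
  peer `Set.member` localRoots || peer `Set.member` bootstrapPeers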

Comment on lines 442 to 445
-> (Set peeraddr -> a)
-- ^ callback for forgotten peers
-> KnownPeers peeraddr
-> (KnownPeers peeraddr, a)
Collaborator

It seems the callback is only used on the returned value, so we can leave that to the caller and just return the forgotten peers.

Suggested change
-> (Set peeraddr -> a)
-- ^ callback for forgotten peers
-> KnownPeers peeraddr
-> (KnownPeers peeraddr, a)
-> KnownPeers peeraddr
-> (KnownPeers peeraddr, Set peeraddr)

Contributor Author

👍 Great suggestion, will make that change
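
A minimal self-contained sketch of the agreed direction, using a toy stand-in for KnownPeers (the real type and function differ):

import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map
import           Data.Set (Set)

-- Toy stand-in: peer address mapped to its fail count.
newtype KnownPeers peeraddr = KnownPeers (Map peeraddr Int)

-- Instead of taking a `Set peeraddr -> a` callback, return the
-- forgotten peers and leave their handling to the caller.
forgetFailedPeers :: Ord peeraddr
                  => Int                 -- ^ maximum tolerated fail count
                  -> KnownPeers peeraddr
                  -> (KnownPeers peeraddr, Set peeraddr)
forgetFailedPeers maxFails (KnownPeers m) =
  let (keep, forget) = Map.partition (<= maxFails) m
  in  (KnownPeers keep, Map.keysSet forget)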

} =
assert (all (`Map.member` allPeers) (Map.keysSet times)) $
let knownPeers' = knownPeers {
reportFailures :: Ord peeraddr
Collaborator

Maybe a more explicit name would be setConnectTimesAndFailCount

Contributor Author

@karknu karknu Dec 11, 2025

The real description would be setConnectionTimesAndFailCountAndPossiblyForgetPeers ;)

I prefer reportFailures to that, but I've added a comment to the function to make it clear what it is and what it does.

Commits

Enforce a maximum limit on the number of times we will attempt to promote a peer to warm. Local root peers, bootstrap relays and manually configured public root peers are exempt from this limit.
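
A sketch of how such a limit could be enforced; the names and the representation of the exempt set are assumptions, not the PR's actual code:

import           Data.Set (Set)
import qualified Data.Set as Set

-- Illustrative only: allow another warm-promotion attempt if the peer
-- is exempt (local root, bootstrap relay or manually configured public
-- root) or has not yet reached the attempt limit.
mayPromoteToWarm
  :: Ord peeraddr
  => Int            -- ^ maximum promotion attempts
  -> Set peeraddr   -- ^ exempt peers
  -> Int            -- ^ failed attempts so far for this peer
  -> peeraddr
  -> Bool
mayPromoteToWarm maxAttempts exempt attempts peer =
  peer `Set.member` exempt || attempts < maxAttempts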

The clearing of the reconnection counter is delayed until a connection has managed to stay active for a specific time (currently 120s).
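
A sketch of the delayed reset, assuming a simplified model where we track when the connection became active (gracePeriod and resetFailCountIfStable are hypothetical names):

import Data.Time.Clock (NominalDiffTime, UTCTime, diffUTCTime)

-- The fail count is only reset once the connection has stayed active
-- for the grace period (120s in this PR); a connection that dies
-- earlier keeps its accumulated count.
gracePeriod :: NominalDiffTime
gracePeriod = 120

resetFailCountIfStable
  :: UTCTime  -- ^ when the connection became active
  -> UTCTime  -- ^ current time
  -> Int      -- ^ current fail count
  -> Int
resetFailCountIfStable activeSince now failCount
  | now `diffUTCTime` activeSince >= gracePeriod = 0
  | otherwise                                    = failCount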

In case of an error, use a shorter timeout when waiting for chainsync to exit.

Exclude shutdown peers from active peer calculations. It can take a while for peers to exit, because blockfetch has to sync with chainsync as it exits, but we shouldn't count those peers as active or preferred anymore.

With p2p peer selection and the keepalive protocol we are not that dependent on the chainsync timeout for detecting bad upstream peers.

By bumping the timeout from between 135s and 269s to between 601s and 911s we change the false positive rate from something that happens a few times per epoch to something that happens less than once in a decade.
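
The "between X and Y" phrasing suggests the timeout is drawn at random from a range; a sketch under that assumption (chainSyncIdleTimeout is a hypothetical name):

import System.Random (randomRIO)

-- Draw the chainsync timeout (in seconds) uniformly from the new,
-- much wider range.
chainSyncIdleTimeout :: IO Double
chainSyncIdleTimeout = randomRIO (601, 911)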