Correctly handle lost MonitorEvents #3984


Open
wants to merge 11 commits into base: main

Conversation

TheBlueMatt
Collaborator

`MonitorEvent`s aren't delivered to the `ChannelManager` in a
durable fashion - if the `ChannelManager` fetches the pending
`MonitorEvent`s, the `ChannelMonitor` then gets persisted (e.g. due
to a block update), and the node crashes before the
`ChannelManager` is persisted again, the `MonitorEvent` and its
effects on the `ChannelManager` will be lost. This isn't likely in
a sync persist environment, but in an async one it could be an
issue.

Note that this is only an issue for closed channels -
`MonitorEvent`s only inform the `ChannelManager` that a channel is
closed (which the `ChannelManager` will learn on startup or when it
next tries to advance the channel state), that
`ChannelMonitorUpdate` writes completed (which the `ChannelManager`
will detect on startup), or that HTLCs resolved on-chain post
closure. Of the three, only the last is problematic to lose prior
to a reload.
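As a rough illustration of the race, here is a minimal sketch using a hypothetical, simplified queue rather than LDK's actual types:

```rust
// Illustrative sketch only, not LDK's API: a simplified event queue showing why
// handing events to a consumer before that consumer is persisted can lose them.
struct MonitorSketch {
    pending_events: Vec<String>, // stand-in for pending `MonitorEvent`s
}

impl MonitorSketch {
    // Hands the events to the caller and forgets them locally. If the caller
    // crashes before persisting its own state, and this monitor is persisted in
    // the meantime (e.g. on a block update), the events are gone for good.
    fn release_pending_events(&mut self) -> Vec<String> {
        std::mem::take(&mut self.pending_events)
    }
}
```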

This PR fixes lost `MonitorEvent`s first in two commits which effectively replay them on startup (first the preimage case, then the timeout case), and then addresses a remaining gap:

In previous commits we ensured that HTLC resolutions which came to
`ChannelManager` via a `MonitorEvent` were replayed on startup if
the `MonitorEvent` was lost. However, in cases where the
`ChannelManager` was so stale that it didn't have the payment state
for an HTLC at all, we only re-add it in cases where
`ChannelMonitor::get_pending_or_resolved_outbound_htlcs` includes
it.

Constantly re-adding a payment state and then failing it would
generate lots of noise for users on startup (not to mention the
risk of confusing stale payment events for the latest state of a
payment when the `PaymentId` has been reused to retry a payment).
Thus, `get_pending_or_resolved_outbound_htlcs` does not include
state for HTLCs which were resolved on-chain with a preimage, or
HTLCs which were resolved on-chain with a timeout after
`ANTI_REORG_DELAY` confirmations.
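A minimal sketch of that filtering rule, using stand-in types rather than the real `get_pending_or_resolved_outbound_htlcs` signature:

```rust
// Illustrative only: stand-in types, not LDK's. HTLCs already resolved on-chain
// (a preimage claim, or a timeout buried past the reorg-safety depth) are not
// re-hydrated as pending payment state on startup.
const ANTI_REORG_DELAY: u32 = 6; // assumption: matches LDK's reorg-safety depth

enum OnchainHtlcState {
    Pending,
    ClaimedWithPreimage,
    TimedOut { confirmations: u32 },
}

fn include_in_rehydration(state: &OnchainHtlcState) -> bool {
    match state {
        OnchainHtlcState::Pending => true,
        OnchainHtlcState::ClaimedWithPreimage => false,
        OnchainHtlcState::TimedOut { confirmations } => *confirmations < ANTI_REORG_DELAY,
    }
}
```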

These criteria match the criteria for generating a `MonitorEvent`,
and work well under the assumption that `MonitorEvent`s are
reliably delivered. However, if they are not, and our
`ChannelManager` is lost or substantially old (or, in a future
where we do not persist the `ChannelManager` at all), we will never
see payment resolution events for such an HTLC.

Instead, we really want to tell our `ChannelMonitor`s when the
resolution of an HTLC is complete. Note that we don't particularly
care about non-payment HTLCs, as there is no re-hydration of state
to do there - `ChannelManager` load ignores forwarded HTLCs coming
back from `get_pending_or_resolved_outbound_htlcs` as there's
nothing to do - we always attempt to replay the success/failure and
figure out if it mattered based on whether there was still an HTLC
to claim/fail.

We address this by telling the `ChannelMonitor` when an HTLC's resolution has been fully processed, doing so via a new `ChannelMonitorUpdate`.

`test_dup_htlc_onchain_doesnt_fail_on_reload` made reference to
`ChainMonitor` persisting `ChannelMonitor`s on each new block,
which hasn't been the case in some time. Here we update the comment
and code to make explicit that this doesn't impact the test.

During testing, we check that a `ChannelMonitor` will round-trip
through serialization exactly. However, we recently added a fix
that changes a value in `PackageTemplate` on reload to address
issues with the field from 0.1. This can cause the round-trip tests
to fail as the field is modified during read.

We fix it here by simply exempting the field from the equality test
in the condition where it would be updated on read.

We also make the `ChannelMonitor` `PartialEq` trait implementation
non-public, as weird workarounds like this make clear that such a
comparison is a brittle API at best.
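A minimal sketch of that kind of field exemption, with hypothetical fields rather than the real `PackageTemplate`/`ChannelMonitor` ones:

```rust
// Illustrative only: hypothetical fields. A manual PartialEq that ignores the
// one field which is intentionally rewritten during deserialization, so a
// round-trip comparison doesn't spuriously fail.
struct PackageSketch {
    height_timer: u32, // stand-in for the value adjusted on read
    value_sats: u64,
}

impl PartialEq for PackageSketch {
    fn eq(&self, other: &Self) -> bool {
        // Deliberately skip `height_timer`; it may legitimately differ after a
        // reload even though the packages are otherwise identical.
        self.value_sats == other.value_sats
    }
}
```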
On `ChannelManager` reload we rebuild the pending outbound payments
list by looking for any missing payments in `ChannelMonitor`s.
However, in the same loop over `ChannelMonitor`s, we also re-claim
any pending payments which we see we have a payment preimage for.

In theory this can lead to a pending payment getting re-added and
re-claimed multiple times if it was sent as an MPP payment across
multiple channels. Worse, if it was claimed on one channel and
failed on another, in theory we could get both a `PaymentFailed`
and a `PaymentClaimed` event on startup for the same payment.
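One way to avoid that, sketched with hypothetical types rather than `ChannelManager`'s actual reload code: fold per-channel resolutions into a single per-payment outcome first, then act once per payment.

```rust
// Illustrative only. Collapse what each monitor reports into one outcome per
// payment before emitting events, so an MPP payment seen on several channels
// is claimed or failed exactly once.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq)]
enum Outcome {
    Claimed,
    Failed,
}

fn resolve_once(per_monitor: Vec<(u64, Outcome)>) -> HashMap<u64, Outcome> {
    let mut resolved: HashMap<u64, Outcome> = HashMap::new();
    for (payment_id, outcome) in per_monitor {
        // Seeing a preimage on any channel means the payment succeeded, even if
        // another channel saw the HTLC fail.
        resolved
            .entry(payment_id)
            .and_modify(|o| {
                if outcome == Outcome::Claimed {
                    *o = Outcome::Claimed
                }
            })
            .or_insert(outcome);
    }
    resolved
}
```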

When we restart and, during `ChannelManager` load, see a
`ChannelMonitor` for a closed channel, we scan it for preimages
that we passed to it and re-apply those to any pending or forwarded
payments. However, we didn't scan it for preimages it learned from
transactions on-chain. In cases where a `MonitorEvent` is lost,
this can lead to a lost preimage. Here we fix it by simply tracking
preimages we learned on-chain the same way we track preimages
picked up during normal channel operation.
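Roughly the idea, with stand-in types (the real tracking lives in `ChannelMonitor`):

```rust
// Illustrative only: stand-in types. Preimages extracted from on-chain claims
// are recorded alongside preimages handed to us during normal operation, so the
// startup scan described above sees both.
use std::collections::HashMap;

type PaymentHash = [u8; 32];
type PaymentPreimage = [u8; 32];

#[derive(Default)]
struct MonitorPreimages {
    known: HashMap<PaymentHash, PaymentPreimage>,
}

impl MonitorPreimages {
    // Called both when the off-chain state machine gives us a preimage and when
    // we pull one out of a counterparty's on-chain HTLC-success transaction.
    fn note_preimage(&mut self, hash: PaymentHash, preimage: PaymentPreimage) {
        self.known.insert(hash, preimage);
    }

    // Scanned during ChannelManager load to re-apply claims whose MonitorEvent
    // may have been lost.
    fn for_startup_replay(&self) -> &HashMap<PaymentHash, PaymentPreimage> {
        &self.known
    }
}
```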

In a previous commit we handled the case of claimed HTLCs by
replaying payment preimages on startup to avoid `MonitorEvent` loss
causing us to miss an HTLC claim. Here we handle the HTLC-failed
case similarly.

Unlike with HTLC claims via preimage, we don't already have replay
logic in `ChannelManager` startup, but it's easy enough to add.
Luckily, we already track when an HTLC reaches the
permanently-failed state in `ChannelMonitor` (i.e. the failing
transaction has `ANTI_REORG_DELAY` confirmations on-chain), so all
we need to do is add the ability to query for that and fail the
HTLCs on `ChannelManager` startup.
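A sketch of the query side, with hypothetical names:

```rust
// Illustrative only: hypothetical types. An HTLC whose failing transaction has
// at least ANTI_REORG_DELAY confirmations is reported as permanently failed so
// ChannelManager startup can fail it back / mark the payment failed.
const ANTI_REORG_DELAY: u32 = 6; // assumption: matches LDK's reorg-safety depth

struct OnchainHtlcSketch {
    payment_hash: [u8; 32],
    failing_tx_confirmations: u32,
}

fn permanently_failed_htlcs(htlcs: &[OnchainHtlcSketch]) -> Vec<&OnchainHtlcSketch> {
    htlcs
        .iter()
        .filter(|h| h.failing_tx_confirmations >= ANTI_REORG_DELAY)
        .collect()
}
```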

Here we take the first step towards that notification (telling the
`ChannelMonitor` when an HTLC's resolution has been fully
processed), adding a new `ChannelMonitorUpdateStep` for the
completion notification and tracking HTLCs which make it to the
`ChannelMonitor` in such updates in a new map.
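Roughly the shape of the new pieces, with simplified stand-ins for the real types:

```rust
// Illustrative only: simplified stand-ins, not the actual ChannelMonitorUpdateStep
// enum. The new variant records that the ChannelManager has fully processed an
// HTLC's resolution; the monitor remembers those HTLCs in a new map/set.
use std::collections::HashSet;

enum UpdateStepSketch {
    // ...existing variants elided...
    ReleasePaymentComplete { htlc_id: u64 },
}

#[derive(Default)]
struct MonitorSketch {
    // HTLC resolutions the ChannelManager has already turned into user events;
    // consulted on startup so they are not re-hydrated or replayed again.
    completed_resolutions: HashSet<u64>,
}

impl MonitorSketch {
    fn apply_update(&mut self, step: UpdateStepSketch) {
        match step {
            UpdateStepSketch::ReleasePaymentComplete { htlc_id } => {
                self.completed_resolutions.insert(htlc_id);
            }
        }
    }
}
```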

Here we prepare to generate the new
`ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, adding
a new `PaymentCompleteUpdate` struct to track the new update before
we generate the `ChannelMonitorUpdate`, and passing it through to
the right places in `ChannelManager`.

The only cases where we want to generate the new update are after a
`PaymentSent` or `PaymentFailed` event when the event was the
result of a `MonitorEvent` or the equivalent read during startup.
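As a rough stand-in for what such a struct might carry (field names hypothetical, not the real `PaymentCompleteUpdate`):

```rust
// Illustrative only: hypothetical fields. Just enough information, carried from
// the event-generation site, to later build the ReleasePaymentComplete monitor
// update against the right channel's ChannelMonitor.
struct PaymentCompleteUpdateSketch {
    counterparty_node_id: [u8; 33],
    channel_id: [u8; 32],
    htlc_id: u64,
}
```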

Here we begin generating the new
`ChannelMonitorUpdateStep::ReleasePaymentComplete` updates,
updating functional tests for the new `ChannelMonitorUpdate`s where
required.

Here we, finally, begin actually using the new
`ChannelMonitorUpdateStep::ReleasePaymentComplete` updates,
skipping re-hydration of pending payments once they have been fully
resolved through to a user `Event`.
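Sketching the resulting startup decision with hypothetical types:

```rust
// Illustrative only: hypothetical types. Resolutions the monitor has already
// marked complete are skipped entirely; the rest are replayed as before.
use std::collections::HashSet;

#[derive(Clone, Copy)]
struct ResolvedHtlc {
    htlc_id: u64,
    success: bool,
}

fn resolutions_to_replay(
    resolved: &[ResolvedHtlc],
    completed: &HashSet<u64>,
) -> Vec<ResolvedHtlc> {
    resolved
        .iter()
        .filter(|h| !completed.contains(&h.htlc_id))
        .copied()
        .collect()
}
```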
When a payment was sent and ultimately completed through an
on-chain HTLC claim which we discover during startup, we
deliberately break the payment tracking logic to keep it around
forever, declining to send a `PaymentPathSuccessful` event but
ensuring that we don't constantly replay the claim on every
startup.

However, we now have logic to complete a claim by marking it as
completed in a `ChannelMonitor`, ensuring information about the
claim is not replayed on every startup. Thus, we no longer need to
take the conservative stance and can correctly replay claims now,
generating `PaymentPathSuccessful` events and allowing the state to
be removed.
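A sketch of the final behavior, again with hypothetical types:

```rust
// Illustrative only: hypothetical types. Once a replayed on-chain claim has been
// durably marked complete in the monitor, we can emit per-path success events
// and drop the payment state instead of keeping it around forever.
enum EventSketch {
    PaymentPathSuccessful { path_id: u64 },
}

fn finish_replayed_claim(path_ids: Vec<u64>) -> Vec<EventSketch> {
    // The caller removes the payment from its pending set after emitting these.
    path_ids
        .into_iter()
        .map(|path_id| EventSketch::PaymentPathSuccessful { path_id })
        .collect()
}
```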

ldk-reviews-bot commented Aug 2, 2025

I've assigned @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.


codecov bot commented Aug 2, 2025

Codecov Report

❌ Patch coverage is 92.52782% with 47 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.94%. Comparing base (61e5819) to head (037a1d7).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lightning/src/ln/channelmanager.rs | 86.17% | 21 Missing and 5 partials ⚠️ |
| lightning/src/ln/monitor_tests.rs | 95.33% | 5 Missing and 6 partials ⚠️ |
| lightning/src/chain/package.rs | 85.18% | 4 Missing ⚠️ |
| lightning/src/ln/payment_tests.rs | 94.02% | 2 Missing and 2 partials ⚠️ |
| lightning/src/chain/channelmonitor.rs | 96.00% | 2 Missing ⚠️ |
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3984      +/-   ##
==========================================
- Coverage   88.94%   88.94%   -0.01%
==========================================
  Files         174      174
  Lines      124201   124707     +506
  Branches   124201   124707     +506
==========================================
+ Hits       110472   110920     +448
- Misses      11251    11290      +39
- Partials     2478     2497      +19
```
| Flag | Coverage Δ |
|---|---|
| fuzzing | 22.58% <4.41%> (-0.06%) ⬇️ |
| tests | 88.77% <92.52%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.