Peer Storage (Part 3): Identifying Lost Channel States #3897


Open
adi2011 wants to merge 7 commits into main from peer-storage/serialise-deserialise

Conversation

Contributor

@adi2011 adi2011 commented Jun 28, 2025

In this PR, we begin serializing the ChannelMonitors and sending them over to determine whether any states were lost upon retrieval.

The next PR will be the final one, where we use FundRecoverer to initiate a force close and potentially go on-chain using a penalty transaction.

Sorry for the delay!

@ldk-reviews-bot

ldk-reviews-bot commented Jun 28, 2025

👋 Thanks for assigning @tnull as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull requested review from tnull and removed request for joostjager June 28, 2025 11:17
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from 101f31c to a35566a on June 29, 2025 05:03
adi2011 added 4 commits June 29, 2025 10:34
'PeerStorageMonitorHolder' is used to wrap a single ChannelMonitor; here we are
adding some fields separately so that we do not need to read the whole ChannelMonitor
to determine whether we have lost some states.

`PeerStorageMonitorHolderList` is used to keep the list of all the channels which would
be sent over the wire inside Peer Storage.
Create a utility function to prevent code duplication while writing ChannelMonitors.

Serialise them inside ChainMonitor::send_peer_storage and send them over.
TODO: Peer storage should not cross the 64k limit.
Deserialise the ChannelMonitors and compare the data to determine if we have
lost some states.
The node should now determine lost states using the retrieved peer storage.
@adi2011 adi2011 force-pushed the peer-storage/serialise-deserialise branch from a35566a to 4c9f3c3 on June 29, 2025 05:04

codecov bot commented Jun 29, 2025

Codecov Report

Attention: Patch coverage is 54.29864% with 101 lines in your changes missing coverage. Please review.

Project coverage is 88.86%. Comparing base (61a37b1) to head (4c9f3c3).

Files with missing lines | Patch % | Lines
lightning/src/chain/channelmonitor.rs | 40.17% | 8 Missing and 62 partials ⚠️
lightning/src/ln/channelmanager.rs | 58.49% | 22 Missing ⚠️
lightning/src/ln/our_peer_storage.rs | 70.37% | 0 Missing and 8 partials ⚠️
lightning/src/chain/chainmonitor.rs | 95.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3897      +/-   ##
==========================================
- Coverage   88.86%   88.86%   -0.01%     
==========================================
  Files         165      165              
  Lines      118886   118962      +76     
  Branches   118886   118962      +76     
==========================================
+ Hits       105650   105710      +60     
- Misses      10911    10923      +12     
- Partials     2325     2329       +4     

☔ View full report in Codecov by Sentry.

///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
Collaborator

@wpaulino what do you think we should reasonably cut here to reduce the size of a ChannelMonitor, without making the emergency-case ChannelMonitors so different from the regular ones that we'd induce more code changes across channelmonitor.rs? Obviously we should avoid counterparty_claimable_outpoints, but how much code is gonna break in doing so?

Contributor

Not too familiar with the goals here, but if the idea is for the emergency-case ChannelMonitor to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even counterparty_claimable_outpoints (without the sources though) since the counterparty could broadcast a revoked commitment.

Collaborator

Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up but that's okay.

}
},
None => {
// TODO: Figure out if this channel is so old that we have forgotten about it.
Collaborator

There's no need to worry here, I think. If the channel is gone we either haven't fallen behind (probably) or we already broadcasted a stale state (because we broadcast on startup if the channel is gone and we have a ChannelMonitor), at which point we're screwed. So nothing to do here.

Contributor Author

Thanks for clarifying, I will remove this.

@ldk-reviews-bot

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Contributor

@tnull tnull left a comment

Took a first look, but will hold off on going into more detail until we've decided which way we should go with the ChannelMonitor stub.

},

Err(e) => {
panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);
Contributor

I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but really no reason to refuse to at least keep other potential channels operational.

Contributor Author

Yes, that makes sense, I think we should only panic if we have determined that we have lost some channel state.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

A few more comments; let's move forward without blocking on the ChannelMonitor serialization stuff.

Comment on lines 209 to 223
impl Writeable for PeerStorageMonitorHolderList {
fn write<W: Writer>(&self, w: &mut W) -> Result<(), io::Error> {
encode_tlv_stream!(w, { (1, &self.monitors, required_vec) });
Ok(())
}
}

impl Readable for PeerStorageMonitorHolderList {
fn read<R: io::Read>(r: &mut R) -> Result<Self, DecodeError> {
let mut monitors: Option<Vec<PeerStorageMonitorHolder>> = None;
decode_tlv_stream!(r, { (1, monitors, optional_vec) });

Ok(PeerStorageMonitorHolderList { monitors: monitors.ok_or(DecodeError::InvalidValue)? })
}
}
Collaborator

You should be able to replace both of these with a single impl_writeable_tlv_based macro call.
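For illustration, a sketch of what that single macro call could look like (keeping the TLV type number used in the manual stream above):

impl_writeable_tlv_based!(PeerStorageMonitorHolderList, {
    (1, monitors, required_vec),
});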

Contributor Author

Fixed, thanks for this!

let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();

// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.
Collaborator

This should be pretty easy: we have random bytes, so just make an outer loop that selects a random monitor (by doing monitors.iter().skip(random_usize % monitors.len()).next()).

@@ -8807,6 +8808,7 @@ This indicates a bug inside LDK. Please report this error at https://github.com/
&self, peer_node_id: PublicKey, msg: msgs::PeerStorageRetrieval,
) -> Result<(), MsgHandleErrInternal> {
// TODO: Check if have any stale or missing ChannelMonitor.
let per_peer_state = self.per_peer_state.read().unwrap();
Collaborator

No need to take the (read) lock at the top, do it after we decrypt.

Contributor Author

Fixed!

Contributor

@tnull tnull left a comment

Did a ~first pass.

This needs a rebase now, in particular now that #3922 landed.

@@ -810,10 +813,53 @@ where
}

fn send_peer_storage(&self, their_node_id: PublicKey) {
// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.

static MAX_PEER_STORAGE_SIZE: usize = 65000;
Contributor

This should be a const rather than static, I think? Also, it would probably make sense to add this at the module level, with some docs.

Collaborator

Also isn't the max size 64 KiB, not 65K?
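For reference, the module-level shape being suggested; the exact value is the open question above:

/// Maximum number of bytes of serialized ChannelMonitor data we will pack into a single
/// peer storage backup (placeholder value pending the 64 KiB discussion above).
const MAX_PEER_STORAGE_SIZE: usize = 65_000;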

let random_bytes = self.entropy_source.get_secure_random_bytes();
let serialised_channels = Vec::new();
let random_usize = usize::from_le_bytes(random_bytes[0..8].try_into().unwrap());
Contributor

Depending on the platform, a usize might not always be 8 bytes. You'll probably need to do

const USIZE_LEN: usize = core::mem::size_of::<usize>();

and use that instead of 8.
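A minimal, self-contained illustration of that suggestion (the helper name is made up for the example; assumes len > 0):

fn random_index(random_bytes: &[u8; 32], len: usize) -> usize {
    const USIZE_LEN: usize = core::mem::size_of::<usize>();
    // Take only as many bytes as a usize holds on this platform instead of hard-coding 8.
    let raw = usize::from_le_bytes(random_bytes[..USIZE_LEN].try_into().unwrap());
    raw % len
}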

/// NOTE: `is_stub` is true only when we are using this to serialise for Peer Storage.
///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
Contributor

Please don't add rustfmt::skip when introducing new code. Might be good to introduce a commit in the beginning that removes the skip from fn write, so that the code changes here are in fact mostly code moves.

let mut curr_size = 0;

// Randomising Keys in the HashMap to fetch monitors without repetition.
let mut keys: Vec<&ChannelId> = monitors.keys().collect();
Contributor

Can we make this a bit cleaner by using the proposed iterator skipping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?
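One self-contained way to sketch that selection logic, with placeholder types standing in for the serialized monitors (the real loop would walk the monitors map and could re-derive the index from fresh entropy each iteration):

fn select_random_under_budget(
    mut candidates: Vec<Vec<u8>>, random_usize: usize, max_bytes: usize,
) -> Vec<Vec<u8>> {
    let mut selected = Vec::new();
    let mut used = 0;
    // swap_remove both picks and removes an entry, so no monitor is written twice.
    while !candidates.is_empty() {
        let idx = random_usize % candidates.len();
        let candidate = candidates.swap_remove(idx);
        if used + candidate.len() > max_bytes {
            break;
        }
        used += candidate.len();
        selected.push(candidate);
    }
    selected
}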

let min_seen_secret = mon.monitor.get_min_seen_secret();
let counterparty_node_id = mon.monitor.get_counterparty_node_id();

match write_util(&mon.monitor.inner.lock().unwrap(), true, &mut ser_chan) {
Contributor

nit: Please move taking the lock out into a dedicated variable. This would also make it easier to spot the scoping of the lock, IMO.
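For example, a sketch of the nit (names taken from the snippet above):

let inner = mon.monitor.inner.lock().unwrap();
match write_util(&inner, true, &mut ser_chan) {
    // ... existing Ok/Err arms unchanged, with the guard's scope now explicit ...
}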


match write_util(&mon.monitor.inner.lock().unwrap(), true, &mut ser_chan) {
Ok(_) => {
let mut ser_channel = Vec::new();
Contributor

I think instead of creating a new Vec and then writing to it, you should be able to just call encode on the PeerStorageMonitorHolder.

But I'm currently confused about what we use ser_channel for to begin with: is it just to calculate the length below? That seems like a big unnecessary allocation. You could use serialized_length, for example, and keep track of the written bytes, comparing them to MAX_PEER_STORAGE_SIZE.
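A sketch of that alternative (curr_size, peer_storage_monitor, monitors_list, and MAX_PEER_STORAGE_SIZE are names from the surrounding diff; serialized_length is provided by the Writeable trait):

let monitor_size = peer_storage_monitor.serialized_length();
if curr_size + monitor_size > MAX_PEER_STORAGE_SIZE {
    // Skip (or stop) once the budget is exhausted rather than serializing into a throwaway Vec.
    continue;
}
curr_size += monitor_size;
monitors_list.monitors.push(peer_storage_monitor);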

}

let mut serialised_channels = Vec::new();
monitors_list.write(&mut serialised_channels).unwrap();
Contributor

Same here, just use encode.

monitors_list.monitors.push(peer_storage_monitor);
},
Err(_) => {
panic!("Can not write monitor for {}", mon.monitor.channel_id())
Contributor

Really, please avoid these explicit panics in any of this code.

///
/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
#[rustfmt::skip]
pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {
Contributor

nit: We often use an internal_ prefix or _internal suffix when splitting out functionality into util methods. This is no strict rule, but given the precedent it might be an easier-to-understand name.

let per_peer_state = self.per_peer_state.read().unwrap();

let mut cursor = io::Cursor::new(decrypted);
match <PeerStorageMonitorHolderList as Readable>::read(&mut cursor) {
Contributor

nit: It might be cleaner to split the read call out onto its own line using Readable::read. Given that we don't want to panic, we can probably also avoid the huge match by using map_err or unwrap_or_else.
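A sketch of that shape; the fallback and log call here are assumptions (and assume the list's monitors field is visible at this call site), in line with the earlier point about not panicking on undecodable peer storage:

let mut cursor = io::Cursor::new(decrypted);
let mon_list: PeerStorageMonitorHolderList = Readable::read(&mut cursor)
    .unwrap_or_else(|e| {
        // Unreadable peer storage is not fatal; fall back to an empty list and skip the comparison.
        log_debug!(self.logger, "Failed to deserialize PeerStorageMonitorHolderList: {:?}", e);
        PeerStorageMonitorHolderList { monitors: Vec::new() }
    });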
