Peer Storage (Part 3): Identifying Lost Channel States #3897

adi2011 · 2025-06-28T10:21:48Z

In this PR, we begin serializing the ChannelMonitors and sending them over to determine whether any states were lost upon retrieval.

The next PR will be the final one, where we use FundRecoverer to initiate a force close and potentially go on-chain using a penalty transaction.

Sorry for the delay!

ldk-reviews-bot · 2025-06-28T10:21:51Z

👋 Thanks for assigning @tnull as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

codecov · 2025-06-29T05:13:45Z

Codecov Report

❌ Patch coverage is 59.57447% with 76 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.69%. Comparing base (ff279d6) to head (6328b33).
⚠️ Report is 50 commits behind head on main.

Files with missing lines	Patch %	Lines
lightning/src/chain/channelmonitor.rs	50.35%	12 Missing and 57 partials ⚠️
lightning/src/ln/channelmanager.rs	85.71%	5 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3897      +/-   ##
==========================================
- Coverage   88.91%   88.69%   -0.23%     
==========================================
  Files         173      173              
  Lines      123393   123735     +342     
  Branches   123393   123735     +342     
==========================================
+ Hits       109717   109741      +24     
- Misses      11216    11624     +408     
+ Partials     2460     2370      -90

Flag	Coverage Δ
fuzzing	`?`
tests	`88.69% <59.57%> (-0.06%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

TheBlueMatt · 2025-06-30T01:05:52Z

lightning/src/chain/channelmonitor.rs

+///
+/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
+#[rustfmt::skip]
+pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {


@wpaulino what do you think we should reasonably cut here to reduce the size of a ChannelMonitor without making the emergency-case ChannelMonitors all that different from the regular ones to induce more code changes across channelmonitor.rs? Obviously we should avoid counterparty_claimable_outpoints, but how much code is gonna break in doing so?

Not too familiar with the goals here, but if the idea is for the emergency-case ChannelMonitor to be able to recover funds, wouldn't it need to handle a commitment confirmation from either party? That means we need to track most things, even counterparty_claimable_outpoints (without the sources though) since the counterparty could broadcast a revoked commitment.

Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up but that's okay.

Yes, eventually we'll need to figure out what can be stripped from ChannelMonitors. In CLN, we use a separate struct that holds only the minimal data needed to reconstruct a ChannelMonitor with just the essential information to grab funds or penalise the peer.

lightning/src/ln/channelmanager.rs

ldk-reviews-bot · 2025-06-30T11:18:09Z

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

tnull

Took a first look, but will hold off with going more into details until we decided on which way we should go with the ChannelMonitor stub,

tnull · 2025-06-30T12:21:17Z

lightning/src/ln/channelmanager.rs

+			},
+
+			Err(e) => {
+				panic!("Wrong serialisation of PeerStorageMonitorHolderList: {}", e);


I don't think we should ever panic in any of this code. Yes, something might be wrong if we have peer storage data we can't read anymore, but really no reason to refuse to at least keep other potential channels operational.

Yes, that makes sense, I think we should only panic if we have determined that we have lost some channel state.

TheBlueMatt

Few more comments, let's move forward without blocking on the ChannelMonitor serialization stuff.

lightning/src/ln/our_peer_storage.rs

TheBlueMatt · 2025-07-09T22:25:29Z

lightning/src/chain/channelmonitor.rs

+///
+/// [`ChainMonitor`]: crate::chain::chainmonitor::ChainMonitor
+#[rustfmt::skip]
+pub(crate) fn write_util<Signer: EcdsaChannelSigner, W: Writer>(channel_monitor: &ChannelMonitorImpl<Signer>, is_stub: bool, writer: &mut W) -> Result<(), Error> {


Basically. I think ideally we find a way to store everything (required) but counterparty_claimable_outpoints so that we can punish the counterparty on their balance+reserve if they broadcast a stale state, even if not HTLCs (though of course they can't claim the HTLCs without us being able to punish them on the next stage). Not sure how practical that is today without counterparty_claimable_outpoints but I think that's the goal.

@adi2011 maybe for now let's just write the full monitors, but leave a TODO to strip out what we can later. For larger nodes that means all our monitors will be too large and we'll never back any up but that's okay.

TheBlueMatt · 2025-07-09T22:26:56Z

lightning/src/chain/chainmonitor.rs

 		let random_bytes = self.entropy_source.get_secure_random_bytes();
-		let serialised_channels = Vec::new();
+
+		// TODO(aditya): Choose n random channels so that peer storage does not exceed 64k.


This should be pretty easy? We have random bytes, just make an outer loop that selects a random monitor (by doing monitors.iter().skip(random_usize % monitors.len()).next())

lightning/src/ln/channelmanager.rs

tnull

Did a ~first pass.

This needs a rebase now, in particular now that #3922 landed.

tnull · 2025-07-11T08:40:19Z

lightning/src/chain/chainmonitor.rs

@@ -810,10 +813,53 @@ where
 	}

 	fn send_peer_storage(&self, their_node_id: PublicKey) {
-		// TODO: Serialize `ChannelMonitor`s inside `our_peer_storage`.
-
+		static MAX_PEER_STORAGE_SIZE: usize = 65000;


This should be a const rather than static, I think? Also, would probably make sense to add this add the module level, with some docs.

Also isn't the max size 64 KiB, not 65K?

Oh, my bad, thanks for pointing this out, It should be 65531.

lightning/src/chain/chainmonitor.rs

lightning/src/chain/channelmonitor.rs

tnull · 2025-07-11T08:52:19Z

lightning/src/chain/chainmonitor.rs

+		let mut curr_size = 0;
+
+		// Randomising Keys in the HashMap to fetch monitors without repetition.
+		let mut keys: Vec<&ChannelId> = monitors.keys().collect();


Can we make this a bit cleaner by using the proposed iterator skiping approach in the loop below, maybe while simply keeping track of which monitors we already wrote?

I will address this in the next fixup.

Seems this is still unaddressed?

Gentle ping here.

Created a BTree to keep track of the monitors and ran a while loop until we either ran out of peer storage capacity or all monitors were stored.

lightning/src/chain/chainmonitor.rs

tnull · 2025-07-11T09:02:55Z

lightning/src/chain/chainmonitor.rs

+					monitors_list.monitors.push(peer_storage_monitor);
+				},
+				Err(_) => {
+					panic!("Can not write monitor for {}", mon.monitor.channel_id())


Really, please avoid these explicit panics in any of this code.

This would panic only if there is some issue with the write_util which suggests that we are unable to serialise the channelmonitor, should we just log the error here instead?

Mhh, while I'd still prefer to never panic in any of the peer storage code, indeed this would likely only get hit if we run out of memory or similar. If we keep it, let's just expect on the write_internal call, which avoids all this matching.

lightning/src/chain/channelmonitor.rs

lightning/src/ln/channelmanager.rs

tnull · 2025-07-17T07:08:51Z

lightning/src/chain/chainmonitor.rs

 		let random_bytes = self.entropy_source.get_secure_random_bytes();
-		let serialised_channels = Vec::new();
+		let random_usize = usize::from_le_bytes(random_bytes[0..core::mem::size_of::<usize>()].try_into().unwrap());


nit: Let's keep the length in a separate const, avoiding to make this line overly long/noisy.

Fixed, thanks!

tnull · 2025-07-17T07:11:38Z

lightning/src/chain/chainmonitor.rs

+					monitors_list.monitors.push(peer_storage_monitor);
+				},
+				Err(_) => {
+					panic!("Can not write monitor for {}", mon.monitor.channel_id())


Mhh, while I'd still prefer to never panic in any of the peer storage code, indeed this would likely only get hit if we run out of memory or similar. If we keep it, let's just expect on the write_internal call, which avoids all this matching.

tnull · 2025-07-17T07:15:16Z

lightning/src/chain/chainmonitor.rs

+			match write_util_internal(&chan_mon, true, &mut ser_chan) {
+				Ok(_) => {
+					// Adding size of peer_storage_monitor.
+					curr_size += ser_chan.0.serialized_length()


If we move this check below constructing PeerStorageMonitorHolder, can we just use it's serialized_lenght implementation instead of tallying up the individual fields here?

My bad, thanks for pointing this out. Fixed.

tnull · 2025-07-17T07:18:22Z

lightning/src/ln/channelmanager.rs

+				},
+				None => {
+					// TODO: Figure out if this channel is so old that we have forgotten about it.
+					panic!("Lost a channel {}", &mon_holder.channel_id);


Why panic here? This should be the common case, no?

We need this when we’ve completely forgotten about a channel but it’s still active. The only false positives occur if we’ve closed a channel and then forgotten about it. Any ideas on how we could detect those?

We need this when we’ve completely forgotten about a channel but it’s still active.

I don't think this is accurate? Wouldn't we end up here simply in this scenario:

We have a channel, store an encrypted backup with a peer

The peer goes offline

We close the channel, drop it from the per_peer_state

The peer comes back online

We retrieve and decrypt a peer backup with a channel state that we no longer know anything about.

We panic and crash the node

This can happen very easily / is the common case, no? Also note this same flow could intentionally get abused by a malicious peer, too. The counterparty would simply need to guess that we include a certain channel in the stored blob, remember it, wait until the channel is closed, and then simply replay the old backup to crash our node.

I still maintain we should never panic in any of this PeerStorage related code, maybe mod the above case where we're positive that our local state is outdated compared to the backed-up version.

That said, we'll indeed need to discern between the two cases (unexpected data loss / expected data loss). One variant to do this would be to first check whether we see a funding transaction spend, although note that this would be an async flow that we can't just directly initiate here. Or we need to maintain a list of all closed and remember it forever. The latter might indeed be the simplest and safest way, if we're fine with keeping the additional data. In any case, we need to find a solution here before moving on, as panicking here is unacceptable.

Yes, I agree, this is a false positive where the node crashes, but it shouldn’t. The happy path here is:

We have a channel with a peer.

Something happens and we lose the channel (but it is unclosed); the node cannot identify it either.

We retrieve the ChannelMonitor from peer storage.

We realise that this is an unknown, old channel, and the node crashes because it no longer has any record of an active channel.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

Yes, I agree, this is a false positive where the node crashes

We cannot tolerate a false positive if we're going to panic.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

There is no way to do this in LDK today, and I don't think we'll ever get one. We don't have some global way to query if an arbitrary UTXO is unspent (well, we do in gossip but its not always available), and once a channel is closed and the funds swept we may delete the ChannelMonitor because there's no reason to keep it around. Keeping around information about every past channel we've ever had just to detect a lost channel here doesn't seem worth it to me.

It does mean we won't detect if we lose all of our state, but I think that's okay - a user will notice that their entire balance is missing :)

Yes, it should not be there. I put up the TODO so that we can have a discussion around this.
Changing the panic to log_debug.

tnull · 2025-07-17T07:19:48Z

lightning/src/ln/our_peer_storage.rs

+
+impl_writeable_tlv_based!(PeerStorageMonitorHolder, {
+	(0, channel_id, required),
+	(1, counterparty_node_id, required),


For new objects, you should be able to make all fields even-numbered as they are required from the getgo, basically.

tnull · 2025-07-17T07:27:29Z

lightning/src/ln/our_peer_storage.rs

+/// wrapped inside [`PeerStorage`].
+///
+/// [`PeerStorage`]: crate::ln::msgs::PeerStorage
+pub(crate) struct PeerStorageMonitorHolderList {


I think we should be able to avoid this new type wrapper if we add a impl_writeable_for_vec entry for PeerStorageMonitorHolder in lightning/src/util/ser.rs.

Seems this is still unaddressed?

Yes, sorry, I will address this in the next fixup.

lightning/src/chain/channelmonitor.rs

ldk-reviews-bot · 2025-07-19T15:12:34Z

🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

TheBlueMatt

Looks like you forgot to push after addressing @tnull's last comments? Also needs a rebase.

adi2011 · 2025-07-21T13:15:18Z

Oh sorry, my bad, I have pushed the fixups and rebased.

TheBlueMatt

I think if we just remove the is_stub check and the spurious panic we can land this (feel free to squash fixups, IMO). Sadly we will need to add a cfg flag before 0.2 (unless we can nail down the serialization format we want), but it doesn't have to hold up this PR or progress.

TheBlueMatt · 2025-07-22T12:32:58Z

lightning/src/chain/channelmonitor.rs


-		self.lockdown_from_offchain.write(writer)?;
-		self.holder_tx_signed.write(writer)?;
+	if !is_stub {


For now lets not do anything differently here, I don't think skipping the onchain_tx_handler is safe, but we're not quite sure what the right answer is yet. Because we won't (yet) finalize the serialization format, we should also cfg-tag out the actual send logic in send_peer_storage, sadly.

Should I not use is_stub, inside the function or should I remove the parameter from the function definition.

Also, should I make a new cfg tag for peer-storage or use an existing?

Should I not use is_stub, inside the function or should I remove the parameter from the function definition.

I think its fine to add it with a TODO: figure out which fields should go here and which do not.

Also, should I make a new cfg tag for peer-storage or use an existing?

I believe a new one (you'll have to add it in ci/ci-tests.sh as well)

Cool, I will cfg-tag the sending logic. In the next PR we can discuss when to send peer storage along with what to omit from ChannelMonitors.

TheBlueMatt · 2025-07-22T12:38:35Z

lightning/src/ln/channelmanager.rs

+				},
+				None => {
+					// TODO: Figure out if this channel is so old that we have forgotten about it.
+					panic!("Lost a channel {}", &mon_holder.channel_id);


Yes, I agree, this is a false positive where the node crashes

We cannot tolerate a false positive if we're going to panic.

There should be a way to determine whether a channel is closed, so that if we receive an unknown channel and its funding TXO is unspent, we crash the node for recovery. But if the funding is already spent, we shouldn’t worry unless they have published an old state. Right?

There is no way to do this in LDK today, and I don't think we'll ever get one. We don't have some global way to query if an arbitrary UTXO is unspent (well, we do in gossip but its not always available), and once a channel is closed and the funds swept we may delete the ChannelMonitor because there's no reason to keep it around. Keeping around information about every past channel we've ever had just to detect a lost channel here doesn't seem worth it to me.

It does mean we won't detect if we lose all of our state, but I think that's okay - a user will notice that their entire balance is missing :)

'PeerStorageMonitorHolder' is used to wrap a single ChannelMonitor, here we are adding some fields separetly so that we do not need to read the whole ChannelMonitor to identify if we have lost some states. `PeerStorageMonitorHolderList` is used to keep the list of all the channels which would be sent over the wire inside Peer Storage.

Fixed formatting for write() in ChannelMonitorImpl. This would make the next commit cleaner by ensuring it only contains direct code shifts, without unrelated formatting changes.

TheBlueMatt

I think this is fine now, after the notes here are fixed and rustfmt is fixed.

TheBlueMatt · 2025-07-24T16:38:09Z

lightning/src/chain/channelmonitor.rs


-		self.lockdown_from_offchain.write(writer)?;
-		self.holder_tx_signed.write(writer)?;
+	if !is_stub {


Should I not use is_stub, inside the function or should I remove the parameter from the function definition.

I think its fine to add it with a TODO: figure out which fields should go here and which do not.

Also, should I make a new cfg tag for peer-storage or use an existing?

I believe a new one (you'll have to add it in ci/ci-tests.sh as well)

Create a utililty function to prevent code duplication while writing ChannelMonitors. Serialise them inside ChainMonitor::send_peer_storage and send them over. Cfg-tag the sending logic because we are unsure of what to omit from ChannelMonitors stored inside peer-storage.

Deserialise the ChannelMonitors and compare the data to determine if we have lost some states.

Node should now determine lost states using retrieved peer storage.

ldk-reviews-bot · 2025-07-26T11:21:32Z

🔔 1st Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

ldk-reviews-bot · 2025-07-28T11:22:08Z

🔔 2nd Reminder

Hey @tnull! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

tnull · 2025-07-28T11:56:25Z

Cargo.toml

@@ -68,4 +68,5 @@ check-cfg = [
    "cfg(splicing)",
    "cfg(async_payments)",
    "cfg(simple_close)",
+	"cfg(peer_storage)",


nit: Indentation is off.

Sorry, need to fix my indentation configuration. Fixed.

tnull · 2025-07-28T11:57:08Z

lightning/src/chain/chainmonitor.rs

@@ -47,6 +48,8 @@ use crate::types::features::{InitFeatures, NodeFeatures};
 use crate::util::errors::APIError;
 use crate::util::logger::{Logger, WithContext};
 use crate::util::persist::MonitorName;
+#[allow(unused_imports)]


nit: Please rather cfg-gate these imports, too.

tnull · 2025-07-28T12:00:04Z

lightning/src/chain/chainmonitor.rs

 		let random_bytes = self.entropy_source.get_secure_random_bytes();
-		let serialised_channels = Vec::new();
+
+		#[cfg(peer_storage)]


Why not cfg-gate the whole message then? This way it might get really confusing what is part and what isn't part of the cfg flag?

Ok, for now, cfg-tagging fn send_peer_storage and all the call sites of this function.

tnull · 2025-07-28T12:09:12Z

lightning/src/chain/chainmonitor.rs

+		{
+			const MAX_PEER_STORAGE_SIZE: usize = 65531;
+			const USIZE_LEN: usize = core::mem::size_of::<usize>();
+			let random_usize = usize::from_le_bytes(random_bytes[0..USIZE_LEN].try_into().unwrap());


nit: Let's just avoid the unnecessary unwrap here:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs index 1681d0187..ba7613f8e 100644 --- a/lightning/src/chain/chainmonitor.rs +++ b/lightning/src/chain/chainmonitor.rs @@ -816,12 +816,13 @@ where #[allow(unused_mut)] let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new(); let random_bytes = self.entropy_source.get_secure_random_bytes(); - #[cfg(peer_storage)] { const MAX_PEER_STORAGE_SIZE: usize = 65531; const USIZE_LEN: usize = core::mem::size_of::<usize>(); - let random_usize = usize::from_le_bytes(random_bytes[0..USIZE_LEN].try_into().unwrap()); + let mut usize_bytes = [0u8; USIZE_LEN]; + usize_bytes.copy_from_slice(&random_bytes); + let random_usize = usize::from_le_bytes(usize_bytes); let mut curr_size = 0; let monitors = self.monitors.read().unwrap(); // Randomising Keys in the HashMap to fetch monitors without repetition.

Thanks for this cleanup. Fixed.

tnull · 2025-07-28T12:10:11Z

lightning/src/chain/chainmonitor.rs

+		let mut curr_size = 0;
+
+		// Randomising Keys in the HashMap to fetch monitors without repetition.
+		let mut keys: Vec<&ChannelId> = monitors.keys().collect();


Gentle ping here.

tnull · 2025-07-28T12:12:28Z

lightning/src/chain/chainmonitor.rs

+				let mut ser_chan = VecWriter(Vec::new());
+				let min_seen_secret = mon.monitor.get_min_seen_secret();
+				let counterparty_node_id = mon.monitor.get_counterparty_node_id();
+				let chan_mon = mon.monitor.inner.lock().unwrap();


Can we put this lock and the write call into a separate scope so we drop the lock ASAP again?

Yes, thanks for pointing this out. Fixed.

tnull · 2025-07-28T12:16:42Z

lightning/src/ln/channelmanager.rs

@@ -17117,38 +17165,62 @@ mod tests {

 	#[test]
 	#[rustfmt::skip]
+	#[cfg(peer_storage)]
+	#[should_panic(expected = "Lost channel state for channel ae3367da2c13bc1ceb86bf56418f62828f7ce9d6bfb15a46af5ba1f1ed8b124f.\n\


Rather than using should_panic here (where we can't discern when it's actually happening), can we rather use std::panic::catch_unwind to catch the specific panic on the specific call and then using assert to check it's an/the correct error?

The issue is, if I remove the should_panic the test panics during teardown (in the drop), not in the main test logic.

tnull · 2025-07-28T12:17:03Z

lightning/src/ln/channelmanager.rs

 			if let MessageSendEvent::SendChannelReestablish { ref node_id, ref msg } = msg {
 				nodes[0].node.handle_channel_reestablish(nodes[1].node.get_our_node_id(), msg);
 				assert_eq!(*node_id, nodes[0].node.get_our_node_id());
 			} else if let MessageSendEvent::SendPeerStorageRetrieval { ref node_id, ref msg } = msg {
+				// Should Panic here!


Let's check / assert it here then, see above.

tnull · 2025-07-28T12:18:25Z

lightning/src/ln/channelmanager.rs

 		nodes[0].node.peer_disconnected(nodes[1].node.get_our_node_id());
 		nodes[1].node.peer_disconnected(nodes[0].node.get_our_node_id());

+		// Reload Node!
+		// nodes[0].chain_source.clear_watched_txn_and_outputs();


Seems this is still uncommented out? Do we need clear_watched_txn_and_outputs at all then?

No, since we are not covering missing channelmonitors yet.

Right, then it might be preferable to just leave a TODO and drop the unused clear_watched_txn_and_outputs from this PR for now.

graphite-app · 2025-07-29T12:07:51Z

lightning/src/chain/chainmonitor.rs

+		while curr_size < MAX_PEER_STORAGE_SIZE
+			&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+		{
+			let idx = random_usize % monitors.len();
+			stored_chanmon_idx.insert(idx + 1);


The loop condition has a potential issue where it might not terminate if the same random index is repeatedly selected. Since stored_chanmon_idx is a BTreeSet, inserting the same value multiple times doesn't increase its size. If random_usize % monitors.len() keeps generating the same index, the condition *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() will remain true indefinitely.

Consider modifying the approach to ensure termination, such as:

Using a counter to limit iterations

Tracking already-processed indices in a separate set

Using a deterministic sequence instead of random selection

This would prevent potential infinite loops while still achieving the goal of selecting monitors for peer storage.

Suggested change

while curr_size < MAX_PEER_STORAGE_SIZE

&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()

{

let idx = random_usize % monitors.len();

stored_chanmon_idx.insert(idx + 1);

while curr_size < MAX_PEER_STORAGE_SIZE

&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()

&& stored_chanmon_idx.len() < monitors.len()

{

let idx = random_usize % monitors.len();

let inserted = stored_chanmon_idx.insert(idx + 1);

if inserted {

curr_size += 1;

} else {

// If we couldn't insert, try a different index next time

random_usize = random_usize.wrapping_add(1);

}

Spotted by Diamond

Is this helpful? React 👍 or 👎 to let us know.

@tnull, this is the same concern that I had in my mind as well. Wasn't the previous approach more efficient and secure?

How about something along these lines?:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs index 4a40ba872..dde117ca9 100644 --- a/lightning/src/chain/chainmonitor.rs +++ b/lightning/src/chain/chainmonitor.rs @@ -39,6 +39,7 @@ use crate::chain::{ChannelMonitorUpdateStatus, Filter, WatchedOutput}; use crate::events::{self, Event, EventHandler, ReplayEvent}; use crate::ln::channel_state::ChannelDetails; use crate::ln::msgs::{self, BaseMessageHandler, Init, MessageSendEvent, SendOnlyMessageHandler}; +#[cfg(peer_storage)] use crate::ln::our_peer_storage::{DecryptedOurPeerStorage, PeerStorageMonitorHolder}; use crate::ln::types::ChannelId; use crate::prelude::*; @@ -53,6 +54,8 @@ use crate::util::persist::MonitorName; use crate::util::ser::{VecWriter, Writeable}; use crate::util::wakers::{Future, Notifier}; use bitcoin::secp256k1::PublicKey; +#[cfg(peer_storage)] +use core::iter::Cycle; use core::ops::Deref; use core::sync::atomic::{AtomicUsize, Ordering}; @@ -808,6 +811,7 @@ where /// This function collects the counterparty node IDs from all monitors into a `HashSet`, /// ensuring unique IDs are returned. + #[cfg(peer_storage)] fn all_counterparty_node_ids(&self) -> HashSet<PublicKey> { let mon = self.monitors.read().unwrap(); mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect() @@ -815,51 +819,71 @@ where #[cfg(peer_storage)] fn send_peer_storage(&self, their_node_id: PublicKey) { - #[allow(unused_mut)] let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new(); let random_bytes = self.entropy_source.get_secure_random_bytes(); const MAX_PEER_STORAGE_SIZE: usize = 65531; const USIZE_LEN: usize = core::mem::size_of::<usize>(); - let mut usize_bytes = [0u8; USIZE_LEN]; - usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]); - let random_usize = usize::from_le_bytes(usize_bytes); + let mut random_bytes_cycle_iter = random_bytes.iter().cycle(); + + let mut current_size = 0; + let monitors_lock = self.monitors.read().unwrap(); + let mut channel_ids = monitors_lock.keys().copied().collect(); + + fn next_random_id( + channel_ids: &mut Vec<ChannelId>, + random_bytes_cycle_iter: &mut Cycle<core::slice::Iter<u8>>, + ) -> Option<ChannelId> { + if channel_ids.is_empty() { + return None; + } - let mut curr_size = 0; - let monitors = self.monitors.read().unwrap(); - let mut stored_chanmon_idx = alloc::collections::BTreeSet::<usize>::new(); - // Used as a fallback reference if the set is empty - let zero = 0; + let random_idx = { + let mut usize_bytes = [0u8; USIZE_LEN]; + usize_bytes.iter_mut().for_each(|b| { + *b = *random_bytes_cycle_iter.next().expect("A cycle never ends") + }); + // Take one more to introduce a slight misalignment. + random_bytes_cycle_iter.next().expect("A cycle never ends"); + usize::from_le_bytes(usize_bytes) % channel_ids.len() + }; + + Some(channel_ids.swap_remove(random_idx)) + } - while curr_size < MAX_PEER_STORAGE_SIZE - && *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() + while let Some(channel_id) = next_random_id(&mut channel_ids, &mut random_bytes_cycle_iter) { - let idx = random_usize % monitors.len(); - stored_chanmon_idx.insert(idx + 1); - let (cid, mon) = monitors.iter().skip(idx).next().unwrap(); + let monitor_holder = if let Some(monitor_holder) = monitors_lock.get(&channel_id) { + monitor_holder + } else { + debug_assert!( + false, + "Tried to access non-existing monitor, this should never happen" + ); + break; + }; - let mut ser_chan = VecWriter(Vec::new()); - let min_seen_secret = mon.monitor.get_min_seen_secret(); - let counterparty_node_id = mon.monitor.get_counterparty_node_id(); + let mut serialized_channel = VecWriter(Vec::new()); + let min_seen_secret = monitor_holder.monitor.get_min_seen_secret(); + let counterparty_node_id = monitor_holder.monitor.get_counterparty_node_id(); { - let chan_mon = mon.monitor.inner.lock().unwrap(); + let inner_lock = monitor_holder.monitor.inner.lock().unwrap(); - write_chanmon_internal(&chan_mon, true, &mut ser_chan) + write_chanmon_internal(&inner_lock, true, &mut serialized_channel) .expect("can not write Channel Monitor for peer storage message"); } let peer_storage_monitor = PeerStorageMonitorHolder { - channel_id: *cid, + channel_id, min_seen_secret, counterparty_node_id, - monitor_bytes: ser_chan.0, + monitor_bytes: serialized_channel.0, }; - // Adding size of peer_storage_monitor. - curr_size += peer_storage_monitor.serialized_length(); - - if curr_size > MAX_PEER_STORAGE_SIZE { - break; + current_size += peer_storage_monitor.serialized_length(); + if current_size > MAX_PEER_STORAGE_SIZE { + continue; } + monitors_list.push(peer_storage_monitor); }

There might be still some room for improvement, but IMO something like this would be a lot more readable.

+ if current_size > MAX_PEER_STORAGE_SIZE { + continue; }

Why not break here?

Say we have one huge monitor that doesn't fit in the backup and a few smaller ones. If we'd always abort whenever we draw the large monitor, we might unnecessarily skip backups for the smaller ones. Yes, we could get fancier around how/which monitors to include to make the best use out of the space given, but for now it seems fine to just walk the entire list?

Ah, but now looking at it again I spot a bug there, it should be:

let serialized_length = peer_storage_monitor.serialized_length(); if current_size + serialized_length > MAX_PEER_STORAGE_SIZE { continue; } else { current_size += serialized_length; monitors_list.push(peer_storage_monitor); }

tnull · 2025-07-31T11:41:10Z

lightning/src/chain/chainmonitor.rs

+		while curr_size < MAX_PEER_STORAGE_SIZE
+			&& *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len()
+		{
+			let idx = random_usize % monitors.len();
+			stored_chanmon_idx.insert(idx + 1);


How about something along these lines?:

diff --git a/lightning/src/chain/chainmonitor.rs b/lightning/src/chain/chainmonitor.rs index 4a40ba872..dde117ca9 100644 --- a/lightning/src/chain/chainmonitor.rs +++ b/lightning/src/chain/chainmonitor.rs @@ -39,6 +39,7 @@ use crate::chain::{ChannelMonitorUpdateStatus, Filter, WatchedOutput}; use crate::events::{self, Event, EventHandler, ReplayEvent}; use crate::ln::channel_state::ChannelDetails; use crate::ln::msgs::{self, BaseMessageHandler, Init, MessageSendEvent, SendOnlyMessageHandler}; +#[cfg(peer_storage)] use crate::ln::our_peer_storage::{DecryptedOurPeerStorage, PeerStorageMonitorHolder}; use crate::ln::types::ChannelId; use crate::prelude::*; @@ -53,6 +54,8 @@ use crate::util::persist::MonitorName; use crate::util::ser::{VecWriter, Writeable}; use crate::util::wakers::{Future, Notifier}; use bitcoin::secp256k1::PublicKey; +#[cfg(peer_storage)] +use core::iter::Cycle; use core::ops::Deref; use core::sync::atomic::{AtomicUsize, Ordering}; @@ -808,6 +811,7 @@ where /// This function collects the counterparty node IDs from all monitors into a `HashSet`, /// ensuring unique IDs are returned. + #[cfg(peer_storage)] fn all_counterparty_node_ids(&self) -> HashSet<PublicKey> { let mon = self.monitors.read().unwrap(); mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect() @@ -815,51 +819,71 @@ where #[cfg(peer_storage)] fn send_peer_storage(&self, their_node_id: PublicKey) { - #[allow(unused_mut)] let mut monitors_list: Vec<PeerStorageMonitorHolder> = Vec::new(); let random_bytes = self.entropy_source.get_secure_random_bytes(); const MAX_PEER_STORAGE_SIZE: usize = 65531; const USIZE_LEN: usize = core::mem::size_of::<usize>(); - let mut usize_bytes = [0u8; USIZE_LEN]; - usize_bytes.copy_from_slice(&random_bytes[0..USIZE_LEN]); - let random_usize = usize::from_le_bytes(usize_bytes); + let mut random_bytes_cycle_iter = random_bytes.iter().cycle(); + + let mut current_size = 0; + let monitors_lock = self.monitors.read().unwrap(); + let mut channel_ids = monitors_lock.keys().copied().collect(); + + fn next_random_id( + channel_ids: &mut Vec<ChannelId>, + random_bytes_cycle_iter: &mut Cycle<core::slice::Iter<u8>>, + ) -> Option<ChannelId> { + if channel_ids.is_empty() { + return None; + } - let mut curr_size = 0; - let monitors = self.monitors.read().unwrap(); - let mut stored_chanmon_idx = alloc::collections::BTreeSet::<usize>::new(); - // Used as a fallback reference if the set is empty - let zero = 0; + let random_idx = { + let mut usize_bytes = [0u8; USIZE_LEN]; + usize_bytes.iter_mut().for_each(|b| { + *b = *random_bytes_cycle_iter.next().expect("A cycle never ends") + }); + // Take one more to introduce a slight misalignment. + random_bytes_cycle_iter.next().expect("A cycle never ends"); + usize::from_le_bytes(usize_bytes) % channel_ids.len() + }; + + Some(channel_ids.swap_remove(random_idx)) + } - while curr_size < MAX_PEER_STORAGE_SIZE - && *stored_chanmon_idx.last().unwrap_or(&zero) < monitors.len() + while let Some(channel_id) = next_random_id(&mut channel_ids, &mut random_bytes_cycle_iter) { - let idx = random_usize % monitors.len(); - stored_chanmon_idx.insert(idx + 1); - let (cid, mon) = monitors.iter().skip(idx).next().unwrap(); + let monitor_holder = if let Some(monitor_holder) = monitors_lock.get(&channel_id) { + monitor_holder + } else { + debug_assert!( + false, + "Tried to access non-existing monitor, this should never happen" + ); + break; + }; - let mut ser_chan = VecWriter(Vec::new()); - let min_seen_secret = mon.monitor.get_min_seen_secret(); - let counterparty_node_id = mon.monitor.get_counterparty_node_id(); + let mut serialized_channel = VecWriter(Vec::new()); + let min_seen_secret = monitor_holder.monitor.get_min_seen_secret(); + let counterparty_node_id = monitor_holder.monitor.get_counterparty_node_id(); { - let chan_mon = mon.monitor.inner.lock().unwrap(); + let inner_lock = monitor_holder.monitor.inner.lock().unwrap(); - write_chanmon_internal(&chan_mon, true, &mut ser_chan) + write_chanmon_internal(&inner_lock, true, &mut serialized_channel) .expect("can not write Channel Monitor for peer storage message"); } let peer_storage_monitor = PeerStorageMonitorHolder { - channel_id: *cid, + channel_id, min_seen_secret, counterparty_node_id, - monitor_bytes: ser_chan.0, + monitor_bytes: serialized_channel.0, }; - // Adding size of peer_storage_monitor. - curr_size += peer_storage_monitor.serialized_length(); - - if curr_size > MAX_PEER_STORAGE_SIZE { - break; + current_size += peer_storage_monitor.serialized_length(); + if current_size > MAX_PEER_STORAGE_SIZE { + continue; } + monitors_list.push(peer_storage_monitor); }

There might be still some room for improvement, but IMO something like this would be a lot more readable.

tnull · 2025-07-31T11:43:50Z

lightning/src/chain/chainmonitor.rs

@@ -809,11 +813,57 @@ where
 		mon.values().map(|monitor| monitor.monitor.get_counterparty_node_id()).collect()
 	}

+	#[cfg(peer_storage)]


I think you'll also need to introduce the cfg-gate in a few other places to get rid of the warning, e.g., above for all_counterparty_node_ids and on some imports.

should I also cfg-gate entropy_source inside ChainMonitor?
If yes, I would have to cfg-gate a lot of things in other files as well...
I think, It would be better to allow unused on it, since we are going to remove the peer-storage cfg gate in the next PR...

Well, we need to add the cfg-gate where ever we'll get a warning. I'd prefer to avoid allow_unused, especially as it might easily slip through (i.e., we might not remember to remove it again going forward), while for the cfg guard we will be forced to make the cleanup.

ldk-reviews-bot requested a review from joostjager June 28, 2025 10:32

tnull requested review from tnull and removed request for joostjager June 28, 2025 11:17

adi2011 force-pushed the peer-storage/serialise-deserialise branch 2 times, most recently from a35566a to 4c9f3c3 Compare June 29, 2025 05:04

TheBlueMatt reviewed Jun 30, 2025

View reviewed changes

tnull reviewed Jun 30, 2025

View reviewed changes

TheBlueMatt reviewed Jul 9, 2025

View reviewed changes

tnull reviewed Jul 11, 2025

View reviewed changes

adi2011 force-pushed the peer-storage/serialise-deserialise branch from 676afbc to 1111f05 Compare July 16, 2025 20:41

tnull reviewed Jul 17, 2025

View reviewed changes

adi2011 requested a review from TheBlueMatt July 17, 2025 15:12

TheBlueMatt reviewed Jul 21, 2025

View reviewed changes

adi2011 force-pushed the peer-storage/serialise-deserialise branch from 6dede0e to c9baefb Compare July 21, 2025 13:12

adi2011 requested a review from TheBlueMatt July 21, 2025 13:12

TheBlueMatt reviewed Jul 22, 2025

View reviewed changes

adi2011 force-pushed the peer-storage/serialise-deserialise branch 4 times, most recently from f95f238 to 4304853 Compare July 24, 2025 11:16

adi2011 added 2 commits July 24, 2025 16:49

Remove #[rustfmt::skip] from fn write

28d7874

Fixed formatting for write() in ChannelMonitorImpl. This would make the next commit cleaner by ensuring it only contains direct code shifts, without unrelated formatting changes.

adi2011 force-pushed the peer-storage/serialise-deserialise branch from 4304853 to e76f538 Compare July 24, 2025 11:20

adi2011 requested review from tnull and TheBlueMatt July 24, 2025 11:20

adi2011 force-pushed the peer-storage/serialise-deserialise branch from e76f538 to 75a8fd6 Compare July 24, 2025 11:24

TheBlueMatt reviewed Jul 24, 2025

View reviewed changes

adi2011 force-pushed the peer-storage/serialise-deserialise branch from 75a8fd6 to aa08357 Compare July 25, 2025 05:03

adi2011 added 3 commits July 25, 2025 11:31

Determine if we have lost data

2b3fdb9

Deserialise the ChannelMonitors and compare the data to determine if we have lost some states.

test: Modify test_peer_storage to check latest changes

662fff1

Node should now determine lost states using retrieved peer storage.

adi2011 force-pushed the peer-storage/serialise-deserialise branch from aa08357 to 662fff1 Compare July 25, 2025 06:02

fixup: Allow unused_import, these are used in cfg(peer_storage)

8a05784

tnull reviewed Jul 28, 2025

View reviewed changes

adi2011 added 2 commits July 29, 2025 17:31

fixup: Serialise ChannelMonitors and send them over inside Peer Storage

e1f702f

fixup: test: Modify test_peer_storage to check latest changes

6328b33

graphite-app bot reviewed Jul 29, 2025

View reviewed changes

adi2011 requested a review from tnull July 29, 2025 16:17

tnull reviewed Jul 31, 2025

View reviewed changes

Peer Storage (Part 3): Identifying Lost Channel States #3897

Are you sure you want to change the base?

Peer Storage (Part 3): Identifying Lost Channel States #3897

Uh oh!

Conversation

adi2011 commented Jun 28, 2025

Uh oh!

ldk-reviews-bot commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Jun 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ldk-reviews-bot commented Jun 30, 2025

Uh oh!

tnull left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TheBlueMatt left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

tnull left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

ldk-reviews-bot commented Jun 28, 2025 •

edited

Loading

codecov bot commented Jun 29, 2025 •

edited

Loading