Commit 644d9d9
docs: add comprehensive gossip rate limiting guide
In this commit, we add detailed documentation to help node operators understand and configure the gossip rate limiting system effectively. The new guide addresses a critical knowledge gap that has led to misconfigured nodes experiencing synchronization failures.

The documentation covers the token bucket algorithm used for rate limiting, providing clear formulas and examples for calculating appropriate values based on node size and network position. We include specific recommendations ranging from 50 KB/s for small nodes to 1 MB/s for large routing nodes, with detailed calculations showing how these values are derived. The guide explains the relationship between rate limiting and other configuration options like num-restricted-slots and the new filter-concurrency setting.

We provide troubleshooting steps for common issues like slow initial sync and peer disconnections, along with debug commands and log patterns to identify problems. Configuration examples are provided for conservative, balanced, and performance-oriented setups, giving operators concrete starting points they can adapt to their specific needs.

The documentation emphasizes the importance of not setting rate limits too low, warning that values below 50 KB/s can cause synchronization to fail entirely.
1 parent a4fc737 commit 644d9d9

File tree

docs/gossip_rate_limiting.md

1 file changed: 257 additions & 0 deletions
# Gossip Rate Limiting Configuration Guide

When running a Lightning node, one of the most critical yet often overlooked aspects is properly configuring the gossip rate limiting system. This guide will help you understand how LND manages outbound gossip traffic and how to tune these settings for your specific needs.
## Understanding Gossip Rate Limiting

At its core, LND uses a token bucket algorithm to control how much bandwidth it dedicates to sending gossip messages to other nodes. Think of it as a bucket that fills with tokens at a steady rate. Each time your node sends a gossip message, it consumes tokens equal to the message size. If the bucket runs dry, messages must wait until enough tokens accumulate.

This system serves an important purpose: it prevents any single peer, or group of peers, from overwhelming your node's network resources. Without rate limiting, a misbehaving peer could request your entire channel graph repeatedly, consuming all your bandwidth and preventing normal operation.
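The token bucket behavior described above can be sketched in a few lines. This is an illustrative model only, not LND's actual implementation (LND is written in Go, and its real limiter is shared across all peers); the parameter names map onto the two settings covered below.

```python
class TokenBucket:
    """Illustrative model of the gossip limiter: rate_bytes maps to
    gossip.msg-rate-bytes, burst_bytes to gossip.msg-burst-bytes."""

    def __init__(self, rate_bytes, burst_bytes):
        self.rate = rate_bytes      # tokens (bytes) added per second
        self.burst = burst_bytes    # bucket capacity; also the max message size
        self.tokens = burst_bytes   # start with a full bucket
        self.last = 0.0             # timestamp of the last refill

    def send_delay(self, msg_size, now):
        """Seconds to wait before msg_size bytes may be sent, or None if
        the message exceeds the bucket and can never be sent."""
        if msg_size > self.burst:
            return None
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= msg_size:
            self.tokens -= msg_size  # enough tokens: send immediately
            return 0.0
        # Not enough tokens: the caller would wait for the deficit to refill.
        return (msg_size - self.tokens) / self.rate
```

With the default values (rate 102,400, burst 204,800), a single 200 KB reply can leave a full bucket immediately, but a second one issued right away would wait about two seconds for the bucket to refill.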
## Core Configuration Options

The gossip rate limiting system has several configuration options that work together to control your node's behavior.
### Setting the Sustained Rate: gossip.msg-rate-bytes

The most fundamental setting is `gossip.msg-rate-bytes`, which determines how many bytes per second your node will allocate to outbound gossip messages. This rate is shared across all connected peers, not per-peer.

The default value of 102,400 bytes per second (100 KB/s) works well for most nodes, but you may need to adjust it based on your situation. Setting this value too low can cause serious problems. When the rate limit is exhausted, peers waiting to synchronize must queue up, potentially waiting minutes between messages. Values below 50 KB/s can make initial synchronization fail entirely, as peers time out before receiving the data they need.
### Managing Burst Capacity: gossip.msg-burst-bytes

The burst capacity, configured via `gossip.msg-burst-bytes`, determines the maximum message size that can be sent at once. Despite what you might think, this isn't about handling traffic spikes; it's simply the size of your token bucket. Any single message larger than this value can never be sent, regardless of how long you wait.

The default of 204,800 bytes (200 KB) is carefully chosen to accommodate the largest gossip messages while preventing excessive bursts. There's rarely a need to change this value unless you're seeing specific errors about message size limits.
### Controlling Concurrent Operations: gossip.filter-concurrency

When peers apply gossip filters to request specific channel updates, these operations can consume significant resources. The `gossip.filter-concurrency` setting limits how many of these operations can run simultaneously. The default value of 5 provides a reasonable balance between resource usage and responsiveness.

Large routing nodes handling many simultaneous peer connections might benefit from increasing this value to 10 or 15, while resource-constrained nodes should keep it at the default or even reduce it slightly.
### Understanding Connection Limits: num-restricted-slots

The `num-restricted-slots` configuration deserves special attention because it directly affects your gossip bandwidth requirements. This setting limits inbound connections, but not in the way you might expect.

LND maintains a three-tier system for peer connections. Peers you've ever had channels with enjoy "protected" status and can always connect. Peers currently opening channels with you have "temporary" status. Everyone else (new peers without channels) must compete for the limited "restricted" slots.

When a new peer without channels connects inbound, it consumes one restricted slot. If all slots are full, additional peers are turned away. However, as soon as a restricted peer begins opening a channel, it is upgraded to temporary status, freeing its slot. This creates breathing room for large nodes to form new channel relationships without constantly rejecting connections.

The relationship between restricted slots and rate limiting is straightforward: more allowed connections mean more peers requesting data, requiring more bandwidth. A reasonable rule of thumb is to allocate at least 1 KB/s of rate limit per restricted slot.
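The three-tier admission logic above might be sketched as follows. The names and structure here are hypothetical, purely to make the tiers concrete; LND's actual accounting is more involved.

```python
from dataclasses import dataclass

@dataclass
class Peer:
    had_channel: bool = False      # ever had a channel with us -> "protected"
    opening_channel: bool = False  # currently opening one -> "temporary"

def admit_inbound(peer, restricted_in_use, num_restricted_slots):
    """Return True if an inbound connection should be accepted."""
    # Protected and temporary peers are always admitted.
    if peer.had_channel or peer.opening_channel:
        return True
    # Unknown peers compete for the restricted slots.
    return restricted_in_use < num_restricted_slots
```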
## Calculating Appropriate Values

To set these values correctly, you need to understand your node's position in the network and its typical workload. The fundamental question is: how much gossip traffic does your node actually need to handle?

Start by considering how many peers typically connect to your node. A hobbyist node might have 10-20 connections, while a well-connected routing node could easily exceed 100. Each peer generates gossip traffic when syncing channel updates, announcing new channels, or requesting historical data.

The calculation itself is straightforward. Take your average message size (typically 150-200 bytes for gossip messages), multiply by your peer count and expected message frequency, then add a safety factor for traffic spikes. Here's the formula:

```
rate = avg_msg_size × peer_count × msgs_per_second × safety_factor
```
Let's walk through some real-world examples to make this concrete.

For a small node with 15 peers, you might see 10 messages per peer per second during normal operation. With an average message size of 170 bytes and a safety factor of 1.5, you'd need about 38 KB/s. Rounding up to 50 KB/s provides comfortable headroom.

A medium-sized node with 75 peers faces different challenges. These nodes often relay more traffic and handle more frequent updates. With 15 messages per peer per second, the calculation yields about 287 KB/s. Setting the limit to 300 KB/s ensures smooth operation without waste.

Large routing nodes require the most careful consideration. With 150 or more peers and high message frequency, bandwidth requirements can exceed 1 MB/s. These nodes form the backbone of the Lightning Network and need generous allocations to serve their peers effectively.

Remember that the relationship between restricted slots and rate limiting is direct: each additional slot potentially adds another peer requesting data. Plan for at least 1 KB/s per restricted slot to maintain healthy synchronization.
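The formula and both worked examples can be checked with a few lines of arithmetic:

```python
def gossip_rate_bytes(avg_msg_size, peer_count, msgs_per_sec, safety_factor=1.5):
    """rate = avg_msg_size × peer_count × msgs_per_second × safety_factor"""
    return avg_msg_size * peer_count * msgs_per_sec * safety_factor

# Small node: 15 peers, 10 msgs/peer/s, 170-byte messages, 1.5x safety.
small = gossip_rate_bytes(170, 15, 10)    # 38,250 bytes/s; round up to 50 KB/s
# Medium node: 75 peers, 15 msgs/peer/s.
medium = gossip_rate_bytes(170, 75, 15)   # 286,875 bytes/s; set 300 KB/s
```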
## Network Size and Geography

The Lightning Network's growth directly impacts your gossip bandwidth needs. With over 80,000 public channels at the time of writing, each generating multiple updates daily, the volume of gossip traffic continues to increase. A channel update occurs whenever a node adjusts its fees, changes its routing policy, or goes offline temporarily. During volatile market conditions or fee market adjustments, update frequency can spike dramatically.

Geographic distribution adds another layer of complexity. If your node connects to peers across continents, the inherent network latency affects how quickly you can exchange messages. However, this primarily impacts initial connection establishment rather than ongoing rate limiting.
## Troubleshooting Common Issues

When rate limiting isn't configured properly, the symptoms are often subtle at first but can cascade into serious problems.

The most common issue is slow initial synchronization. New peers attempting to download your channel graph experience long delays between messages. You'll see entries in your logs like "rate limiting gossip replies, responding in 30s" or even longer delays. This happens because the rate limiter has exhausted its tokens and must wait for a refill. The solution is straightforward: increase your msg-rate-bytes setting.

Peer disconnections present a more serious problem. When peers wait too long for gossip responses, they may time out and disconnect. This creates a vicious cycle where peers repeatedly connect, attempt to sync, time out, and reconnect. Look for "peer timeout" errors in your logs. If you see these, you need to increase your rate limit.

Sometimes you'll notice unusually high CPU usage from your LND process. This often indicates that many goroutines are blocked waiting for rate limiter tokens. The rate limiter must constantly calculate delays and manage waiting goroutines. Increasing the rate limit reduces this contention and lowers CPU usage.

To debug these issues, focus on your LND logs rather than high-level commands. Search for "rate limiting" messages to understand how often delays occur and how long they last. Look for patterns in peer disconnections that might correlate with rate limiting delays. The specific commands that matter are:

```bash
# View peer connections and sync state
lncli listpeers | grep -A5 "sync_type"

# Check recent rate limiting events
grep "rate limiting" ~/.lnd/logs/bitcoin/mainnet/lnd.log | tail -20
```

Pay attention to log entries showing "Timestamp range queue full" if you've implemented the queue-based approach; this indicates your system is shedding load due to overwhelming demand.
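To quantify how delays evolve over time, you can extract them from the log lines quoted above. A small sketch, assuming the exact wording shown ("rate limiting gossip replies, responding in 30s"); the format may differ between LND versions:

```python
import re

# Matches delay values from entries like:
#   "rate limiting gossip replies, responding in 30s"
DELAY_RE = re.compile(r"rate limiting gossip replies, responding in (\d+)s")

def rate_limit_delays(log_lines):
    """Return the list of delays (in seconds) found in the given log lines."""
    return [int(m.group(1)) for m in map(DELAY_RE.search, log_lines) if m]
```

Feed it lines from lnd.log; frequent or steadily growing delay values are a sign that `gossip.msg-rate-bytes` should be raised.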
## Best Practices for Configuration

Experience has shown that starting with conservative (higher) rate limits and reducing them if needed works better than starting too low and debugging problems. It's much easier to notice excess bandwidth usage than to diagnose subtle synchronization failures.

Monitor your node's actual bandwidth usage and sync times after making changes. Most operating systems provide tools to track network usage per process. When adjusting settings, make gradual changes of 25-50% rather than dramatic shifts. This helps you understand the impact of each change and find the sweet spot for your setup.

Keep your burst size at least double the largest message size you expect to send. While the default 200 KB is usually sufficient, monitor your logs for any "message too large" errors that would indicate a need to increase this value.

As your node grows and attracts more peers, revisit these settings periodically. What works for 50 peers may cause problems with 150 peers. Regular review prevents gradual degradation as conditions change.
## Configuration Examples

For most users running a personal node, conservative settings provide reliable operation without excessive resource usage:

```
[Application Options]
gossip.msg-rate-bytes=204800
gossip.msg-burst-bytes=409600
gossip.filter-concurrency=5
num-restricted-slots=100
```

Well-connected nodes that route payments regularly need more generous allocations:

```
[Application Options]
gossip.msg-rate-bytes=524288
gossip.msg-burst-bytes=1048576
gossip.filter-concurrency=10
num-restricted-slots=200
```

Large routing nodes at the heart of the network require the most resources:

```
[Application Options]
gossip.msg-rate-bytes=1048576
gossip.msg-burst-bytes=2097152
gossip.filter-concurrency=15
num-restricted-slots=300
```
## Critical Warning About Low Values

Setting `gossip.msg-rate-bytes` below 50 KB/s creates serious operational problems that may not be immediately obvious. Initial synchronization, which typically transfers 10-20 MB of channel graph data, can take hours or fail entirely. Peers appear to connect but remain stuck in a synchronization loop, never completing their initial download.

Your channel graph remains perpetually outdated, causing routing failures as you attempt to use channels that have closed or changed their fee policies. The gossip subsystem appears to work, but operates so slowly that it cannot keep pace with network changes.

During normal operation, a well-connected node processes hundreds of channel updates per minute. Each update is small, but they add up quickly. Factor in occasional bursts during network-wide fee adjustments or major routing node policy changes, and you need substantial headroom above the theoretical minimum.

The absolute minimum viable configuration requires at least enough bandwidth to complete initial sync in under an hour and process ongoing updates without falling behind. This translates to no less than 50 KB/s for even the smallest nodes.
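As a sanity check on that minimum, a rough lower bound on initial sync time is simply graph size divided by gossip rate. This ignores protocol overhead, queuing, and the fact that the rate is shared across all peers, so real-world sync takes considerably longer than the figure it produces:

```python
def min_sync_seconds(graph_bytes, rate_bytes_per_sec):
    """Optimistic lower bound on initial sync time: transfer size / gossip rate."""
    return graph_bytes / rate_bytes_per_sec

# A 15 MB channel graph at the 100 KB/s default: roughly 146 seconds at best.
best_case = min_sync_seconds(15_000_000, 102_400)
# The same graph at 10 KB/s already needs 25 minutes even in this ideal model,
# before accounting for contention from every other connected peer.
starved = min_sync_seconds(15_000_000, 10_000)
```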
