
Restarting Lodestar takes too long #7378

Open
jonathanudd opened this issue Jan 20, 2025 · 5 comments
Labels
meta-bug Issues that identify a bug and require a fix.

Comments

@jonathanudd

Describe the bug

After restarting the Lodestar client it takes around 35 minutes until it starts syncing again.

Jan 20 11:44:17 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:17.001[]                 info: Synced - slot: 10879119 - head: 0x9f28…726d - exec-block: valid(21665484 0x9fbd…) - finalized: 0xf757…66d8:339970 - peers: 99
Jan 20 11:44:29 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:29.001[]                 info: Synced - slot: 10879120 - head: 0x507f…af43 - exec-block: valid(21665485 0x22ea…) - finalized: 0xf757…66d8:339970 - peers: 99
Jan 20 11:44:41 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:41.000[]                 info: Synced - slot: 10879121 - head: (slot -1) 0x507f…af43 - exec-block: valid(21665485 0x22ea…) - finalized: 0xf757…66d8:339970 - peers: 101
Jan 20 11:44:51 juju-9c14a2-1 systemd[1]: Stopping Lodestar beacon node service...
Jan 20 11:44:51 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:51.583[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jan 20 11:44:51 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:51.587[]                 info: Forcing process exit
Jan 20 11:44:53 juju-9c14a2-1 systemd[1]: lodestar.service: Deactivated successfully.
Jan 20 11:44:53 juju-9c14a2-1 systemd[1]: Stopped Lodestar beacon node service.
Jan 20 11:44:53 juju-9c14a2-1 systemd[1]: lodestar.service: Consumed 5d 22h 27min 5.671s CPU time.
Jan 20 11:44:55 juju-9c14a2-1 systemd[1]: Started Lodestar beacon node service.
Jan 20 11:45:07 juju-9c14a2-1 lodestar[2829990]: Unpacking Lodestar binary, please wait.....
Jan 20 11:45:10 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:10.909[]                 info: Lodestar network=mainnet, version=v1.25.0/191f2f0, commit=191f2f0ac71d4f660d1ffbd04044a58de114c049
Jan 20 11:45:13 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:13.418[]                 info: Connected to LevelDB database path=/home/lodestar/.datadir/chain-db
Jan 20 11:45:52 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:52.436[]                 info: Initializing beacon from a valid db state slot=10879040, epoch=339970, stateRoot=0xeb835954000f30ba1d200ab4f0cb7ffa8396a022c51e824763908ae9494d13a5, isWithinWeakSubjectivityPeriod=true
Jan 20 11:45:54 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:54.482[chain]            info: Historical state worker started
Jan 20 11:45:54 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:54.497[eth1]             info: Eth1 provider urls=http://192.168.111.227:8551
Jan 20 11:45:54 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:54.507[execution]        info: Execution client urls=http://192.168.111.227:8551
Jan 20 12:02:27 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:02:27.974[chain]            info: Removed persisted checkpoint states from the last run count=361, maxEpochsInMemory=3
Jan 20 12:20:26 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:26.215[network]          info: running libp2p instance in worker thread
Jan 20 12:20:27 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:27.737[network]          info: libp2p worker started peer=16Uiu2HAm4u5AvYUVdy3LMjXTVauRCU2mb5mrfEaynK3vFqopV5dW
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.456[network]          info: PeerId 16Uiu2HAm4u5AvYUVdy3LMjXTVauRCU2mb5mrfEaynK3vFqopV5dW, Multiaddrs /ip4/0.0.0.0/tcp/9000
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.463[metrics]          info: Started metrics HTTP server address=http://0.0.0.0:8008
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.680[rest]             info: Started REST API server address=http://127.0.0.1:9596
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.681[rest]             warn: REST API server is exposed, ensure untrusted traffic cannot reach this API
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.684[]                 info: Searching peers - peers: 0 - slot: 10879301 - head: (slot -261) 0xf757…66d8 - exec-block: syncing(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.433[network]          info: discv5 worker started peerId=16Uiu2HAm4u5AvYUVdy3LMjXTVauRCU2mb5mrfEaynK3vFqopV5dW, initialENR=enr:-IO4QCdo6dWJS1x0H-GD_o_FT92zibI9KKwBhcvwd6Mi_IVAVxYntIUEjm7tnYsjd9_cLSEsar15Cq3BFAl6jpIOjaYBgmlkgnY0iXNlY3AyNTZrMaECjL4WXLLzcFO-fcsRJ-XRp0Rnm00JIgZjkLqWSSr9ueWDdGNwgiMog3VkcIIjKA, bindAddr4=/ip4/0.0.0.0/udp/9000
Jan 20 12:20:53 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:53.526[]                 info: Syncing - ? left - 0.00 slots/s - slot: 10879302 - head: (slot -262) 0xf757…66d8 - exec-block: syncing(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968 - peers: 13
Jan 20 12:21:02 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:02.829[execution]        info: Execution client is synced oldState=ONLINE, newState=SYNCED
Jan 20 12:21:05 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:05.095[]                 info: Syncing - ? left - 0.00 slots/s - slot: 10879303 - head: (slot -263) 0xf757…66d8 - exec-block: syncing(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968 - peers: 10
Jan 20 12:21:17 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:17.648[]                 info: Syncing - ? left - 0.00 slots/s - slot: 10879304 - head: (slot -264) 0xf757…66d8 - exec-block: valid(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968 - peers: 13
Jan 20 12:21:30 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:30.034[]                 info: Syncing - 3.7 minutes left - 0.906 slots/s - slot: 10879305 - head: (slot -202) 0xf8f9…40bb - exec-block: valid(21665468 0xf130…) - finalized: 0xf757…66d8:339970 - peers: 20

Expected behavior

I would expect a restart to take at most 1 minute, or at least a lot less than 35 minutes.

Steps to reproduce

Lodestar is running as a systemd service using the following unit file:

[Unit]
Description=Lodestar beacon node service
After=network.target
Documentation=https://github.com/ChainSafe/lodestar

[Service]
WorkingDirectory=/home/lodestar
EnvironmentFile=/etc/default/lodestar
ExecStart=/home/lodestar/lodestar beacon --metrics --metrics.port=8008 --metrics.address=0.0.0.0 --dataDir=/home/lodestar/.datadir --rest --rest.port=9596 --rest.cors=* --rest.address=0.0.0.0 --execution.urls=http://192.168.211.70:8551 --jwtSecret=/home/lodestar/jwtsecret --checkpointSyncUrl=https://sync-mainnet.beaconcha.in/ --network=mainnet --rest.namespace=beacon,lightclient,debug,node --chain.archiveStateEpochFrequency=1
User=lodestar
Group=lodestar
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

Additional context

I think this has been the case for the last few versions.

Operating system

Linux

Lodestar version or commit hash

v1.24.0/ca315d4

@jonathanudd jonathanudd added the meta-bug Issues that identify a bug and require a fix. label Jan 20, 2025
@nflaig
Member

nflaig commented Jan 20, 2025

The first issue I see is that the node is not gracefully shut down; the forced exit happens right after the first shutdown signal is sent:

Jan-20 11:44:51.583[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jan-20 11:44:51.587[]                 info: Forcing process exit

Shutting down the node gracefully ensures it properly archives the last finalized state, which should avoid having to sync 261 slots after the restart.

The second reason seems to be that it had to remove 361 persisted checkpoint states, likely related to --chain.archiveStateEpochFrequency=1:

Jan-20 12:02:27.974[chain]            info: Removed persisted checkpoint states from the last run count=361, maxEpochsInMemory=3

Usually it only takes 2-3 minutes for the node to be back at head after a restart and to have a good peer count.
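As a quick way to check the shutdown behavior outside of systemctl stop, one can send a single SIGINT to the beacon process by hand and watch the logs; a graceful stop should archive state rather than print "Forcing process exit". A minimal sketch, assuming the unit is named lodestar as in the service file above:

# Send one SIGINT to the unit's main process and follow its shutdown logs.
kill -INT "$(systemctl show -p MainPID --value lodestar)"
journalctl -u lodestar -f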

@jonathanudd
Author

Thanks for the hints, I should have noticed the "Forcing process exit" log line and been suspicious of it.

--chain.archiveStateEpochFrequency=1 is requested by the user of the node, so I might be stuck with that.

As for the forced process exit, I have been trying to modify the service file, but I cannot get the node to exit gracefully.

This is the service file I have tested with now:

[Unit]
Description=Lodestar beacon node service
After=network.target
Documentation=https://github.com/ChainSafe/lodestar

[Service]
WorkingDirectory=/home/lodestar
EnvironmentFile=/etc/default/lodestar
ExecStart=/home/lodestar/lodestar beacon $BEACON_CLI_ARGS
ExecStop=/bin/kill -s SIGINT $MAINPID # Should ensure that SIGINT is sent to Lodestar when stopping the service
KillSignal=SIGINT # Same as above
TimeoutStopSec=60s # Should ensure it's not forcefully exited too quickly; this should not matter though, since the forceful-exit log is printed in the same second as the stopping log
SendSIGKILL=no # Should prevent systemd from sending a kill signal
User=lodestar
Group=lodestar
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

And if I run sudo systemctl stop lodestar with this file, I still see:

Jan 21 13:49:11 juju-a10726-0 lodestar[6778]: Jan-21 13:49:11.146[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jan 21 13:49:11 juju-a10726-0 lodestar[6778]: Jan-21 13:49:11.150[]                 info: Forcing process exit

Do you have any ideas why it behaves like this?

@nflaig
Member

nflaig commented Jan 21, 2025

Just looking at a service file we have, all that is set there is TimeoutSec=1min, but that should be covered by TimeoutStopSec.

@nflaig
Member

nflaig commented Jan 21, 2025

Just looking at a service file we have, all that is set there is TimeoutSec=1min, but that should be covered by TimeoutStopSec.

And I just noticed the nodes we run via systemd are also force exited...

@nflaig
Member

nflaig commented Jan 21, 2025

OK, I found the issue on our end, and I think it's a similar problem for you: the node process is not executed directly but started from a bash script, so systemd thinks the process has stopped when it hasn't.

ExecStart=/home/lodestar/lodestar beacon $BEACON_CLI_ARGS

Try changing this to:

ExecStart=node --max-old-space-size=8192 /home/lodestar/packages/cli/bin/lodestar.js beacon $BEACON_CLI_ARGS

wemeetagain pushed a commit that referenced this issue Jan 23, 2025
**Motivation**

Users might use the lodestar script inside Docker or with a process manager like systemd (#7378), and in those cases we need to make sure Lodestar runs as the primary process and not the shell script, as otherwise exit signals are not properly forwarded, or in the case of systemd, it does not wait for Lodestar to shut down.

**Description**

Update the lodestar script to properly forward exit signals by executing the main node process via `exec`. This shouldn't change anything with regard to how we use the script during development.
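For context, the script referenced here is a shell wrapper around node, so without exec the process manager tracks and signals the shell rather than the Lodestar process. A minimal sketch of an exec-based wrapper, assuming the lodestar.js path and heap size from the suggested ExecStart above (illustrative, not the actual distributed script):

#!/usr/bin/env bash
# exec replaces the shell with the node process, so the process manager's
# main PID is Lodestar itself and SIGINT/SIGTERM reach it directly.
exec node --max-old-space-size=8192 /home/lodestar/packages/cli/bin/lodestar.js "$@"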