
Restarting Lodestar takes too long #7378

Open
jonathanudd opened this issue Jan 20, 2025 · 5 comments
Labels
meta-bug Issues that identify a bug and require a fix.

Comments

@jonathanudd

Describe the bug

After restarting the Lodestar client it takes around 35 minutes until it starts syncing again.

Jan 20 11:44:17 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:17.001[]                 info: Synced - slot: 10879119 - head: 0x9f28…726d - exec-block: valid(21665484 0x9fbd…) - finalized: 0xf757…66d8:339970 - peers: 99
Jan 20 11:44:29 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:29.001[]                 info: Synced - slot: 10879120 - head: 0x507f…af43 - exec-block: valid(21665485 0x22ea…) - finalized: 0xf757…66d8:339970 - peers: 99
Jan 20 11:44:41 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:41.000[]                 info: Synced - slot: 10879121 - head: (slot -1) 0x507f…af43 - exec-block: valid(21665485 0x22ea…) - finalized: 0xf757…66d8:339970 - peers: 101
Jan 20 11:44:51 juju-9c14a2-1 systemd[1]: Stopping Lodestar beacon node service...
Jan 20 11:44:51 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:51.583[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jan 20 11:44:51 juju-9c14a2-1 lodestar[2535627]: Jan-20 11:44:51.587[]                 info: Forcing process exit
Jan 20 11:44:53 juju-9c14a2-1 systemd[1]: lodestar.service: Deactivated successfully.
Jan 20 11:44:53 juju-9c14a2-1 systemd[1]: Stopped Lodestar beacon node service.
Jan 20 11:44:53 juju-9c14a2-1 systemd[1]: lodestar.service: Consumed 5d 22h 27min 5.671s CPU time.
Jan 20 11:44:55 juju-9c14a2-1 systemd[1]: Started Lodestar beacon node service.
Jan 20 11:45:07 juju-9c14a2-1 lodestar[2829990]: Unpacking Lodestar binary, please wait.....
Jan 20 11:45:10 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:10.909[]                 info: Lodestar network=mainnet, version=v1.25.0/191f2f0, commit=191f2f0ac71d4f660d1ffbd04044a58de114c049
Jan 20 11:45:13 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:13.418[]                 info: Connected to LevelDB database path=/home/lodestar/.datadir/chain-db
Jan 20 11:45:52 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:52.436[]                 info: Initializing beacon from a valid db state slot=10879040, epoch=339970, stateRoot=0xeb835954000f30ba1d200ab4f0cb7ffa8396a022c51e824763908ae9494d13a5, isWithinWeakSubjectivityPeriod=true
Jan 20 11:45:54 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:54.482[chain]            info: Historical state worker started
Jan 20 11:45:54 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:54.497[eth1]             info: Eth1 provider urls=http://192.168.111.227:8551
Jan 20 11:45:54 juju-9c14a2-1 lodestar[2830044]: Jan-20 11:45:54.507[execution]        info: Execution client urls=http://192.168.111.227:8551
Jan 20 12:02:27 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:02:27.974[chain]            info: Removed persisted checkpoint states from the last run count=361, maxEpochsInMemory=3
Jan 20 12:20:26 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:26.215[network]          info: running libp2p instance in worker thread
Jan 20 12:20:27 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:27.737[network]          info: libp2p worker started peer=16Uiu2HAm4u5AvYUVdy3LMjXTVauRCU2mb5mrfEaynK3vFqopV5dW
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.456[network]          info: PeerId 16Uiu2HAm4u5AvYUVdy3LMjXTVauRCU2mb5mrfEaynK3vFqopV5dW, Multiaddrs /ip4/0.0.0.0/tcp/9000
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.463[metrics]          info: Started metrics HTTP server address=http://0.0.0.0:8008
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.680[rest]             info: Started REST API server address=http://127.0.0.1:9596
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.681[rest]             warn: REST API server is exposed, ensure untrusted traffic cannot reach this API
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.684[]                 info: Searching peers - peers: 0 - slot: 10879301 - head: (slot -261) 0xf757…66d8 - exec-block: syncing(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968
Jan 20 12:20:35 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:35.433[network]          info: discv5 worker started peerId=16Uiu2HAm4u5AvYUVdy3LMjXTVauRCU2mb5mrfEaynK3vFqopV5dW, initialENR=enr:-IO4QCdo6dWJS1x0H-GD_o_FT92zibI9KKwBhcvwd6Mi_IVAVxYntIUEjm7tnYsjd9_cLSEsar15Cq3BFAl6jpIOjaYBgmlkgnY0iXNlY3AyNTZrMaECjL4WXLLzcFO-fcsRJ-XRp0Rnm00JIgZjkLqWSSr9ueWDdGNwgiMog3VkcIIjKA, bindAddr4=/ip4/0.0.0.0/udp/9000
Jan 20 12:20:53 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:20:53.526[]                 info: Syncing - ? left - 0.00 slots/s - slot: 10879302 - head: (slot -262) 0xf757…66d8 - exec-block: syncing(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968 - peers: 13
Jan 20 12:21:02 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:02.829[execution]        info: Execution client is synced oldState=ONLINE, newState=SYNCED
Jan 20 12:21:05 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:05.095[]                 info: Syncing - ? left - 0.00 slots/s - slot: 10879303 - head: (slot -263) 0xf757…66d8 - exec-block: syncing(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968 - peers: 10
Jan 20 12:21:17 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:17.648[]                 info: Syncing - ? left - 0.00 slots/s - slot: 10879304 - head: (slot -264) 0xf757…66d8 - exec-block: valid(21665405 0x42cb…) - finalized: 0x55e2…14ec:339968 - peers: 13
Jan 20 12:21:30 juju-9c14a2-1 lodestar[2830044]: Jan-20 12:21:30.034[]                 info: Syncing - 3.7 minutes left - 0.906 slots/s - slot: 10879305 - head: (slot -202) 0xf8f9…40bb - exec-block: valid(21665468 0xf130…) - finalized: 0xf757…66d8:339970 - peers: 20

Expected behavior

I would expect a restart to take at most 1 minute, or at least a lot less than 35 minutes.

Steps to reproduce

Lodestar is running as a systemd service using the following unit file:

[Unit]
Description=Lodestar beacon node service
After=network.target
Documentation=https://github.com/ChainSafe/lodestar

[Service]
WorkingDirectory=/home/lodestar
EnvironmentFile=/etc/default/lodestar
ExecStart=/home/lodestar/lodestar beacon --metrics --metrics.port=8008 --metrics.address=0.0.0.0 --dataDir=/home/lodestar/.datadir --rest --rest.port=9596 --rest.cors=* --rest.address=0.0.0.0 --execution.urls=http://192.168.211.70:8551 --jwtSecret=/home/lodestar/jwtsecret --checkpointSyncUrl=https://sync-mainnet.beaconcha.in/ --network=mainnet --rest.namespace=beacon,lightclient,debug,node --chain.archiveStateEpochFrequency=1
User=lodestar
Group=lodestar
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

Additional context

I think this has been the case for the last few versions.

Operating system

Linux

Lodestar version or commit hash

v1.24.0/ca315d4

@jonathanudd jonathanudd added the meta-bug Issues that identify a bug and require a fix. label Jan 20, 2025
@nflaig
Member

nflaig commented Jan 20, 2025

The first issue I see is that the node is not gracefully shut down; the forced exit happens right after the first shutdown signal is sent:

Jan-20 11:44:51.583[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jan-20 11:44:51.587[]                 info: Forcing process exit

Shutting down the node gracefully ensures it properly archives the last finalized state, which should avoid having to sync 261 slots after the restart.

The second reason seems to be that it had to remove 361 persisted checkpoint states, likely related to --chain.archiveStateEpochFrequency=1:

Jan-20 12:02:27.974[chain]            info: Removed persisted checkpoint states from the last run count=361, maxEpochsInMemory=3

Usually it only takes 2-3 minutes for the node to be back at head after a restart and to have a good peer count.
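As a quick way to check the shutdown behavior outside of systemctl stop, one can send a single SIGINT to the beacon process by hand and watch the logs; a graceful stop should archive state rather than print "Forcing process exit". A minimal sketch, assuming the unit is named lodestar as in the service file above:

# Send one SIGINT to the unit's main process and follow its shutdown logs.
kill -INT "$(systemctl show -p MainPID --value lodestar)"
journalctl -u lodestar -f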

@jonathanudd
Author

Thanks for the hints, I should have noticed the "Forcing process exit" log line and been suspicious of it.

--chain.archiveStateEpochFrequency=1 is requested by the user of the node, so I might be stuck with that.

As for the forced process exit, I have been trying to modify the service file, but I cannot get the node to exit gracefully.

This is the service file I have tested with now:

[Unit]
Description=Lodestar beacon node service
After=network.target
Documentation=https://github.com/ChainSafe/lodestar

[Service]
WorkingDirectory=/home/lodestar
EnvironmentFile=/etc/default/lodestar
ExecStart=/home/lodestar/lodestar beacon $BEACON_CLI_ARGS
ExecStop=/bin/kill -s SIGINT $MAINPID # Should ensure that SIGINT is sent to Lodestar when stopping the service
KillSignal=SIGINT # Same as above
TimeoutStopSec=60s # Should ensure it's not forcefully exited too quickly; this should not matter though, since the forceful-exit log is printed in the same second as the stopping log
SendSIGKILL=no # Should prevent systemd from sending a kill signal
User=lodestar
Group=lodestar
Restart=always
RestartSec=120

[Install]
WantedBy=multi-user.target

And if I run sudo systemctl stop lodestar with this file, I still see:

Jan 21 13:49:11 juju-a10726-0 lodestar[6778]: Jan-21 13:49:11.146[]                 info: Stopping gracefully, use Ctrl+C again to force process exit
Jan 21 13:49:11 juju-a10726-0 lodestar[6778]: Jan-21 13:49:11.150[]                 info: Forcing process exit

Do you have any ideas why it behaves like this?

@nflaig
Member

nflaig commented Jan 21, 2025

Just looking at a service file we have, all that is set there is TimeoutSec=1min, but that should be covered by TimeoutStopSec.

@nflaig
Member

nflaig commented Jan 21, 2025

Just looking at a service file we have, all that is set there is TimeoutSec=1min, but that should be covered by TimeoutStopSec.

And I just noticed the nodes we run via systemd are also force exited...

@nflaig
Member

nflaig commented Jan 21, 2025

OK, I found the issue on our end, and I think it's a similar problem for you: the node process is not executed directly but started from a bash script, so systemd thinks the process has stopped when it hasn't.

ExecStart=/home/lodestar/lodestar beacon $BEACON_CLI_ARGS

Try changing this to:

ExecStart=node --max-old-space-size=8192 /home/lodestar/packages/cli/bin/lodestar.js beacon $BEACON_CLI_ARGS

wemeetagain pushed a commit that referenced this issue Jan 23, 2025
**Motivation**

Users might use the lodestar script inside Docker or with a process manager like systemd (#7378), and in those cases we need to make sure Lodestar runs as the primary process and not the shell script, as otherwise exit signals are not properly forwarded, or in the case of systemd, it does not wait for Lodestar to shut down.

**Description**

Update the lodestar script to properly forward exit signals by executing the main node process via `exec`. This shouldn't change anything with regard to how we use the script during development.
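For context, the script referenced here is a shell wrapper around node, so without exec the process manager tracks and signals the shell rather than the Lodestar process. A minimal sketch of an exec-based wrapper, assuming the lodestar.js path and heap size from the suggested ExecStart above (illustrative, not the actual distributed script):

#!/usr/bin/env bash
# exec replaces the shell with the node process, so the process manager's
# main PID is Lodestar itself and SIGINT/SIGTERM reach it directly.
exec node --max-old-space-size=8192 /home/lodestar/packages/cli/bin/lodestar.js "$@"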