Description
In trying to reproduce #1788, I put together https://github.com/oxidecomputer/crucible_jepsen.
So far I have not triggered that issue, but I have been able to reliably trigger an availability issue after restarting each downstairs node.
The jepsen test is doing the following:
On each ds node:
# once during setup
rm -rf /opt/crucible/data
/opt/crucible/crucible-downstairs create -u $random-uuid -d /opt/crucible/data --encrypted
# via smf
/opt/crucible/crucible-downstairs run -p 3810 -d /opt/crucible/data
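Put together, the one-time setup on a ds node amounts to the sketch below (using uuidgen for the random UUID is my shorthand here, not necessarily what the harness itself does):
#!/bin/sh
# Sketch: one-time region setup on a downstairs node.
set -e
# Start from a clean region directory.
rm -rf /opt/crucible/data
# Create an encrypted region with a fresh, random UUID (uuidgen assumed available).
/opt/crucible/crucible-downstairs create -u "$(uuidgen)" -d /opt/crucible/data --encrypted
# smf then runs (and restarts after each reboot):
#   /opt/crucible/crucible-downstairs run -p 3810 -d /opt/crucible/data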
On the upstairs:
# https://github.com/oxidecomputer/crucible/blob/rest-server-for-jepsen/rest_server/src/main.rs
/opt/crucible/crucible-rest-server -g 1 --key rgcfQ6nDFT5ZuMeqrUGBW2txZ9tZh/zQUc9ZnhnwDH4= --listen 0.0.0.0:9999 --target=172.21.252.170:3810 --target=172.21.252.173:3810 --target=172.21.252.174:3810
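Before kicking off a run, a quick sanity check that the upstairs can reach all three downstairs targets looks something like this (assuming nc is available on the upstairs host):
# Check TCP reachability from the upstairs to each downstairs target.
for ds in 172.21.252.170 172.21.252.173 172.21.252.174; do
    nc -z -w 2 "$ds" 3810 && echo "$ds:3810 ok" || echo "$ds:3810 unreachable"
done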
Then, during the test, ds nodes are randomly rebooted using:
mdb -kwe 'clock/W -1'
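In effect, each nemesis action amounts to something like the sketch below (the ssh invocation and the 5 second delay are assumptions based on the "Rebooting in 5 seconds" messages in jepsen.log; the actual trigger comes from the jepsen nemesis):
# Hypothetical stand-alone reproduction of a single nemesis reboot.
NODE=172.21.252.173
ssh root@"$NODE" "sleep 5; mdb -kwe 'clock/W -1'"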
For the test:
172.21.252.194: upstairs / crucible-rest-server
172.21.252.170: ds
172.21.252.173: ds
172.21.252.174: ds
Most recent test:
$ lein run test --nodes-file my.nodes --concurrency 100 --time-limit 1800
# wait 30 minutes
The issue is easy to see in the graphs jepsen produces: the left edge of each gray section at the top corresponds to the reboot being triggered. Jepsen isn't directly controlling when the VM comes back up, so the width of the gray section is approximate; the actual outage lasts however long the node takes to reboot plus the time for smf to restart the downstairs.
$ cat crucible_jepsen_20251021T134348/jepsen.log | egrep -e Relat -e 'Rebooting' | sed "s/.clojure.*ble://"
2025-10-21 13:44:01,364{GMT} INFO [jepsen test runner] jepsen.core: Relative time begins now
2025-10-21 13:46:31,607{GMT} INFO 172.21.252.173 Rebooting in 5 seconds
2025-10-21 13:49:36,963{GMT} INFO 172.21.252.170 Rebooting in 5 seconds
2025-10-21 13:52:42,335{GMT} INFO 172.21.252.173 Rebooting in 5 seconds
2025-10-21 13:55:47,718{GMT} INFO 172.21.252.174 Rebooting in 5 seconds
2025-10-21 13:58:53,090{GMT} INFO 172.21.252.170 Rebooting in 5 seconds
2025-10-21 14:01:58,427{GMT} INFO 172.21.252.173 Rebooting in 5 seconds
The first failed writes were at 13:51, after the second node reboot:
$ cat crucible_jepsen_20251021T134348/jepsen.log | grep :fail | grep :write | head
2025-10-21 13:51:15,370{GMT} INFO [jepsen worker 26] jepsen.util: 26 :fail :write [13 8841] 500
2025-10-21 13:51:30,380{GMT} INFO [jepsen worker 45] jepsen.util: 45 :fail :write [22 4690] 500
2025-10-21 13:51:45,390{GMT} INFO [jepsen worker 10] jepsen.util: 10 :fail :write [5 3185] 500
2025-10-21 13:52:00,401{GMT} INFO [jepsen worker 37] jepsen.util: 37 :fail :write [18 8880] 500
2025-10-21 13:52:30,423{GMT} INFO [jepsen worker 86] jepsen.util: 86 :fail :write [43 9875] 500
2025-10-21 13:53:00,445{GMT} INFO [jepsen worker 77] jepsen.util: 77 :fail :write [38 2919] 500
Reads began failing at 13:56, after the third node was rebooted (the fourth reboot overall):
$ cat crucible_jepsen_20251021T134348/jepsen.log | grep :fail | grep :read | head
2025-10-21 13:56:07,813{GMT} INFO [jepsen worker 47] jepsen.util: 47 :fail :read [23 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 17] jepsen.util: 17 :fail :read [8 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 74] jepsen.util: 74 :fail :read [37 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 69] jepsen.util: 69 :fail :read [34 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 64] jepsen.util: 64 :fail :read [32 nil] 500
2025-10-21 13:56:07,815{GMT} INFO [jepsen worker 34] jepsen.util: 34 :fail :read [17 nil] 500
2025-10-21 13:56:07,815{GMT} INFO [jepsen worker 50] jepsen.util: 50 :fail :read [25 nil] 500
2025-10-21 13:56:07,815{GMT} INFO [jepsen worker 33] jepsen.util: 33 :fail :read [16 nil] 500
2025-10-21 13:56:07,816{GMT} INFO [jepsen worker 65] jepsen.util: 65 :fail :read [32 nil] 500
2025-10-21 13:56:07,816{GMT} INFO [jepsen worker 49] jepsen.util: 49 :fail :read [24 nil] 500
$ grep 'All three' crucible_jepsen_20251021T134348/172.21.252.194/opt/crucible/crest.log |head -n 1
{"msg":"All three downstairs require repair","v":0,"name":"crucible","level":40,"time":"2025-10-21T13:56:14.017850081Z","hostname":"jepsen-n4","pid":3951,"":"downstairs","":"downstairs","session_id":"0adf013c-2b49-46cd-8285-ffd06df5b927"}
My understanding is that I am leaving plenty of time between the node reboots (roughly three minutes, per the log above), and the upstairs should have enough time to repair the rebooted downstairs before the next reboot.
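For reference, the spacing between reboots can be pulled straight from the log above; a rough sketch, assuming the timestamp format shown:
# Print the gap, in seconds, between consecutive "Rebooting" events (~185s here).
grep 'Rebooting in' crucible_jepsen_20251021T134348/jepsen.log | awk '{
    split($2, t, /[:,]/); now = t[1]*3600 + t[2]*60 + t[3]
    if (prev) printf "%ds\n", now - prev
    prev = now
}'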