Description
In trying to reproduce #1788, I put together https://github.com/oxidecomputer/crucible_jepsen.
So far I have not triggered that issue, but I have been able to reliably trigger an availability issue after restarting each downstairs node.
The jepsen test is doing the following:
On each ds node:
# once during setup
rm -rf /opt/crucible/data
/opt/crucible/crucible-downstairs create -u $random-uuid -d /opt/crucible/data --encrypted
# via smf
/opt/crucible/crucible-downstairs run -p 3810 -d /opt/crucible/data
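Put together, the one-time setup on a ds node amounts to the sketch below (using uuidgen for the random UUID is my shorthand here, not necessarily what the harness itself does):
#!/bin/sh
# Sketch: one-time region setup on a downstairs node.
set -e
# Start from a clean region directory.
rm -rf /opt/crucible/data
# Create an encrypted region with a fresh, random UUID (uuidgen assumed available).
/opt/crucible/crucible-downstairs create -u "$(uuidgen)" -d /opt/crucible/data --encrypted
# smf then runs (and restarts after each reboot):
#   /opt/crucible/crucible-downstairs run -p 3810 -d /opt/crucible/data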
On the upstairs:
# https://github.com/oxidecomputer/crucible/blob/rest-server-for-jepsen/rest_server/src/main.rs
/opt/crucible/crucible-rest-server -g 1 --key rgcfQ6nDFT5ZuMeqrUGBW2txZ9tZh/zQUc9ZnhnwDH4= --listen 0.0.0.0:9999 --target=172.21.252.170:3810 --target=172.21.252.173:3810 --target=172.21.252.174:3810
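Before kicking off a run, a quick sanity check that the upstairs can reach all three downstairs targets looks something like this (assuming nc is available on the upstairs host):
# Check TCP reachability from the upstairs to each downstairs target.
for ds in 172.21.252.170 172.21.252.173 172.21.252.174; do
    nc -z -w 2 "$ds" 3810 && echo "$ds:3810 ok" || echo "$ds:3810 unreachable"
done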
Then, during the test, ds nodes are randomly rebooted using:
mdb -kwe 'clock/W -1'
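In effect, each nemesis action amounts to something like the sketch below (the ssh invocation and the 5 second delay are assumptions based on the "Rebooting in 5 seconds" messages in jepsen.log; the actual trigger comes from the jepsen nemesis):
# Hypothetical stand-alone reproduction of a single nemesis reboot.
NODE=172.21.252.173
ssh root@"$NODE" "sleep 5; mdb -kwe 'clock/W -1'"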
For the test:
172.21.252.194: upstairs / crucible-rest-server
172.21.252.170: ds
172.21.252.173: ds
172.21.252.174: ds
Most recent test:
$ lein run test --nodes-file my.nodes --concurrency 100 --time-limit 1800
# wait 30 minutes
The issue is easy to see in the graphs jepsen produces: the left edge of each gray section at the top corresponds to the reboot being triggered. Jepsen isn't directly controlling when the VM comes back up, so the width of the gray section is approximate; the actual outage lasts however long the node takes to reboot plus the time for smf to restart the downstairs.
$ cat crucible_jepsen_20251021T134348/jepsen.log | egrep -e Relat -e 'Rebooting' | sed "s/.clojure.*ble://"
2025-10-21 13:44:01,364{GMT} INFO [jepsen test runner] jepsen.core: Relative time begins now
2025-10-21 13:46:31,607{GMT} INFO 172.21.252.173 Rebooting in 5 seconds
2025-10-21 13:49:36,963{GMT} INFO 172.21.252.170 Rebooting in 5 seconds
2025-10-21 13:52:42,335{GMT} INFO 172.21.252.173 Rebooting in 5 seconds
2025-10-21 13:55:47,718{GMT} INFO 172.21.252.174 Rebooting in 5 seconds
2025-10-21 13:58:53,090{GMT} INFO 172.21.252.170 Rebooting in 5 seconds
2025-10-21 14:01:58,427{GMT} INFO 172.21.252.173 Rebooting in 5 seconds
The first failed writes were at 13:51, after the second node reboot:
$ cat crucible_jepsen_20251021T134348/jepsen.log | grep :fail | grep :write | head
2025-10-21 13:51:15,370{GMT} INFO [jepsen worker 26] jepsen.util: 26 :fail :write [13 8841] 500
2025-10-21 13:51:30,380{GMT} INFO [jepsen worker 45] jepsen.util: 45 :fail :write [22 4690] 500
2025-10-21 13:51:45,390{GMT} INFO [jepsen worker 10] jepsen.util: 10 :fail :write [5 3185] 500
2025-10-21 13:52:00,401{GMT} INFO [jepsen worker 37] jepsen.util: 37 :fail :write [18 8880] 500
2025-10-21 13:52:30,423{GMT} INFO [jepsen worker 86] jepsen.util: 86 :fail :write [43 9875] 500
2025-10-21 13:53:00,445{GMT} INFO [jepsen worker 77] jepsen.util: 77 :fail :write [38 2919] 500
Reads began failing at 13:56, after the third node was rebooted (the fourth reboot overall):
$ cat crucible_jepsen_20251021T134348/jepsen.log | grep :fail | grep :read | head
2025-10-21 13:56:07,813{GMT} INFO [jepsen worker 47] jepsen.util: 47 :fail :read [23 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 17] jepsen.util: 17 :fail :read [8 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 74] jepsen.util: 74 :fail :read [37 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 69] jepsen.util: 69 :fail :read [34 nil] 500
2025-10-21 13:56:07,814{GMT} INFO [jepsen worker 64] jepsen.util: 64 :fail :read [32 nil] 500
2025-10-21 13:56:07,815{GMT} INFO [jepsen worker 34] jepsen.util: 34 :fail :read [17 nil] 500
2025-10-21 13:56:07,815{GMT} INFO [jepsen worker 50] jepsen.util: 50 :fail :read [25 nil] 500
2025-10-21 13:56:07,815{GMT} INFO [jepsen worker 33] jepsen.util: 33 :fail :read [16 nil] 500
2025-10-21 13:56:07,816{GMT} INFO [jepsen worker 65] jepsen.util: 65 :fail :read [32 nil] 500
2025-10-21 13:56:07,816{GMT} INFO [jepsen worker 49] jepsen.util: 49 :fail :read [24 nil] 500
$ grep 'All three' crucible_jepsen_20251021T134348/172.21.252.194/opt/crucible/crest.log |head -n 1
{"msg":"All three downstairs require repair","v":0,"name":"crucible","level":40,"time":"2025-10-21T13:56:14.017850081Z","hostname":"jepsen-n4","pid":3951,"":"downstairs","":"downstairs","session_id":"0adf013c-2b49-46cd-8285-ffd06df5b927"}
My understanding is that I am leaving plenty of time between the node reboots (roughly three minutes, per the log above), and the upstairs should have enough time to repair the rebooted downstairs before the next reboot.
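For reference, the spacing between reboots can be pulled straight from the log above; a rough sketch, assuming the timestamp format shown:
# Print the gap, in seconds, between consecutive "Rebooting" events (~185s here).
grep 'Rebooting in' crucible_jepsen_20251021T134348/jepsen.log | awk '{
    split($2, t, /[:,]/); now = t[1]*3600 + t[2]*60 + t[3]
    if (prev) printf "%ds\n", now - prev
    prev = now
}'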