Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fma: FMA for MT/64-bit Cannon #123

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

fma: FMA for MT/64-bit Cannon #123

wants to merge 16 commits into from

Conversation

pauldowman
Copy link
Contributor

This is the failure modes analysis for multi-threaded & 64-bit Cannon.

@pauldowman pauldowman force-pushed the pauldowman/cannon-fma branch from 117c0bd to eded6d0 Compare October 11, 2024 17:08
@pauldowman
Copy link
Contributor Author

@Inphi @mbaxter I did rebase and force push this but I won't do that again so that we can share the branch if you want to make edits in here.

@Inphi Inphi force-pushed the pauldowman/cannon-fma branch from 417ef09 to db0efd3 Compare November 26, 2024 21:35
@Inphi Inphi marked this pull request as ready for review December 5, 2024 19:59
@Inphi
Copy link
Contributor

Inphi commented Dec 5, 2024

@pauldowman could you fill in the "Initial reviewers" and "Need approval from" fields in the table.

@Inphi Inphi changed the title fma: Draft FMA for MT/64-bit Cannon fma: FMA for MT/64-bit Cannon Dec 10, 2024

An audit of the multithreaded VM is not required per the [OP Labs Audit Framework](https://gov.optimism.io/t/op-labs-audit-framework-when-to-get-external-security-review-and-how-to-prepare-for-it/6864).
A failure in the new Cannon VM and thus dispute games is mitigated by an airgap in finalized withdrawals. Furthermore, there's a window whereby the Security Council can override the results of invalid games.
Nonetheless, we will be auditing the new VM.
Copy link
Contributor

@BlocksOnAChain BlocksOnAChain Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inphi - we should add audit report links here, once we have the final report from our external audits

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BlocksOnAChain Out of curiosity when you will have the results for the audit of the VM? Did the PR will be merged before the results here?

Copy link
Contributor

@BlocksOnAChain BlocksOnAChain Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ethnical - We will have the results in January, likely mid January since we will be on a collective pause + auditors need time to generate the reports.
We are still doing reviews with the Auditors, so all of this brings us to January as our target date.
I started adding the findings from the audit to this label, feel free to review what we have, for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Inphi Reminder to put the link audit here when we have them :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll follow up and add the audit report to the monorepo and reference them here.

- **Recovery Path(s)**: Reschedule upgrade, possibly releasing new binary though without immediate urgency.


## Action Items
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mds1 - any immidiate FMA action items that we should add to the list, after your initial pass for the MT cannon FMA?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** We periodically use Cannon to execute the op-program using inputs from op-mainnet and op-sepolia. This periodic cannon runner (vm-runner) runs on oplabs infrastructure.
Furthermore, we [sanitize](https://github.com/ethereum-optimism/optimism/blob/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/Makefile#L51) the op-program [in CI](https://github.com/ethereum-optimism/optimism/blob/eabf70498f68f321f5de003f1d443d3e3c8100b8/.circleci/config.yml#L928C1-L929C111) for unsupported opcodes.
- **Detection:** Alerting is setup to notify the proofs team whenever the vm-runner fails to complete a cannon run. And the CI check provides an early warning against unsupported opcodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, it seems that the only early detection is coming from the CI.
This is making me wondering about the potential hole there:

  1. We maintain the CI for a long period of time.
  2. As no one introduce new opcodes this CI test will never be generating fail.
  3. After a certain time, we decide to clean the CI tests because there a taking time to run (and this test is not matching often).
  4. Then a invalid Opcode is introduced.

For me, seems bit light to only rely on the CI here.
Happy to discuss about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline. Added clarification on the vm-runner being used to detect issues outside of CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should link to the alerting to verify this setup. Also, is this a page to oncall because a failure means unsupported opcode (which is good since there's a clear DRI), or is failure just a slack alert that could get missed since it has no DRI?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a reference to the vm-runner alerting rules here. I'll also follow up to adjust the alerting priorities of critical vm-runner issues.

@Ethnical
Copy link
Contributor

Ethnical commented Dec 17, 2024

@Inphi For the failures that require monitoring like op-dispute-mon for the detection.
I would like to know if a scenario that exhaust or DoS op-program can also DoS the op-dispute-mon.
I am looking for case where we can crash the CHALLENGER and also the monitoring, thus making the game resolving incorrectly but impossible to detect and thus allowing an malicious actor to exploit the bridge.
In other words is there is part of the op-program code shared with op-dispute-mon.
I am thinking about case like Insufficient memory in the program or the Unimplemented syscalls or opcodes needed by op-program.

- **Description:** This could theoretically occur when the op-program runs out of memory in a way that lets the attacker reuse code to subvert execution.
- **Risk Assessment:** High severity, low likelihood.
- Low likelihood: This requires an attacker to craft inputs that not only induce high memory usage, but also corrupt or spray the heap in a way that either produces invalid fault proofs or prevents valid fault proofs from being generated.
- **Mitigations:** As with [Insufficient memory in the program](#insufficient-memory-in-the-program), the 64-bit address space effectively prevents this from occurring. Furthermore, the Go runtime checks memory allocations against heap corruption. However, such memory protections may not hold due to bugs in the Go runtime.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Maybe add that not only vulnerability inside Go can cause this behavior but also the usage of unsafe in Go can lead to unexpected behavior.
PS: We should also ensure that no unsafe package is imported and used incorrectly in the current codebase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a note that unsafe usage can cause unexpected behavior. This is grossly unlikely to happen given existing testing.

| | |
|--------|--------------|
| Author | Paul Dowman, Mofi Taiwo |
| Created at | *2024-10-09* |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this was originally written in october, is this still up to date, i.e. has the design changed at all? One notable difference is we now know we'll use OPCM, which is something we should add to the generic contract failure modes

- **Risk Assessment:** High severity, Low likelihood.
- **Mitigations:** Comprehensive testing. This includes full test coverage of every supported MIPS instruction, threading semantics, and verifying op-program execution on live chain data.
This includes [unit and fuzz](https://github.com/ethereum-optimism/optimism/tree/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/mipsevm/tests) testing of MIPS instructions and Linux syscalls. It also includes [testing](https://github.com/ethereum-optimism/optimism/blob/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/mipsevm/multithreaded/state_test.go) of multithreaded specific functionality.
- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's link to dispute mon here


### Unimplemented syscalls or opcodes needed by `op-program`

- **Description:** We only aim to implement syscalls and opcodes that are required by `op-program` so there are some unimplemented. The risk is that there is some previously untested code path that uses an opcode or syscall that we haven't implemented and this code path ends up being exercised by an input condition some time in the future.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for not implementing unused syscalls and opcodes to mitigate this failure mode? I'm guessing either (1) there are so many that it'd not feasible, or (2) implementation and verification of each is time consuming? It would be good to expand either the description, or mitigation, to answer this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a note that it's not feasible, given time, to implement a full blown Linux VM that supports all syscalls. Implementing all syscalls also complicates the PFVM code, increasing odds that something is not emulated correctly.


- **Description:** We only aim to implement syscalls and opcodes that are required by `op-program` so there are some unimplemented. The risk is that there is some previously untested code path that uses an opcode or syscall that we haven't implemented and this code path ends up being exercised by an input condition some time in the future.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** We periodically use Cannon to execute the op-program using inputs from op-mainnet and op-sepolia. This periodic cannon runner (vm-runner) runs on oplabs infrastructure. The vm-runner samples game inputs for the latest L2 safe head every 2 hours and uses cannon to execute the op-program using the sampled inputs. Note that this sampling does not include every game created.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's link to vm-runner

- **Description:** The op-program may run out of memory, causing it to crash.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** The 64-bit address space virtually eliminates memory exhaustion risks. Go's concurrent garbage collector automatically manages memory through scheduled background goroutines.
- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions that would result due to a program crash.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about adding op-dispute-mon link

- **Risk Assessment:** High severity, low likelihood.
- Low likelihood: This requires an attacker to craft inputs that not only induce high memory usage, but also corrupt or spray the heap in a way that either produces invalid fault proofs or prevents valid fault proofs from being generated.
- **Mitigations:** As with [Insufficient memory in the program](#insufficient-memory-in-the-program), the 64-bit address space effectively prevents this from occurring. Furthermore, the Go runtime checks memory allocations against heap corruption. However, such memory protections may not hold due to bugs in the Go runtime.
- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions that would result due to honest claims being disputed at the bottom of the game tree.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment about adding op-dispute-mon link

- **Description:** This is when there is not a known prestate (preimage) for a given absolute prestate hash in the dispute game implementations.
- **Risk Assessment:** High severity, Low likelihood.
- **Mitigations:** Every absolute prestate is built off of an op-program release tag. The prestate is [build is reproducible](https://github.com/ethereum-optimism/optimism/blob/68f77aaa317b9184cbbcd1526bc57bce1722906b/op-program/Dockerfile.repro) such that the same prestate is emitted regardless of the environment.
Furthermore, governance and Guardian signers will be instructed to reproduce the prestate build themselves and check that the prestate hash matches the op-program release that will be referenced in the MT-Cannon governance post.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about a dedicated validations file in superchain-ops with generic instructions for this, so it can be referenced by all playbooks with prestate changes? Similar to how https://github.com/ethereum-optimism/superchain-ops/blob/main/NESTED-VALIDATION.md applies to all nested safe playbooks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add an AI to add instructions for prestate validation to the superchain-ops repo.


### Failure to run correct VM based on absolute prestate input

- **Description:** The off-chain Cacurrent version of the nnon [attempts to run the correct VM version based on the absolute prestate input](https://github.com/ethereum-optimism/design-docs/blob/0034943e42b8ab5f9dd9ded2ef2b6b55359c922c/cannon-state-versioning.md). If it doesn't work correctly the on-chain steps would not match.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are Cacurrent and nnon typos or domain-specific words? If the latter, can we define them inline or link to definitions? Since currently I cannot really understand this failure mode

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems to be "CA" and "nnon" from "CAnoon" this is a typo likely you are right!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix.

- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** [Diffeerential testing](https://github.com/ethereum-optimism/optimism/tree/eabf70498f68f321f5de003f1d443d3e3c8100b8/cannon/mipsevm/tests) asserts identical on-chain and off-chain execution.
- **Detection:** An op-challenger fails to fault prove an invalid claim using a witness generated offchain.
- **Recovery Path(s)**: Depends on the specifics. If the onchain VM implementation is "more correct", then fixing this can be done solely offchain. Otherwise, a governance vote will be needed. As usual, the [Fault Proof Recovery](https://www.notion.so/oplabs/RB-000-Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f) provides the best guidance on this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "more correct" mean? Is this covered in the runbook?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More correct here refers to the quality of the emulation. I'll tweak this paragraph to make that clearer.


- **Description:** A livelocked execution prevents an honest challenger from generating a fault proof.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** Manual review of the op-program and a quick review of Go runtime internals. The op-program uses 3 threads, and only one of those threads is used by the mutator main function. This makes livelocks very unlikely. This [issue](https://github.com/ethereum-optimism/optimism/issues/11979) looks into the livelock problem with possible solutions. The proposed solutions are deferred for future work as the risk of a livelock is considered too low to be addressed immediately.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This issue is now closed, but it says "To future-proof the guest program, we should revisit this". Where are we tracking revisiting this issue/mitigation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The linked issue isn't a mitigation and only serves as context on our investigation into possible mitigations. But we won't be pursuing those mitigations in the forseeable future. I'll remove it to avoid confusion in the FMA.

Comment on lines +142 to +144
As such, there will be a brief moment where there are two sets of `CANNON` games that using singlethreaded and multithreaded VMs.

- Description: This occurs when either the call to the DisputeGameFactory could not be made due to grossly unfavorable base fees on L1, an invalidly approved safe nonce, or a successful execution to a misconfigured dispute game implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this failure mode. Do you mean that there will be unfinalized games of both types for a brief moment of up to a few days (until the very last ST-Cannon game resolves)? The "successful execution to a misconfigured dispute game implementation" aspect makes sense, but the rest of it I don't understand. Would the transaction just not be executed if we e.g. signed the wrong nonce?

Comment on lines 157 to 158
- [ ] Third-party audit the offchain and onchain VM implementation and specification (Assignee: @inphi)
- [ ] Add a healthcheck for the vm-runner (Assignee: @pauldowman)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are either of these completed now? If so we can mark them as such and include a link to the applicable artifacts


An audit of the multithreaded VM is not required per the [OP Labs Audit Framework](https://gov.optimism.io/t/op-labs-audit-framework-when-to-get-external-security-review-and-how-to-prepare-for-it/6864).
A failure in the new Cannon VM and thus dispute games is mitigated by an airgap in finalized withdrawals. Furthermore, there's a window whereby the Security Council can override the results of invalid games.
Nonetheless, we will be auditing the new VM.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the prior text of "an audit is not needed given the audit framework" it would be good to explain why we chose to get an audit

- **Description:** The op-program may run out of memory, causing it to crash.
- **Risk Assessment:** High severity, low likelihood.
- **Mitigations:** The 64-bit address space virtually eliminates memory exhaustion risks. Go's concurrent garbage collector automatically manages memory through scheduled background goroutines.
- **Detection:** op-dispute-mon forecasts and alerts on undesirable game resolutions that would result due to a program crash.
Copy link
Contributor

@Ethnical Ethnical Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also monitore the RAM of the machine that run op-program and raise an alert if the RAM is higher of 500% in the last X minutes?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants