From 0bc6fce1e467464d0198528f03dbb8f32c31cafc Mon Sep 17 00:00:00 2001 From: Paul Dowman Date: Fri, 28 Mar 2025 17:31:29 -0600 Subject: [PATCH 1/2] Add empty draft of FMA for interop proofs --- security/fma-interop-proofs.md | 106 +++++++++++++++++++++++++++++++++ 1 file changed, 106 insertions(+) create mode 100644 security/fma-interop-proofs.md diff --git a/security/fma-interop-proofs.md b/security/fma-interop-proofs.md new file mode 100644 index 00000000..f21a0d07 --- /dev/null +++ b/security/fma-interop-proofs.md @@ -0,0 +1,106 @@ +# [Interop Proofs]: Failure Modes and Recovery Path Analysis + + + + +- [Introduction](#introduction) +- [Failure Modes and Recovery Paths](#failure-modes-and-recovery-paths) + - [[Name of Failure Mode 1]](#name-of-failure-mode-1) + - [[Name of Failure Mode 2]](#name-of-failure-mode-2) +- [Audit Requirements](#audit-requirements) +- [Action Items](#action-items) +- [Appendix](#appendix) + - [Appendix A: This is a Placeholder Title](#appendix-a-this-is-a-placeholder-title) + + + +_Italics are used to indicate things that need to be replaced._ + +| | | +| ------------------ | -------------------------------------------------- | +| Author | _Author Name_ | +| Created at | 2025-03-28 | +| Initial Reviewers | _Reviewer Name 1, Reviewer Name 2_ | +| Need Approval From | _Security Reviewer Name_ | +| Status | Draft | + +> [!NOTE] +> 📒 Remember: +> +> - The single approver in the “Need Approval From” must be from the Security team. +> - Maintain the “Status” property accordingly. An FMA document can have the following statuses: +> - **Draft 📝:** Doc is created but not yet ready for review. +> - **In Review 🔎:** Security is reviewing, and Engineering is iterating on the design. A checklist of action items will be created during this phase. +> - **Implementing Actions 🛫:** Security has signed off on the content of the document, including the resulting action items. 
Engineering is responsible for implementing the action items, and updating the checklist. +> - **Final 👍:** Security will transition the status of the document to Final once all action items are completed. + +> [!TIP] +> Guidelines for writing a good analysis, and what the reviewer will look for: +> +> - Show your work: Include steps and tools for each conclusion. +> - Completeness of risks considered. +> - Include both implementation and operational failure modes +> - Provide references to support the reviewer. +> - The size of the document will likely be proportional to the project's complexity. +> - The ultimate goal of this document is to identify action items to improve the security of the project. The FMA review process can be accelerated by proactively identifying action items during the writing process. + +## Introduction + +This document covers _[project name, high-level summary of the project, and scope of this analysis]._ + +Below are references for this project: + +- _Link 1, e.g. project charter or design doc_ +- _Link 2, etc._ + +## Failure Modes and Recovery Paths + +**_Use one sub-header per failure mode, so the full set of failure modes is easily scannable from the table of contents._** + +### FM1: [Name of Failure Mode 1] + +- **Description:** _Details of the failure mode go here. What the causes and effects of this failure?_ +- **Risk Assessment:** _Simple low/medium/high rating of impact (severity) + likelihood._ +- **Mitigations:** _What mechanisms are in place, or what should we add, to:_ + 1. _reduce the chance of this occurring?_ + 2. _reduce the impact of this occurring?_ +- **Detection:** _How do we detect if this occurs?_ +- **Recovery Path(s)**: _How do we resolve this? Is it a simple, quick recovery or a big effort? Would recovery require a governance vote or a hard fork?_ + +### FM2: [Name of Failure Mode 2] + +- **Description:** _Details of the failure mode go here. 
What the causes and effects of this failure?_ +- **Risk Assessment:** _Simple low/medium/high rating of impact (severity) + likelihood._ + **Mitigations:** _What mechanisms are in place, or what should we add, to:_ + 1. _reduce the chance of this occurring?_ + 2. _reduce the impact of this occurring?_ +- **Detection:** _How do we detect if this occurs?_ +- **Recovery Path(s)**: _How do we resolve this? Is it a simple, quick recovery or a big effort? Would recovery require a governance vote or a hard fork?_ + +### Generic items we need to take into account: + +See [generic hardfork failure modes](./fma-generic-hardfork.md) and [generic smart contract failure modes](./fma-generic-contracts.md). +Incorporate any applicable failure modes with FMA-specific mitigations and detections directly into this document. + +- [ ] Check this box to confirm that these items have been considered and updated if necessary. + +## Action Items + +Below is what needs to be done before launch to reduce the chances of the above failure modes occurring, and to ensure they can be detected and recovered from: + +- [ ] Resolve all comments on this document and incorporate them into the document itself (Assignee: document author) +- [ ] _Action item 2 (Assignee: tag assignee)_ +- [ ] _Action item 3 (Assignee: tag assignee)_ + +## Audit Requirements + +_Given the failure modes and action items, will this project require an audit? See [FMAs in the SDLC](https://github.com/ethereum-optimism/pm/blob/main/src/fmas.md#determine-audit-requirements) for a reference decision making framework. Please explain your reasoning._ + +## Appendix + +### Appendix A: This is a Placeholder Title + +_Appendices must include any additional relevant info, processes, or documentation that is relevant for verifying and reproducing the above info. 
Examples:_ + +- _If you used certain tools, specify their versions or commit hashes._ +- _If you followed some process/procedure, document the steps in that process or link to somewhere that process is defined._ From 542e4f862e0fe76e778a1e6708d10ecdb2f0d287 Mon Sep 17 00:00:00 2001 From: Adrian Sutton Date: Thu, 8 May 2025 14:56:33 +1000 Subject: [PATCH 2/2] Add initial failure modes for interop proofs. --- security/fma-interop-proofs.md | 213 ++++++++++++++++++++++++++------- 1 file changed, 169 insertions(+), 44 deletions(-) diff --git a/security/fma-interop-proofs.md b/security/fma-interop-proofs.md index f21a0d07..2327fa07 100644 --- a/security/fma-interop-proofs.md +++ b/security/fma-interop-proofs.md @@ -16,23 +16,31 @@ _Italics are used to indicate things that need to be replaced._ -| | | -| ------------------ | -------------------------------------------------- | -| Author | _Author Name_ | -| Created at | 2025-03-28 | -| Initial Reviewers | _Reviewer Name 1, Reviewer Name 2_ | -| Need Approval From | _Security Reviewer Name_ | -| Status | Draft | +| | | +|--------------------|--------------------------| +| Author | Adrian Sutton | +| Created at | 2025-05-08 | +| Initial Reviewers | Mofi Taiwo, Paul Dowman | +| Need Approval From | _Security Reviewer Name_ | +| Status | Draft | > [!NOTE] > 📒 Remember: > > - The single approver in the “Need Approval From” must be from the Security team. > - Maintain the “Status” property accordingly. An FMA document can have the following statuses: -> - **Draft 📝:** Doc is created but not yet ready for review. -> - **In Review 🔎:** Security is reviewing, and Engineering is iterating on the design. A checklist of action items will be created during this phase. -> - **Implementing Actions 🛫:** Security has signed off on the content of the document, including the resulting action items. Engineering is responsible for implementing the action items, and updating the checklist. 
-> - **Final 👍:** Security will transition the status of the document to Final once all action items are completed. + > + +- **Draft 📝:** Doc is created but not yet ready for review. + +> - **In Review 🔎:** Security is reviewing, and Engineering is iterating on the design. A checklist of action items + will be created during this phase. + > + +- **Implementing Actions 🛫:** Security has signed off on the content of the document, including the resulting action + items. Engineering is responsible for implementing the action items, and updating the checklist. + +> - **Final 👍:** Security will transition the status of the document to Final once all action items are completed. > [!TIP] > Guidelines for writing a good analysis, and what the reviewer will look for: @@ -42,65 +50,182 @@ _Italics are used to indicate things that need to be replaced._ > - Include both implementation and operational failure modes > - Provide references to support the reviewer. > - The size of the document will likely be proportional to the project's complexity. -> - The ultimate goal of this document is to identify action items to improve the security of the project. The FMA review process can be accelerated by proactively identifying action items during the writing process. +> - The ultimate goal of this document is to identify action items to improve the security of the project. The FMA + review process can be accelerated by proactively identifying action items during the writing process. ## Introduction -This document covers _[project name, high-level summary of the project, and scope of this analysis]._ +This document covers the changes made to the fault proof system to support interop. These are detailed in the +[interop fault proofs spec](https://specs.optimism.io/interop/fault-proof.html). -Below are references for this project: +### Shared DisputeGameFactory -- _Link 1, e.g. 
project charter or design doc_ -- _Link 2, etc._ +The `DisputeGameFactory` is now shared by all chains in a dependency set. Proposals are made using super roots which +include the output root for all chains in the dependency set. A single fault dispute game is used to decide the validity +of the state of all chains. -## Failure Modes and Recovery Paths +### SuperFaultDisputeGame and SuperPermissionedDisputeGame + +Two new game implementations are provided: + +1. `SuperFaultDisputeGame` which replaces `FaultDisputeGame` +2. `SuperPermissionedDisputeGame` which replaces `PermissionedDisputeGame` + +Both of these contracts are largely the same as the pre-interop versions they replace. The key changes include: + +* Removing code to support challenging the L2 block number (now handled in the fault proof program) +* Preventing games being created with the root claim `keccak("invalid")` which is used as a marker for an always invalid + state +* Modifying the L2BlockNumber local key to always populate the PreimageOracle with the proposal timestamp for the game, + regardless of the claim's position in the dispute tree + +### Multistage Fault Proof Program -**_Use one sub-header per failure mode, so the full set of failure modes is easily scannable from the table of contents._** +The fault proof program (`op-program` and `kona`, though only `op-program` is currently in production) has been updated +to use separate steps to derive the chain. This ensures that the fault proof VM can execute each step in a reasonable +amount of time. In particular it ensures that a single step only needs to execute the full block from a single chain, +keeping resource usage roughly equivalent to pre-interop. There are a fixed 128 steps per timestamp transition. The +first two steps reproduce the next block for a single chain from the L1 batch data without verifying executing messages. +Following steps are a no-op, to simplify the addition of more chains in the future. 
The final step verifies executing +messages across all chains and replaces any blocks found to be invalid with deposit-only blocks. -### FM1: [Name of Failure Mode 1] +### Invalid Proposal Timestamp Handling -- **Description:** _Details of the failure mode go here. What the causes and effects of this failure?_ -- **Risk Assessment:** _Simple low/medium/high rating of impact (severity) + likelihood._ -- **Mitigations:** _What mechanisms are in place, or what should we add, to:_ - 1. _reduce the chance of this occurring?_ - 2. _reduce the impact of this occurring?_ -- **Detection:** _How do we detect if this occurs?_ -- **Recovery Path(s)**: _How do we resolve this? Is it a simple, quick recovery or a big effort? Would recovery require a governance vote or a hard fork?_ +The L2 block number challenge is removed from the contracts and replaced by the fault proof program transitioning to +the invalid state (`keccak("invalid")`) if data is unavailable on L1 to reach the proposal timestamp. Previously, +reaching the L1 head was considered the end of derivation and simple trace extension was applied. This left ambiguity +as to whether the derivation had reached the proposal block (indicating it is valid) or had reached the L1 head (invalid). -### FM2: [Name of Failure Mode 2] +### op-challenger Updates -- **Description:** _Details of the failure mode go here. What the causes and effects of this failure?_ -- **Risk Assessment:** _Simple low/medium/high rating of impact (severity) + likelihood._ - **Mitigations:** _What mechanisms are in place, or what should we add, to:_ - 1. _reduce the chance of this occurring?_ - 2. _reduce the impact of this occurring?_ -- **Detection:** _How do we detect if this occurs?_ -- **Recovery Path(s)**: _How do we resolve this? Is it a simple, quick recovery or a big effort? Would recovery require a governance vote or a hard fork?_ +`op-challenger` has been updated to handle the new game types. 
`op-supervisor` is now used as its primary source of +truth. The `TraceProvider` used for the top half of the game is a new implementation that follows the multi-step state +transition used by interop. + +### op-dispute-mon Updates + +`op-dispute-mon` has been updated to support the new game types. `op-supervisor` is now used as its primary source of +truth for proposals, using super roots instead of output roots. Monitoring is otherwise unchanged. + +## Failure Modes and Recovery Paths + +### FM1: Unavailability of Preimage Data + +- **Description:** + - The final consolidation step validates executing messages. To do this it must have access to the receipts from the + block as derived from L1 batch data. + - When that block is invalid, it will have been re-orged out of the canonical chain by honest nodes and may have been + pruned from its database. + - As the fault proof program derives the original optimistic block in a separate step it does not have the receipts. +- **Risk Assessment:** + - Medium impact, low likelihood +- **Mitigations:** + - op-geth does not currently prune data from non-canonical blocks so would always have the required data available. + - Interop proofs action tests run using a source node that only contains canonical blocks. + - We periodically execute op-program using inputs from op-mainnet and op-sepolia. This periodic cannon + runner ([vm-runner]) runs on oplabs infrastructure using source nodes available to op-challenger. The vm-runner + samples game inputs for the latest L2 safe head every 2 hours and uses cannon to execute the op-program using the + sampled inputs. Note that this sampling does not include every game created. +- **Detection:** + - Existing monitoring for forecast or actual invalid game resolution +- **Recovery Path(s):** + - Follow the [Fault Proof Recovery Runbook] + - Deploy a fixed op-challenger. Does not require governance approval or a hard fork. 
+ - If sufficient context is not available from the preimage hint emitted by the fault proof program, deploy an updated + prestate with additional information in the hint. Requires governance approval but can be done without a hard fork. + +### FM2: Problem in Game Design + +- **Description:** + - A problem in the design of the dispute game, in particular in the multistep super root state transition, could lead + to games resolving incorrectly. + - This may include valid proposals being invalidated and invalid proposals being confirmed. +- **Risk Assessment:** + - Medium impact, low likelihood +- **Mitigations:** + - [Suite of action tests] verifying the sequence of expected states to transition between super roots in a wide + variety of cases. +- **Detection:** + - Existing monitoring for forecast and actual incorrect game results. +- **Recovery Path(s):** + - Follow the [Fault Proof Recovery Runbook] + - A fixed design would need to be devised, implemented, approved by governance and deployed. + +### FM3: Mismatch Between op-challenger and op-program + +- **Description:** + - To ensure it always wins games, op-challenger must calculate claim values that match the claim op-program considers + correct. + - A bug in op-challenger or op-program could cause these claims to be mismatched. + - This could lead to games resolving incorrectly, including valid proposals being invalidated and invalid proposals + being confirmed. +- **Risk Assessment:** + - Medium impact, low likelihood +- **Mitigations:** + - [Suite of action tests] verifying the honest trace used by op-challenger matches the expected claims by op-program. + - Bottom half of the game continues to post cannon states as in pre-interop games. +- **Detection:** + - Existing monitoring for forecast and actual incorrect game results. + - We periodically execute op-program in [vm-runner] using inputs from op-mainnet and op-sepolia. 
The expected output + is calculated using the same op-challenger trace provider as is used to calculate the correct claims when playing + games. +- **Recovery Path(s):** + - Follow the [Fault Proof Recovery Runbook] + - If the bug is in op-challenger, deploy a fixed version. Does not require governance approval or a hard fork. + - If the bug is in op-program, deploy a fix via a new prestate. Requires governance approval but can be done without a + hard fork. + +### FM4: Increased Maximum Preimage Size + +- **Description:** + - The consolidation step introduces the first requirement for op-program to read receipts from L2 chains. + - Since L2 chains have higher gas limits it is possible to create larger receipts in a single block which may need to + be added into the PreimageOracle. + - The larger preimages may incur higher costs for the honest actor to populate the `PreimageOracle`. + - The preimage can still be posted via the existing large preimage proposal process which op-challenger automatically + utilises. +- **Risk Assessment:** + - Low impact, low likelihood +- **Mitigations:** + - Preimages only need to be populated into the `PreimageOracle` when the max depth is reached and `step()` is called. + This requires posting at least 300 ETH in bonds. +- **Detection:** + - [Existing monitoring for large preimage proposals](https://github.com/ethereum-optimism/k8s/blob/0eb3b759ecfe52ed36ece0531a559f11e699419f/grafana-cloud/terraform-rules/challenger.yaml#L94-L102) +- **Recovery Path(s):** + - Automatically handled by op-challenger + - Potentially increase bond sizes via new dispute game implementation deployment (requires governance approval). ### Generic items we need to take into account: -See [generic hardfork failure modes](./fma-generic-hardfork.md) and [generic smart contract failure modes](./fma-generic-contracts.md). +See [generic hardfork failure modes](./fma-generic-hardfork.md) +and [generic smart contract failure modes](./fma-generic-contracts.md). 
Incorporate any applicable failure modes with FMA-specific mitigations and detections directly into this document. -- [ ] Check this box to confirm that these items have been considered and updated if necessary. +- [x] Check this box to confirm that these items have been considered and updated if necessary. ## Action Items -Below is what needs to be done before launch to reduce the chances of the above failure modes occurring, and to ensure they can be detected and recovered from: +Below is what needs to be done before launch to reduce the chances of the above failure modes occurring, and to ensure +they can be detected and recovered from: - [ ] Resolve all comments on this document and incorporate them into the document itself (Assignee: document author) -- [ ] _Action item 2 (Assignee: tag assignee)_ -- [ ] _Action item 3 (Assignee: tag assignee)_ + ## Audit Requirements -_Given the failure modes and action items, will this project require an audit? See [FMAs in the SDLC](https://github.com/ethereum-optimism/pm/blob/main/src/fmas.md#determine-audit-requirements) for a reference decision making framework. Please explain your reasoning._ +An audit has been performed for op-program and is scheduled for the overall interop system, including fault proofs. ## Appendix -### Appendix A: This is a Placeholder Title +### Appendix A: Additional Interop FMAs + +Other interop FMAs have some overlap with the proofs system: + +* [Supervisor FMA](./fma-supervisor.md) includes failure modes around unavailability or incorrect results from op-supervisor +* [Portal FMA](./fma-interop-portal.md) includes failure modes related to the contract migration + +[Suite of action tests]: https://github.com/ethereum-optimism/optimism/blob/fa86f81da6bed8489508907ed0956134d029c09f/op-e2e/actions/interop/proofs_test.go -_Appendices must include any additional relevant info, processes, or documentation that is relevant for verifying and reproducing the above info. 
Examples:_ +[Fault Proof Recovery Runbook]: https://www.notion.so/oplabs/Fault-Proofs-Recovery-Runbook-8dad0f1e6d4644c281b0e946c89f345f?pvs=4 -- _If you used certain tools, specify their versions or commit hashes._ -- _If you followed some process/procedure, document the steps in that process or link to somewhere that process is defined._ +[vm-runner]: https://github.com/ethereum-optimism/optimism/tree/develop/op-challenger/runner
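
Reviewer note: the step schedule described under "Multistage Fault Proof Program" in this patch can be sketched as below. This is an illustrative Go sketch only, not code from `op-program` or `kona`. The names `ClassifyStep`, `StepKind`, and the `chainCount` parameter are hypothetical. The document states only that there are 128 steps per timestamp transition, that the first two steps each reproduce a block, that the following steps are no-ops, and that the final step consolidates. Reading the two derivation steps as one step per chain is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// StepsPerTimestamp is the fixed number of steps per timestamp transition
// stated in the document.
const StepsPerTimestamp = 128

// StepKind labels what a step does. The names are illustrative only.
type StepKind string

const (
	DeriveBlock StepKind = "derive-block" // reproduce one chain's next block from L1 batch data
	Padding     StepKind = "no-op"        // reserved so more chains can be added later
	Consolidate StepKind = "consolidate"  // verify executing messages, replace invalid blocks
)

// ClassifyStep maps a step index within one timestamp transition to its kind.
// chainCount is a hypothetical parameter: the document describes two
// derivation steps, which this sketch reads as one step per chain.
func ClassifyStep(step, chainCount int) (StepKind, error) {
	if chainCount < 1 || chainCount > StepsPerTimestamp-1 {
		return "", errors.New("chain count must leave room for the consolidation step")
	}
	if step < 0 || step >= StepsPerTimestamp {
		return "", fmt.Errorf("step %d outside [0, %d)", step, StepsPerTimestamp)
	}
	switch {
	case step < chainCount:
		return DeriveBlock, nil
	case step == StepsPerTimestamp-1:
		return Consolidate, nil
	default:
		return Padding, nil
	}
}

func main() {
	// Print the kind of a few representative steps for a two-chain set.
	for _, step := range []int{0, 1, 2, 126, 127} {
		kind, _ := ClassifyStep(step, 2)
		fmt.Printf("step %3d: %s\n", step, kind)
	}
}
```

Under these assumptions, adding a chain would only turn one padding step into a derivation step, leaving the total step count and the position of the consolidation step unchanged. That matches the stated motivation for the no-op steps.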