Skip to content

Conversation

mralj
Copy link
Contributor

@mralj mralj commented Mar 11, 2025

📝 Summary

rbuilder can now work with any node via IPC.
This means that any node can provide the state for rbuilder, revm is still used as the EVM.
Node requirements are that it exposes the following RPC calls:

Calls not available in the ETH JSON RPC spec:

  1. rbuilder_calculateStateRoot for state root calculation
  2. rbuilder_getCodeByHash given Bytecode hash, returns Bytecode

Calls optimised for rbuilder, but have a counterpart in the ETH JSON RPC spec:

  1. rbuilder_getBlockHash gets the block hash given block number (similar to eth_getBlockByNumber, but just returns block hash)
  2. rbuilder_getAccount gets account info (similar to eth_getProof, but w/o proof stuff)

To use rbuilder with node via IPC, the config.toml must have the following config (example):

[ipc_provider]
ipc_path = "/root/execution-data/nethermind.ipc"
mempool_server_url = "ws://localhost:8546"
request_timeout_ms = 75

Implementation details

IPC

This implementation was initially intended to introduce a remote state provider. By remote, I mean that the idea was that state could be provided via HTTP/WS/IPC. Unfortunately, due to implementation issues/constraints, I've decided only to implement state provisioning via IPC.
I don't think this has any practical downside, especially since the state provider must be fast. There is a non-trivial number of calls to read state (~300/s), meaning it would be unrealistic to use this over the network and have near disk read latency.

Code-wise, issues above and constraints stem mainly from the fact that the traits to read state are sync. Initially, I relied on tokio and alloy.rs to fetch this remote state, but this implementation had many issues.
Firstly, each call to fetch any data (e.g. fetching account or Bytecode) had to be wrapped in a function call like this:

    /// Runs fututre in sync context
    // StateProvider(Factory) traits require sync context, but calls to remote provider are async
    // What's more, rbuilder is executed in async context, so we have a situation
    // async -> sync -> async
    // This helper function allows execution in such environment
    fn run<F, R>(&self, f: F) -> R
    where
        F: Future<Output = R>,
    {
        tokio::task::block_in_place(|| self.runtime_handle.block_on(f))
    }

This adds additional overhead on Tokio runtime and doesn't play well with some parts of the codebase, specifically mutex locking. Not to go too deep into explaining issues around this in the PR description, but we would end up in scenarios where the whole Tokio runtime I/O would be blocked or we would dead-lock parking-lot mutexes (in some scenarios).

The solutions (monitoring thread + async mutex) seemed hacky and suboptimal.
This is why, in the end, I reached for sync (but concurrent) IPC solution, which I implemented from scratch here. It's called REIPC (coming from request/response IPC).

This solution was tested using Nethermind node, and while Nethermind will do some improvements on IPC to reduce latency and increase throughput, here are the initial request&response latencies:

CleanShot 2025-03-01 at 11 44 11@2x

Dashmap & QuickCache in IPC provider

We need caches because otherwise the number of IPC calls would be pretty high (in thousands per sec).
I also reached for concurrent caches because both StateProviderFactory and StateProvider need to be Send + Sync.

QuickCache is used so that I don't have to implement concurrent-cache invalidation by hand :)
Here is some info on QuickCache vs Moka (other popular caching crate), TL;DR; for our simple case QuickCache seems a better fit.

On enum StateProviderFactories

The reason cli.rs passes StateProviderFactories enum to config.new_builder is because I wanted the state provider for rbuilder to be chosen via config at the runtime.
This is why, AFAIK, static dispatch is not an option. So I was left with the choice of refactoring to dynamic dispatch (akin to StateProviderBox) OR the enum solution.
I chose the enum solution for following reasons:
1. It seemed to me that code diff would be smaller (smaller change to implement)
2. AFIAK it's faster. Enum matching will compile to JMP vs CALL (in case of dynamic dispatch), which compiler will be able to optimize better (especially in this scenario), and given branch predicitoning I guess that it'll be almost free (+ no need for vtable/pointer loading).]

UPDATE: this was properly handled in this commit thanks to suggestions from @ZanCorDX

On MempoolSource enum

The reason I chose to use WebSockets for streaming transactions when rbuilder uses IPC state provider is because, currently, REIPC doesn't support ETH subscriptions/streams.

✅ I have completed the following steps:

  • Run make lint
  • Run make test
  • Added tests (if applicable)

@mralj mralj force-pushed the mralj/ipc-state-provider-reipc branch from 2624095 to a9c6535 Compare March 11, 2025 18:06
/// Gets block header given block hash
fn header_by_number(&self, num: u64) -> ProviderResult<Option<Header>> {
let block = rpc_call::<
_,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm in favor of using concrete type here. This will help in case we will do any refactoring and accidentally change parameters. This way we would be able to find problem instantly.

Copy link
Contributor Author

@mralj mralj Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I completely missed the comment 🤦‍♂️
I disagree; the type here is inferred from the function arguments, i.e., (num, false). ' If we mess it up, rust will yell at us.
That being said, I don't feel strongly about this if @ZanCorDX agrees with you, I'll refactor :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, will it yell if we change params to (num, false, false) for example? If yes - all good
But i assume it will send them to rpc and get error because there is extra false param

Copy link
Contributor Author

@mralj mralj Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, it won't :)
So the idea is that if we have written this out in 2 places (in the type and in the args) there is smaller chance of us making mistake in both places when refactoring?

ie. If we change one and not the other compiler will give us error, and we'll stop to think about i (hopefully not just copying args to type, because then compiler will tell you everything is ok and you'll still get rpc error xD)

But all in all I agree this reduces the chance of developer error, I'll fix it :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SozinM done :)
Sorry it took me so long!

fbe5087

@dvush
Copy link
Contributor

dvush commented Mar 12, 2025

Great! Thanks, I'll review it this week. (cc @ZanCorDX don't merge before pls)

@mralj

This comment was marked as resolved.

@mralj mralj force-pushed the mralj/ipc-state-provider-reipc branch from 1b2b61b to 2dc4f41 Compare March 14, 2025 11:01
@mralj mralj mentioned this pull request Mar 15, 2025
3 tasks
ZanCorDX pushed a commit that referenced this pull request Mar 17, 2025
## 📝 Summary
When testing #489, I noticed that when processing new orders, if one of
them fails, we stop further processing.
This PR addresses this issue.

## 💡 Motivation and Context

The effect of the above-explained behaviour, in the worst case scenario,
is that the whole slot can _fail_, meaning if the processing of the new
order fails for the very first order, the whole slot ends up being
empty, and in non-worse scenario we loose % of transactions, depending
on when the processing was stopped.

### When processing new order fails?
The processing of the new orders will fail only if the _StateProvider_
was unable to _provide_ the Nonce, i.e. the _StateProvider_ errors.
This is edge-case, but this can happen. 

I don't see any downsides with this approach, so this is why I opened PR
instead of an issue. (I'm not saying hat there aren't any, just that _I_
don't see them 🙂)

### Process new simulations
Similar argument can be made for the processing new simulations,
specifically for the following:

```Rust
    pub fn submit_simulation_tasks_results(
        &mut self,
        results: Vec<SimulatedResult>,
    ) -> Result<(), ProviderError> {
        for result in results {
            // NOTE: we can refactor the following line to
            // let _ = self.process_simulation_task_result(result);
            self.process_simulation_task_result(result)?;
        }
        Ok(())
    }
```
The code above only errors in case of the _nonce issues_, and in that
scenario we are stopping processing of all simulation tasks.
I haven't pushed this change because I wanted so se how you feel about
the this, especially since simulations _seem more serious_ than orders.

---

## ✅ I have completed the following steps:

* [x] Run `make lint`
* [x] Run `make test`
* [ ] Added tests (if applicable)
@mralj mralj force-pushed the mralj/ipc-state-provider-reipc branch 2 times, most recently from 8fece37 to 71283e6 Compare March 24, 2025 09:30
@mralj mralj force-pushed the mralj/ipc-state-provider-reipc branch from 71283e6 to 7514d49 Compare March 27, 2025 15:16
Copy link
Contributor

@dvush dvush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, pretty clear and isolated change. I've been running it locally and it works fine.

The only thing that I notices is root hash is much slower on this version. I can give more data tomorrow when I run the default rbuilder. But if feels like its 2-3x slower.

I think our root hash implementation (eth-sparse-mpt) is quite good for the block building purposes and it will be much better than generic root hash when we finish all the possible improvements and it can be used for integrations with other clients through API. The only thing that it needs is ability to fetch proofs using path. Instead of default eth_getProof that asks for address we need to fetch proof by hash(address) and hash(slot) for the given hash(address)/address

ipc_provider: RpcProvider,

code_cache: Arc<DashMap<B256, Bytecode>>,
state_provider_by_hash: Arc<Cache<BlockHash, Arc<IpcStateProvider>>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please clarify for me the difference between DashMap and Cache here?

Copy link
Contributor Author

@mralj mralj Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code_cache can live as long as rbuilder lives, i.e., it doesn't have to be invalidated because it will stay the same no matter the current state.

On the other hand, we need to clean state_provider_by_hash after we are done with the StateProvider for some block, because after slot is finished, this data becomes useless and just grows.
The reason I went with Cache here instead of Dashmap is because I didn't want to write concurrent cache invalidation myself :)

@mralj
Copy link
Contributor Author

mralj commented Mar 27, 2025

I think our root hash implementation (eth-sparse-mpt) is quite good for the block building purposes and it will be much better than generic root hash when we finish all the possible improvements and it can be used for integrations with other clients through API. The only thing that it needs is ability to fetch proofs using path. Instead of default eth_getProof that asks for address we need to fetch proof by hash(address) and hash(slot) for the given hash(address)/address.

Thanks for pointing this out, I'll discuss with my (NMC) team and see how we can leverage this, I'll also add some logging to our node to see how much milli seconds State Root Calculation actually takes and will get back to you on numbers :)

@mralj mralj force-pushed the mralj/ipc-state-provider-reipc branch from fbe5087 to eed8474 Compare March 28, 2025 15:46
@ZanCorDX ZanCorDX merged commit 9aed6b7 into flashbots:develop Mar 28, 2025
2 of 4 checks passed
@mralj mralj deleted the mralj/ipc-state-provider-reipc branch April 2, 2025 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants