-
Notifications
You must be signed in to change notification settings - Fork 177
Description
While doing XEN tests I got this error:
FAILED tests/benchmark/mainnet/test_state_xen.py::test_xen_approve_delete_existing_slots[fork_Prague-blockchain_test] - AssertionError: Total gas used (44960131) does not match expected benchmark gas (22500000), difference: 22460131
(Yes this difference is huge, and in this case it is certainly wrong!)
However there are certainly scenarios and tests where we cannot get the exact amount of gas we expect. Here are two scenarios:
- A scenario where we do writes to storage. This could be anything which does SSTORE, LOGs, contract creations, value transfers, etc. We do not want that these transactions run out-of-gas, because this will invalidate all storage writes and this will be rolled back (we want these storage writes to be part of the benchmark, e.g. XEN benchmark where we are mainly interested in the storage changes). So, for these scenarios, if we have a benchmark test where we are given
gas_benchmark_value
in the test, we should (if we want to spend exactly this amount of gas) find a way to carefully generate this test to spend exactly that (and not one more (OOG), or one less (this fails the assertion)).
In order to get this exact amount (which is theoretically possible) we would have to use trick which will use operations which are not part of the scenario which we want to benchmark. (For instance insert a final tx which just spends all gas - it is OK if this goes OOG. This final tx however is not part of the opcodes/scenario we want to bench and thus adds noise to the measurement)
- Any scenario where the final transaction in the block has a refund. In this case, above tricks will not work, because if the tx gets the storage deletion refund (setting a storage value from nonzero to zero), which is 4800, it will thus leave at least 4800 gas in the block. In this case we cannot insert another transaction because there is not enough gas left to pay the intrinsic tx fee of 21000.
Note: for scenarios where we do not write to storage, e.g. execution opcode benchmarks like doing a lot of MULs, EXPs, etc. here it does not matter if we OOG, and in fact most (if not all) benchmark tests are written this way.
I believe this test: https://github.com/ethereum/execution-spec-tests/blob/10b02048237f917c8ae33b827ba92fca39d48c4a/tests/benchmark/test_worst_stateful_opcodes.py#L383C13-L383C27 will fail the gas used assumption after execution (when checking receipts)
The other check is a pre-check which check if the total gas limit of the txs matches the benchmark gas. This works in most cases, but not in this scenario (for XEN): https://github.com/ethereum/execution-spec-tests/pull/2101/files#diff-0b867281827e877c1cb1a1351bdcf0d93c8b8aa767c7390fd6fb113bbeae4db1R372 (note: this code is currently still ugly). What this does is it first sends a tx with the gas_benchmark_value (or in this case the block gas limit). But due to it writing massive amounts of storage slots from nonzero to zero we will get the maximum refund of 1/5. So, we now have room for 1/5 of the gas_benchmark_value left in the block and we can insert new txs! (Test inserts in total 5 txs, which have a total gas limit for the currently hardcoded 60M gas of 60 + 60/5+60/25+60/125+60/(125*5) = 74.976M gas (which approaches that the refund is capped at 80% of the tx execution gas, thus we can actually execute 1/0.8 * 60 = 75M gas here in theory)
Seeing the points here: #1945 (comment) I am also convincing Nethermind that they cannot rely on exact gas values, for some benchmarks this is just either hard or impossible to craft this, and the gas value should be read from the block.
However this check itself is a very good assumption. We should allow the measured gas to be slightly different from the expected gas. However if it is above a certain threshold then the benchmark is certainly wrong (and we should throw).
Ok this was a wall of text again but wanted to also give some background on scenarios where this does not hold. This sanity check should however be done, because in some scenarios writing these benchmarks is rather hard if you have to consider all possible gas limits. It is nice to error or at least warn if the execution gas is not between some threshold values.