Optimize javm interpreter performance #400

## Goal

Improve the performance of the javm interpreter. The interpreter is the reference backend: it must produce the same results and gas consumption as the recompiler. This is an open-ended issue for incremental optimizations.

## Architecture (capability-javm-v2)

The interpreter lives in `crates/javm/src/interpreter/mod.rs`. It uses a pre-decoded instruction array for fast dispatch:

- **Pre-decode** (`predecode_instructions`): raw PVM bytecode → flat `Vec<DecodedInst>` with resolved branch targets, pre-computed gas block costs, and flattened operands (ra, rb, rd, imm1, imm2)
- **Execution** (`run_segment`): sequential instruction dispatch via match on `DecodedInst.opcode`, with pre-resolved `next_idx` / `target_idx` for branches
- **Gas metering**: per-gas-block charge at block entry (same pipeline model as recompiler)
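
To make the flow above concrete, here is a minimal sketch of a pre-decoded dispatch loop with per-block gas charging. It is not the actual javm code: the four-opcode `Op` enum, the cut-down `DecodedInst` fields, the `Exit` enum, the 13-entry register file, and the convention that `block_gas` is non-zero only on the first instruction of a gas block are all assumptions made for illustration.

```rust
// Illustrative sketch only: names, fields and opcodes are hypothetical,
// not the actual javm definitions.

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    LoadImm,  // regs[rd] = imm1
    Add,      // regs[rd] = regs[ra] + regs[rb] (wrapping)
    BranchNe, // jump to target_idx if regs[ra] != regs[rb]
    Halt,
}

#[derive(Clone, Copy, Debug)]
pub struct DecodedInst {
    pub opcode: Op,
    pub ra: u8,
    pub rb: u8,
    pub rd: u8,
    pub imm1: u64,
    /// Cost of the whole gas block (assumed non-zero only on its first instruction).
    pub block_gas: u32,
    pub next_idx: u32,
    pub target_idx: u32,
}

#[derive(Debug, PartialEq)]
pub enum Exit {
    Halt,
    OutOfGas,
}

pub fn run_segment(prog: &[DecodedInst], regs: &mut [u64; 13], gas: &mut i64) -> Exit {
    let mut idx = 0usize;
    loop {
        let inst = &prog[idx];
        // Charge the whole block once, at block entry.
        if inst.block_gas != 0 {
            *gas -= i64::from(inst.block_gas);
            if *gas < 0 {
                return Exit::OutOfGas;
            }
        }
        let (ra, rb, rd) = (inst.ra as usize, inst.rb as usize, inst.rd as usize);
        // Branch targets were resolved at pre-decode time, so dispatch
        // only has to pick between two ready-made indices.
        let next = match inst.opcode {
            Op::LoadImm => { regs[rd] = inst.imm1; inst.next_idx }
            Op::Add => { regs[rd] = regs[ra].wrapping_add(regs[rb]); inst.next_idx }
            Op::BranchNe => if regs[ra] != regs[rb] { inst.target_idx } else { inst.next_idx },
            Op::Halt => return Exit::Halt,
        };
        idx = next as usize;
    }
}

fn inst(opcode: Op, ra: u8, rb: u8, rd: u8, imm1: u64,
        block_gas: u32, next_idx: u32, target_idx: u32) -> DecodedInst {
    DecodedInst { opcode, ra, rb, rd, imm1, block_gas, next_idx, target_idx }
}

/// Counts r0 down from 3 to 0. Three gas blocks: setup (3), loop body (2), halt (1).
pub fn countdown_program() -> Vec<DecodedInst> {
    vec![
        inst(Op::LoadImm, 0, 0, 0, 3, 3, 1, 0),
        inst(Op::LoadImm, 0, 0, 1, 0, 0, 2, 0),
        inst(Op::LoadImm, 0, 0, 2, u64::MAX, 0, 3, 0), // r2 = -1
        inst(Op::Add, 0, 2, 0, 0, 2, 4, 0),            // r0 += r2
        inst(Op::BranchNe, 0, 1, 0, 0, 0, 5, 3),       // loop while r0 != r1
        inst(Op::Halt, 0, 0, 0, 0, 1, 0, 0),
    ]
}
```

In this toy program the loop body runs three times, so a full run charges 3 + 3×2 + 1 = 10 gas regardless of how many instructions each block contains.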

### Key types
- `DecodedInst`: pre-decoded instruction with opcode, flat operands, gas cost, pre-resolved next/target indices
- `Interpreter`: PVM state (registers, PC, gas, code, memory pointer, basic_block_starts)
- `InterpreterProgram`: pre-decoded program (instructions, pc_to_idx mapping)

### Multi-VM context
The interpreter runs within the same kernel as the recompiler. VM switching (CALL/REPLY) is handled by the kernel — the interpreter just executes segments and returns exit reasons. The kernel selects interpreter vs recompiler via `PvmBackend` / `GREY_PVM` env var.

## Benchmark suite

```bash
# Single-VM workloads (interpreter columns)
cargo bench -p grey-bench --bench pvm_bench -- 'interpreter'

# Multi-VM workload
GREY_PVM=interpreter cargo bench -p grey-bench --bench subvm_bench

# Compare interpreter vs recompiler
cargo bench -p grey-bench --bench pvm_bench

# Full comparison including polkavm
POLKAVM_ALLOW_EXPERIMENTAL=1 POLKAVM_DEFAULT_COST_MODEL=full-l1-hit cargo bench -p grey-bench
```

## Optimization areas

**Dispatch overhead:**
- Current: match-based dispatch on `DecodedInst.opcode`
- Threaded/computed-goto dispatch (requires unsafe + function pointer table)
- Profile-guided optimization of opcode ordering in the match
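
For comparison with the current match, a handler-table variant could look roughly like the sketch below. Everything here (`Vm`, `Inst`, the three opcodes and their numbering) is hypothetical. Rust has no computed goto, so this is table dispatch through function pointers rather than true threaded code. It uses safe indexing, and an unchecked lookup would be where `unsafe` comes in.

```rust
// Hypothetical types for illustration; not the javm structures.
pub struct Vm {
    pub regs: [u64; 13],
    pub idx: usize,
    pub halted: bool,
}

#[derive(Clone, Copy)]
pub struct Inst {
    pub opcode: u8, // index into HANDLERS
    pub rd: u8,
    pub ra: u8,
    pub imm: u64,
}

type Handler = fn(&mut Vm, &Inst);

fn op_halt(vm: &mut Vm, _inst: &Inst) {
    vm.halted = true;
}

fn op_load_imm(vm: &mut Vm, inst: &Inst) {
    vm.regs[inst.rd as usize] = inst.imm;
    vm.idx += 1;
}

fn op_add_imm(vm: &mut Vm, inst: &Inst) {
    vm.regs[inst.rd as usize] = vm.regs[inst.ra as usize].wrapping_add(inst.imm);
    vm.idx += 1;
}

// One entry per opcode; the opcode byte is the table index.
static HANDLERS: [Handler; 3] = [op_halt, op_load_imm, op_add_imm];

pub fn run(vm: &mut Vm, prog: &[Inst]) {
    while !vm.halted {
        let inst = &prog[vm.idx];
        // Indirect call instead of a match; bounds-checked in this sketch.
        HANDLERS[inst.opcode as usize](vm, inst);
    }
}
```

Whether this beats the match is not a given: the compiler often lowers a dense match to a jump table already, and the indirect call blocks inlining of handlers. It only stays if the benchmarks say so.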

**Pre-decode improvements:**
- Instruction fusion during pre-decode (e.g., load_imm + add → add_imm)
- Specialized fast paths for common instruction sequences
- Pack `DecodedInst` tighter (currently ~56 bytes per instruction — cache pressure)
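
As a starting point for the packing item, the sketch below shows one possible 32-byte layout, so that two instructions share a 64-byte cache line. The field widths are assumptions that would need checking against the ISA and real programs: it assumes `imm2`, the instruction indices, and per-block gas costs all fit in 32 bits, and that the opcode fits in one byte. `PackedInst` is a hypothetical name.

```rust
use std::mem::size_of;

/// Hypothetical tighter layout; widest fields first so `repr(C)` adds
/// only tail padding.
#[repr(C)]
#[derive(Clone, Copy, Debug, Default)]
pub struct PackedInst {
    pub imm1: u64,       // offset 0
    pub imm2: u32,       // offset 8  (assumes imm2 fits in 32 bits)
    pub next_idx: u32,   // offset 12
    pub target_idx: u32, // offset 16
    pub gas_cost: u32,   // offset 20
    pub opcode: u8,      // offset 24
    pub ra: u8,          // offset 25
    pub rb: u8,          // offset 26
    pub rd: u8,          // offset 27, then 4 bytes of tail padding
}

// Fail the build if a later field change silently grows the struct.
const _: () = assert!(size_of::<PackedInst>() == 32);
```

Keeping the compile-time size assertion next to the struct makes any accidental regression in instruction size show up immediately instead of as a slow drift in the benchmarks.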

**Memory access:**
- Current: bounds-checked via `read_u8`/`write_u8` etc. with `% (1u64 << 32)` masking
- Consider batch bounds checking or page-table-based dispatch
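
One reading of "batch bounds checking" is to validate a multi-byte access once instead of once per byte. The sketch below does that for a 32-bit load and store over a flat buffer. `Memory` and its layout are hypothetical stand-ins (the real interpreter holds a memory pointer), little-endian byte order is assumed, and page permissions are ignored.

```rust
use std::convert::TryInto;

/// Hypothetical flat guest memory; not the javm memory model.
pub struct Memory {
    bytes: Vec<u8>,
}

impl Memory {
    pub fn new(size: usize) -> Self {
        Memory { bytes: vec![0; size] }
    }

    /// Reads 4 bytes with a single range check instead of four `read_u8` calls.
    pub fn read_u32(&self, addr: u64) -> Option<u32> {
        // Same 32-bit address wrap as the existing accessors.
        let start = (addr % (1u64 << 32)) as usize;
        let end = start.checked_add(4)?;
        let slice = self.bytes.get(start..end)?; // one bounds check for the whole access
        Some(u32::from_le_bytes(slice.try_into().ok()?))
    }

    /// Writes 4 bytes with a single range check.
    pub fn write_u32(&mut self, addr: u64, value: u32) -> Option<()> {
        let start = (addr % (1u64 << 32)) as usize;
        let end = start.checked_add(4)?;
        self.bytes
            .get_mut(start..end)?
            .copy_from_slice(&value.to_le_bytes());
        Some(())
    }
}
```

An access that straddles the end of the buffer fails as a whole, which may or may not match the fault semantics of the current byte-wise path. That needs checking before adopting anything like this.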

**Gas metering:**
- Gas block costs are pre-computed — charge is a single subtract + sign check per block
- Investigate whether the branch on negative gas is a significant branch misprediction source
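
For reference when profiling, the "subtract + sign check" shape is sketched below, with the out-of-gas path marked `#[cold]` so the compiler lays out the fall-through as the hot path. `charge_block` and `OutOfGas` are hypothetical names, gas is assumed to be a signed 64-bit counter, and leaving the counter negative on failure is an assumption, not a statement about what the interpreter does.

```rust
#[derive(Debug, PartialEq)]
pub struct OutOfGas;

/// Kept out of line so the common path stays small.
#[cold]
#[inline(never)]
fn out_of_gas() -> Result<(), OutOfGas> {
    Err(OutOfGas)
}

/// Charges a pre-computed block cost: one subtract, one sign check.
#[inline(always)]
pub fn charge_block(gas: &mut i64, block_cost: u32) -> Result<(), OutOfGas> {
    *gas -= i64::from(block_cost);
    if *gas < 0 {
        return out_of_gas();
    }
    Ok(())
}
```

Since the branch is almost never taken, it should predict well. A profile with branch-miss counters would confirm or refute that before any restructuring.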

**Pre-decode cost:**
- Pre-decode runs once per `InterpreterProgram::predecode()` call
- For short-lived programs, this is a significant fraction of total time
- Consider lazy pre-decode (decode on first execution of each block)
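
A rough sketch of the lazy variant: keep one slot per basic block and decode a block the first time it is entered. `LazyProgram`, its fields, and the one-byte-per-instruction "decoder" are placeholders for illustration. The real decoder would also have to resolve branch targets into blocks that may not be decoded yet, which this sketch does not address.

```rust
/// Placeholder decoded form; the real `DecodedInst` carries far more.
#[derive(Debug, Clone, PartialEq)]
pub struct DecodedInst {
    pub opcode: u8,
}

/// Hypothetical lazily decoded program.
pub struct LazyProgram {
    code: Vec<u8>,
    /// Sorted byte offsets at which basic blocks start.
    block_starts: Vec<usize>,
    /// `None` until the block is first executed.
    blocks: Vec<Option<Vec<DecodedInst>>>,
    /// Number of blocks decoded so far (for the demo and for measurement).
    pub decode_count: usize,
}

impl LazyProgram {
    pub fn new(code: Vec<u8>, block_starts: Vec<usize>) -> Self {
        let blocks = vec![None; block_starts.len()];
        LazyProgram { code, block_starts, blocks, decode_count: 0 }
    }

    /// Returns block `i`, decoding it on first use only.
    pub fn block(&mut self, i: usize) -> &[DecodedInst] {
        if self.blocks[i].is_none() {
            let start = self.block_starts[i];
            let end = self
                .block_starts
                .get(i + 1)
                .copied()
                .unwrap_or(self.code.len());
            // Stand-in decoder: treats every byte as one instruction.
            let decoded = self.code[start..end]
                .iter()
                .map(|&b| DecodedInst { opcode: b })
                .collect();
            self.decode_count += 1;
            self.blocks[i] = Some(decoded);
        }
        self.blocks[i].as_deref().unwrap()
    }
}
```

The trade-off is an extra "is it decoded yet" check at every block entry, so this only pays off if short-lived programs really do leave most blocks untouched. That is worth measuring first.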

## Rules

- Always benchmark before AND after. Use criterion's built-in comparison.
- If a change shows no measurable improvement or regresses, revert it.
- The interpreter must produce results and gas consumption **identical** to the recompiler's; `cargo test -p grey-bench` verifies this.
- Do not use `polkavm` or `polkavm-common` crates — implement from first principles.
