feat: Why does evaluation run significantly slower than training when using the same data and the same rollout size?

### Describe the feature

When evaluation and training use the same data and the same rollout size, evaluation runs significantly slower than training. What causes the difference?
