It would be useful to collect system metrics (e.g. latency) during the evaluation and to include a summary of them in the evaluation output.
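As a rough illustration of what this could look like, here is a minimal sketch that times each prediction call and reports a small latency summary alongside the outputs. The names `evaluate_with_latency`, `summarize_latency`, and `predict` are hypothetical and not part of any existing API; the choice of statistics (mean, median, nearest-rank p95, max) is likewise only an assumption about what a useful summary might contain.

```python
import math
import statistics
import time
from typing import Any, Callable, Dict, Iterable, List, Tuple


def summarize_latency(latencies: List[float]) -> Dict[str, float]:
    """Reduce per-call latencies (in seconds) to a few summary statistics."""
    if not latencies:
        return {"count": 0}
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "count": len(ordered),
        "mean_s": statistics.fmean(ordered),
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
        "max_s": ordered[-1],
    }


def evaluate_with_latency(
    predict: Callable[[Any], Any],  # hypothetical: whatever the evaluation calls per example
    examples: Iterable[Any],
) -> Tuple[List[Any], Dict[str, float]]:
    """Run `predict` on each example, recording wall-clock latency per call."""
    outputs: List[Any] = []
    latencies: List[float] = []
    for example in examples:
        start = time.perf_counter()
        outputs.append(predict(example))
        latencies.append(time.perf_counter() - start)
    return outputs, summarize_latency(latencies)
```

The returned summary dictionary could then be merged into whatever structure the evaluation already emits. Other system metrics (memory, throughput) would need their own collection hooks and are not covered by this sketch.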