Today EvalAlgorithmInterface.evaluate is typed to return List[EvalOutput] ("for dataset(s)", per the docstring), but its dataset_config argument only accepts Optional[DataConfig].
It seems like most concrete eval algorithms (like QAAccuracy here) either take the user's data_config for a single dataset, or take all the pre-defined DATASET_CONFIGS relevant to the evaluator's problem type.
So the internal logic of the evaluators already supports processing multiple datasets and returning multiple results, but users are prevented from calling evaluate() with more than one of their own datasets, for no apparent reason?
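To make the request concrete, here is a minimal sketch of what a widened signature could look like. It is not the library's code: `DataConfig`, `EvalOutput` and `DATASET_CONFIGS` below are simplified stand-ins, and `resolve_dataset_configs` is a hypothetical helper name introduced only for illustration. The idea is that `dataset_config` accepts one config, a sequence of configs, or `None`, and is normalized to a list before the existing per-dataset loop runs.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Union


@dataclass
class DataConfig:
    """Simplified stand-in for the library's DataConfig (assumption)."""
    dataset_name: str


@dataclass
class EvalOutput:
    """Simplified stand-in for the library's EvalOutput (assumption)."""
    eval_name: str
    dataset_name: str


# Stand-in for the built-in datasets of one problem type (assumption).
DATASET_CONFIGS = {
    "builtin_a": DataConfig("builtin_a"),
    "builtin_b": DataConfig("builtin_b"),
}


def resolve_dataset_configs(
    dataset_config: Optional[Union[DataConfig, Sequence[DataConfig]]],
) -> List[DataConfig]:
    """Hypothetical helper: normalize the argument to a list of configs."""
    if dataset_config is None:
        # No user data: fall back to all built-in datasets.
        return list(DATASET_CONFIGS.values())
    if isinstance(dataset_config, DataConfig):
        # Single user dataset: the case supported today.
        return [dataset_config]
    # Several user datasets: the case this issue asks for.
    return list(dataset_config)


def evaluate(
    dataset_config: Optional[Union[DataConfig, Sequence[DataConfig]]] = None,
) -> List[EvalOutput]:
    """Return one EvalOutput per resolved dataset."""
    return [
        EvalOutput(eval_name="qa_accuracy", dataset_name=config.dataset_name)
        for config in resolve_dataset_configs(dataset_config)
    ]
```

With something along these lines, `evaluate([DataConfig("mine_1"), DataConfig("mine_2")])` would return two results, while existing calls that pass a single `DataConfig` or nothing would keep working unchanged.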