How to reproduce the results on Terminal-Bench 2.0? #6

@archersama

Hello, I would like to ask how to reproduce the performance of MiniMax-M2.1 on Terminal-Bench 2.0. For the framework, I am using Claude Code within Harbor; for the model, I am using MiniMax-M2.1 served through OpenRouter, with the provider pinned to the official MiniMax provider. I have run it four times, and the results are 32/89 (35.96%), 35/89 (39.33%), 30/89 (33.71%), and 34/89 (38.20%). The average over these four runs is 36.80%. I have also removed the timeout limit. Could you tell me how to achieve a score of 47.9?
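For reference, the per-run pass rates and their mean can be recomputed from the raw pass counts with a short script. This is only a minimal sketch of the arithmetic: the variable and function names are illustrative, and 47.9 is simply the target score mentioned above, not a value taken from any tool's output.

```python
from statistics import mean

# Raw results quoted in this issue: tasks passed out of 89, one entry per run.
TOTAL_TASKS = 89
passed_per_run = [32, 35, 30, 34]

# Target score mentioned in the issue (percent).
TARGET_SCORE = 47.9


def pass_rates(passed, total):
    """Return the pass rate of each run as a percentage."""
    return [100.0 * p / total for p in passed]


rates = pass_rates(passed_per_run, TOTAL_TASKS)
average = mean(rates)
gap = TARGET_SCORE - average

for run_index, rate in enumerate(rates, start=1):
    print(f"run {run_index}: {rate:.2f}%")
print(f"mean over {len(rates)} runs: {average:.2f}%")
print(f"gap to target: {gap:.2f} points")
```

Computed this way, the mean over the four runs comes out at about 36.80%, roughly 11.1 points below 47.9.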
