Skip to content

Commit cd8efdb

Browse files
fix: Lower resolved_threshold default from 0.8 to 0.0 for dead code benchmarks
The 80% precision AND recall gate meant every task showed "Resolved: False" for both MCP and baseline agents. No dead code detection approach achieves 80% on both metrics simultaneously. Setting to 0.0 means any task with non-zero P and R counts as resolved. Still configurable via config YAML.
1 parent e35a179 commit cd8efdb

File tree

3 files changed

+3
-3
lines changed

3 files changed

+3
-3
lines changed

src/mcpbr/benchmarks/deadcode.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -145,7 +145,7 @@ def __init__(
145145
self,
146146
dataset: str | Path = "",
147147
corpus_path: str | Path | None = None,
148-
resolved_threshold: float = 0.8,
148+
resolved_threshold: float = 0.0,
149149
):
150150
"""Initialize the benchmark.
151151

src/mcpbr/benchmarks/supermodel/benchmark.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -50,7 +50,7 @@ def __init__(
5050
tasks: list[dict[str, Any]] | None = None,
5151
supermodel_api_base: str = "https://api.supermodel.dev",
5252
supermodel_api_key: str | None = None,
53-
resolved_threshold: float = 0.8,
53+
resolved_threshold: float = 0.0,
5454
ground_truth_dir: str | Path | None = None,
5555
supermodel_api_timeout: int = 900,
5656
**kwargs: Any,

src/mcpbr/config.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -901,7 +901,7 @@ def validate_thinking_budget(cls, v: int | None) -> int | None:
901901
)
902902

903903
resolved_threshold: float = Field(
904-
default=0.8,
904+
default=0.0,
905905
ge=0.0,
906906
le=1.0,
907907
description="Recall threshold to consider a task resolved (must be in [0.0, 1.0])",

0 commit comments

Comments
 (0)