Refactor Hint Retrieval #295
base: generic_agent_hinter
Conversation
Based on your review schedule, I'll hold off on reviewing this PR until it's marked as ready for review. If you'd like me to take a look now, comment
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
Issues found:
- Static parallel job allocation
- Invalid LLM Model Reference
- In-function NumPy Import
- Ignored Parallel Backend Parameter
- Unsafe Environment Variable Loading
- Unsafe Study Data Loading
- Unhandled dictionary key access
- Undocumented numeric parameters
- Missing Ray Backend Validation
- Missing configuration value documentation
Files scanned
File Path | Reviewed |
---|---|
experiments/generic/run_generic_agent.sh | ✅ |
experiments/hinter/run_hinter_agent.sh | ✅ |
experiments/generic/run_generic_agent.py | ✅ |
experiments/hinter/run_hinter_agent.py | ✅ |
src/agentlab/utils/hinting.py | ✅ |
src/agentlab/agents/generic_agent_hinter/generic_agent.py | ✅ |
src/agentlab/agents/generic_agent_hinter/generic_agent_prompt.py | ✅ |
src/agentlab/agents/tool_use_agent/tool_use_agent.py | ✅ |
```sh
PARALLEL_BACKEND="ray"

N_JOBS=5
```
Static parallel job allocation 
What is the issue?
The script sets a fixed number of parallel jobs (N_JOBS=5) without considering the host system's CPU resources.
Why this matters
Without adapting to available CPU cores, this could lead to either underutilization of system resources or resource contention, impacting overall performance.
Suggested change
Dynamically set `N_JOBS` based on available CPU cores. Add the following before the `N_JOBS` assignment:

```sh
# Use 75% of available CPU cores by default
N_JOBS=$(( $(nproc) * 3 / 4 ))
```
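For a Python entry point, the same adaptive default can be sketched with the standard library alone; `default_n_jobs` and the 75% fraction are illustrative choices, not names from the PR:

```python
import os

def default_n_jobs(fraction: float = 0.75) -> int:
    """Derive a parallel job count from the host's CPU count.

    os.cpu_count() can return None on some platforms, so fall back to 1.
    """
    cpus = os.cpu_count() or 1
    return max(1, int(cpus * fraction))
```

In the shell script itself, a default such as `N_JOBS=${N_JOBS:-$(nproc)}` would achieve a similar effect while still letting the caller override it.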
```sh
BENCHMARK="workarena_l1"

LLM_CONFIG="azure/gpt-5-mini-2025-08-07"
```
Invalid LLM Model Reference 
What is the issue?
The script references a non-existent GPT model with a future date (2025-08-07), which will cause the program to fail.
Why this matters
The program will fail to run as it cannot connect to a model that doesn't exist yet, preventing the experiment from executing.
Suggested change
Replace with an existing GPT model configuration, for example:

```sh
LLM_CONFIG="azure/gpt-4-0613"
```
```python
import numpy as np
rng = np.random.default_rng(42)
rng.shuffle(benchmark.env_args_list)
```
In-function NumPy Import 
What is the issue?
NumPy is imported inside the function rather than at module level, causing unnecessary import overhead on each function call.
Why this matters
Importing modules inside functions creates overhead as Python needs to process the import each time the function is called. This is especially important in performance-critical applications or when the function is called frequently.
Suggested change
Move the NumPy import to the top of the file with the other imports:

```python
import numpy as np
```
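A sketch of the fixed layout, with the import hoisted to module level and the seeded shuffle wrapped in a helper; `shuffled_subset` is a hypothetical name, and the seed 42 and subset size 33 mirror the values in this diff:

```python
import numpy as np  # module-level: resolved once at import time, not per call

def shuffled_subset(items, k, seed=42):
    """Return a reproducible pseudo-random subset of at most k items."""
    out = list(items)  # copy so the caller's list is not mutated
    rng = np.random.default_rng(seed)
    rng.shuffle(out)
    return out[:k]
```

Because the generator is seeded, repeated calls with the same arguments select the same subset, which keeps experiment runs comparable.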
```python
parser.add_argument("--parallel-backend", type=str, default="ray")
parser.add_argument("--reproducibility-mode", action="store_true")

args = parser.parse_args()

# instantiate agent
agent_args = [get_base_agent(args.llm_config)]
benchmark = DEFAULT_BENCHMARKS[args.benchmark]()

##################### Shuffle env args list, pick subset
import numpy as np
rng = np.random.default_rng(42)
rng.shuffle(benchmark.env_args_list)
benchmark.env_args_list = benchmark.env_args_list[:33]
#####################

# for env_args in benchmark.env_args_list:
#     env_args.max_steps = 100

if args.relaunch:
    # relaunch an existing study
    study = Study.load_most_recent(contains=None)
    study.find_incomplete(include_errors=True)

else:
    study = Study(
        agent_args,
        benchmark,
        logging_level=logging.WARNING,
        logging_level_stdout=logging.WARNING,
    )

study.run(
    n_jobs=args.n_jobs,
    parallel_backend="ray",
```
Ignored Parallel Backend Parameter 
What is the issue?
The parallel_backend argument is hardcoded in study.run() despite accepting it as a command-line argument, making the CLI parameter ineffective.
Why this matters
Ignoring the user-specified parallel backend could lead to suboptimal performance if the user has chosen a backend better suited for their specific workload or environment.
Suggested change
Use the command-line argument in the study.run() call:

```python
study.run(
    n_jobs=args.n_jobs,
    parallel_backend=args.parallel_backend,
    strict_reproducibility=args.reproducibility_mode,
    n_relaunch=args.n_relaunch,
)
```
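One way to wire the flag through, sketched with stdlib `argparse`; restricting `choices` to "ray" and "sequential" (the two values the shell scripts in this PR mention; the full set of supported backends is an assumption) rejects typos at parse time:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser()
    # choices= fails fast on an unsupported backend instead of deep in study.run()
    parser.add_argument("--parallel-backend", choices=("ray", "sequential"), default="ray")
    parser.add_argument("--n-jobs", type=int, default=5)
    return parser

args = build_parser().parse_args(["--parallel-backend", "sequential"])
# study.run(n_jobs=args.n_jobs, parallel_backend=args.parallel_backend, ...)
```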
```python
from dotenv import load_dotenv

load_dotenv()
```
Unsafe Environment Variable Loading 
What is the issue?
Unconditional loading of environment variables without error handling or path specification.
Why this matters
If the .env file is missing or inaccessible, the application will continue without environment variables, potentially exposing sensitive configuration or causing runtime errors if required variables are missing.
Suggested change

```python
from dotenv import load_dotenv
import sys

if not load_dotenv():
    print("Error: Failed to load .env file")
    sys.exit(1)
```
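Beyond checking the return value of `load_dotenv()`, a stdlib-only guard can fail fast when required variables are absent; `require_env` and the variable name in the usage comment are hypothetical, not part of this codebase:

```python
import os

def require_env(*names):
    """Return the named environment variables, raising if any are unset or empty.

    Failing fast here gives a clearer error than a KeyError deep inside the run.
    """
    missing = [n for n in names if not os.environ.get(n)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    return {n: os.environ[n] for n in names}

# e.g. creds = require_env("AZURE_API_KEY")  # hypothetical variable name
```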
```python
if args.relaunch:
    # relaunch an existing study
    study = Study.load_most_recent(contains=None)
```
Unsafe Study Data Loading 
What is the issue?
Loading arbitrary most recent study data without validation or access control.
Why this matters
Without proper access control or validation, the code could load sensitive or malicious study data from the filesystem that was placed there by another user.
Suggested change

```python
# Add path validation and access control
study_path = Study.get_most_recent_path(contains=None)
if not is_safe_study_path(study_path):  # implement this function to validate the path
    raise SecurityError("Invalid or unauthorized study path")
study = Study.load_most_recent(contains=None)
```
```python
# instantiate agent
agent_args = [get_base_agent(args.llm_config)]
benchmark = DEFAULT_BENCHMARKS[args.benchmark]()
```
Unhandled dictionary key access 
What is the issue?
Dictionary access of DEFAULT_BENCHMARKS with user input is not wrapped in a try-catch block to handle KeyError exceptions.
Why this matters
If an invalid benchmark name is provided, the program will crash with an uncaught KeyError instead of providing a helpful error message.
Suggested change

```python
try:
    benchmark = DEFAULT_BENCHMARKS[args.benchmark]()
except KeyError:
    print(f"Error: '{args.benchmark}' is not a valid benchmark. Available benchmarks: {list(DEFAULT_BENCHMARKS.keys())}")
    exit(1)
```
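An alternative to the inline try/except is a small lookup helper that reports the valid names on a miss; `lookup_benchmark` and the toy registry below are stand-ins for the PR's `DEFAULT_BENCHMARKS`, used here only to illustrate the pattern:

```python
def lookup_benchmark(registry, name):
    """Look up a benchmark factory, exiting with the valid names on a miss."""
    try:
        return registry[name]
    except KeyError:
        raise SystemExit(
            f"Error: {name!r} is not a valid benchmark. Available: {sorted(registry)}"
        ) from None

# Toy stand-in for DEFAULT_BENCHMARKS.
DEMO_BENCHMARKS = {"workarena_l1": lambda: "l1", "workarena_l2": lambda: "l2"}
```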
```sh
N_JOBS=5
N_RELAUNCH=3
```
Undocumented numeric parameters 
What is the issue?
The numerical configuration values lack explanation of their purpose and constraints.
Why this matters
Without context, it's not clear what these numbers control or what ranges are appropriate.
Suggested change

```sh
# Number of parallel jobs to run (recommended: 1-10)
N_JOBS=5
# Number of retry attempts for failed jobs (recommended: 1-5)
N_RELAUNCH=3
```
```sh
# PARALLEL_BACKEND="sequential"
PARALLEL_BACKEND="ray"
```
Missing Ray Backend Validation 
What is the issue?
The script doesn't validate if Ray is properly installed and initialized before using it as the parallel backend.
Why this matters
Without proper Ray initialization checks, the program may fail at runtime if Ray is not available in the environment.
Suggested change
Add a Ray availability check before running the script:

```sh
# Check if Ray is available
if [ "$PARALLEL_BACKEND" = "ray" ]; then
    python -c "import ray" > /dev/null 2>&1 || { echo "Error: Ray is not installed"; exit 1; }
fi
python experiments/generic/run_generic_agent.py \
```
```sh
BENCHMARK="workarena_l1"

LLM_CONFIG="azure/gpt-5-mini-2025-08-07"
```
Missing configuration value documentation 
What is the issue?
The hardcoded values lack comments explaining what they represent and what valid options are available.
Why this matters
Without documentation, future maintainers won't know what other benchmark types or LLM configurations are valid choices.
Suggested change

```sh
# Benchmark type to run (options: workarena_l1, workarena_l2, etc.)
BENCHMARK="workarena_l1"
# LLM configuration path (format: provider/model-name-version)
LLM_CONFIG="azure/gpt-5-mini-2025-08-07"
```
Description by Korbit AI

What change is being made?
Refactor the hint retrieval system by introducing a `HintsSource` class and updating the generic agent to incorporate hint retrieval through several methods (direct, LLM, and embeddings), along with new scripts for the generic and hinter agents.

Why are these changes being made?
Previously, hint retrieval functionality was scattered and redundant, making it difficult to manage and extend. This refactor consolidates hint retrieval into a dedicated `HintsSource` class, providing a cleaner, reusable, and more maintainable approach. It also supports hint retrieval via different methods, allowing flexible experimentation with mitigating poorly performing hints.