
Conversation

@Beichen-Ma

As in #530, try to reproduce with scripts/run-qwen3-14B.sh. The script succeeds with no failures.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Beichen-Ma, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a dedicated test script for the Qwen3-14B model, designed to validate its functionality and reproduce a specific issue. The script orchestrates a complex training and evaluation workflow using Ray for distributed execution, configuring numerous parameters for model loading, data processing, performance tuning, and optimization. Its primary impact is to provide a robust and reproducible environment for testing this particular model.

Highlights

  • New Test Script Added: A new bash script, scripts/run-qwen3-14B.sh, has been introduced to facilitate testing and reproduction of issues related to the Qwen3-14B model.
  • Ray and SGLang Integration: The script includes commands to manage Ray and SGLang processes, ensuring a clean execution environment and leveraging Ray for distributed job submission.
  • Comprehensive Model Configuration: Various argument arrays are defined within the script to configure checkpoint paths, rollout parameters, evaluation settings, performance optimizations (e.g., tensor parallelism, recomputation), GRPO, Adam optimizer, and SGLang specific settings for the train.py script.
  • Environment Variable Setup: The script dynamically sets critical environment variables like PYTHONPATH, CUDA_DEVICE_MAX_CONNECTIONS, and NCCL_NVLS_ENABLE based on NVLink detection, ensuring proper runtime configuration.
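
As a rough illustration of the environment setup described in the last highlight, the pattern typically looks like the sketch below (the detection command, the GB/s heuristic, and the PYTHONPATH value are assumptions, not the script's actual code):

# Enable NCCL NVLS collectives only when NVLink is detected on this node.
if nvidia-smi nvlink --status 2>/dev/null | grep -q "GB/s"; then
  export NCCL_NVLS_ENABLE=1
else
  export NCCL_NVLS_ENABLE=0
fi
export CUDA_DEVICE_MAX_CONNECTIONS=1
export PYTHONPATH=/path/to/repo:${PYTHONPATH}   # placeholder repository path
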
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command               Description
Code Review           /gemini review        Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary       Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist   Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help          Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@Beichen-Ma
Author

Hi @zhaochenyang20, I put the script here for reference.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a test script for Qwen3-14B. A critical security concern has been identified: the Ray dashboard and its Job Submission API are exposed to the network without authentication, which could allow unauthorized users to execute arbitrary code on the training cluster. Beyond this, the review also addresses general issues to improve the script's robustness, correctness, and portability, such as correcting an environment variable for Python output buffering, removing redundant cleanup commands, parameterizing hardcoded paths, and fixing a hardcoded IP in the ray job submit command. Minor shell scripting best practice issues, like unquoted variables, were also noted.


# launch the master node of ray in container
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265

Severity: high (security)

The Ray dashboard is configured to listen on all interfaces (0.0.0.0) without authentication via the --dashboard-host=0.0.0.0 flag. The Ray dashboard includes a Job Submission API that allows executing arbitrary code on the cluster. Since Ray does not enable authentication for the dashboard by default, exposing it to the network allows any user with network access to achieve Remote Code Execution (RCE) on the training cluster. It is recommended to bind the dashboard to 127.0.0.1 and use SSH tunneling for remote access, or ensure the dashboard is protected by a firewall. Additionally, the ${MASTER_ADDR} variable should be quoted to prevent word splitting.

Suggested change
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
ray start --head --node-ip-address "${MASTER_ADDR}" --num-gpus 8 --disable-usage-stats --dashboard-host=127.0.0.1 --dashboard-port=8265
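
If remote access to the dashboard is still needed after binding it to 127.0.0.1, an SSH tunnel keeps it off the open network. A usage sketch (user@head-node is a placeholder):

# Forward local port 8265 to the dashboard on the head node, then open
# http://127.0.0.1:8265 in a local browser.
ssh -N -L 8265:127.0.0.1:8265 user@head-node
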

}
}"

ray job submit --address="http://127.0.0.1:8265" \

Severity: high

The ray job submit command uses a hardcoded IP address 127.0.0.1. However, MASTER_ADDR is made configurable earlier in the script for ray start. If MASTER_ADDR is set to something other than 127.0.0.1, this command will fail to connect to the Ray head node. To ensure correctness, you should use the ${MASTER_ADDR} variable here.

Suggested change
ray job submit --address="http://127.0.0.1:8265" \
ray job submit --address="http://${MASTER_ADDR}:8265" \
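
With that change, a single environment variable drives both commands. A usage sketch (10.0.0.5 is a placeholder head-node address, and direct invocation of the script is assumed):

# ray start and ray job submit now resolve to the same host.
MASTER_ADDR=10.0.0.5 bash scripts/run-qwen3-14B.sh
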

Comment on lines +9 to +11
sleep 3
pkill -9 ray
pkill -9 python

Severity: medium

These pkill commands are redundant. The ray stop --force and the first set of pkill commands on lines 6-8 should be sufficient to clean up processes from a previous run. Repeating them is unnecessary.
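
For reference, a single cleanup pass along these lines should suffice before restarting Ray (a sketch only; the script's actual lines 6-8 are not reproduced here):

ray stop --force
pkill -9 ray || true      # pkill exits 1 when nothing matches; || true keeps set -e from aborting
pkill -9 python || true
sleep 3
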

set -ex

# will prevent ray from buffering stdout/stderr
export PYTHONBUFFERED=16

Severity: medium

The environment variable that disables Python's output buffering is PYTHONUNBUFFERED, not PYTHONBUFFERED, and it is typically set to 1. The current setting likely has no effect.

Suggested change
export PYTHONBUFFERED=16
export PYTHONUNBUFFERED=1
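
The difference is easy to observe when stdout is a pipe, which is how Ray captures worker output. An illustration, not part of the PR:

# Block-buffered: "tick" only appears once the process exits after the sleep.
python3 -c 'import time; print("tick", end=""); time.sleep(5)' | cat
# Unbuffered: "tick" appears immediately.
PYTHONUNBUFFERED=1 python3 -c 'import time; print("tick", end=""); time.sleep(5)' | cat
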

Comment on lines +30 to +35
--hf-checkpoint /root/Qwen3-14B
#--hf-checkpoint /root/Qwen3-4B-FP8
--ref-load /root/Qwen3-14B_torch_dist
--load /root/Qwen3-14B_miles/
--save /root/Qwen3-14B_miles/
--save-interval 20

Severity: medium

The paths here are hardcoded, which makes the script less portable. It's a good practice to parameterize these paths using an environment variable for the root directory. You can define a variable like DATA_ROOT=${DATA_ROOT:-/root} at the beginning of the script and use it here. This principle applies to other hardcoded paths in the script as well.

Suggested change
--hf-checkpoint /root/Qwen3-14B
#--hf-checkpoint /root/Qwen3-4B-FP8
--ref-load /root/Qwen3-14B_torch_dist
--load /root/Qwen3-14B_miles/
--save /root/Qwen3-14B_miles/
--save-interval 20
--hf-checkpoint "${DATA_ROOT}/Qwen3-14B"
#--hf-checkpoint "${DATA_ROOT}/Qwen3-4B-FP8"
--ref-load "${DATA_ROOT}/Qwen3-14B_torch_dist"
--load "${DATA_ROOT}/Qwen3-14B_miles/"
--save "${DATA_ROOT}/Qwen3-14B_miles/"
--save-interval 20
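
A minimal sketch of that parameterization near the top of the script (the override path in the commented usage line is a placeholder):

# Default to /root, the value currently hardcoded, but allow callers to override it.
DATA_ROOT=${DATA_ROOT:-/root}

# Example override when checkpoints live elsewhere:
#   DATA_ROOT=/mnt/checkpoints bash scripts/run-qwen3-14B.sh
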
