Qwen3-14B Test Script #532
base: main
Summary of Changes
Hello @Beichen-Ma, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a dedicated test script for the Qwen3-14B model, designed to validate its functionality and reproduce a specific issue. The script orchestrates a complex training and evaluation workflow using Ray for distributed execution, configuring numerous parameters for model loading, data processing, performance tuning, and optimization. Its primary impact is to provide a robust and reproducible environment for testing this particular model.
Hi @zhaochenyang20, I put the script here for reference. |
Code Review
This pull request introduces a test script for Qwen3-14B. A critical security concern has been identified: the Ray dashboard and its Job Submission API are exposed to the network without authentication, which could allow unauthorized users to execute arbitrary code on the training cluster. Beyond this, the review also addresses general issues to improve the script's robustness, correctness, and portability, such as correcting an environment variable for Python output buffering, removing redundant cleanup commands, parameterizing hardcoded paths, and fixing a hardcoded IP in the ray job submit command. Minor shell scripting best practice issues, like unquoted variables, were also noted.
# launch the master node of ray in container
export MASTER_ADDR=${MASTER_ADDR:-"127.0.0.1"}
ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
The Ray dashboard is configured to listen on all interfaces (0.0.0.0) without authentication via the --dashboard-host=0.0.0.0 flag. The Ray dashboard includes a Job Submission API that allows executing arbitrary code on the cluster. Since Ray does not enable authentication for the dashboard by default, exposing it to the network allows any user with network access to achieve Remote Code Execution (RCE) on the training cluster. It is recommended to bind the dashboard to 127.0.0.1 and use SSH tunneling for remote access, or ensure the dashboard is protected by a firewall. Additionally, the ${MASTER_ADDR} variable should be quoted to prevent word splitting.
- ray start --head --node-ip-address ${MASTER_ADDR} --num-gpus 8 --disable-usage-stats --dashboard-host=0.0.0.0 --dashboard-port=8265
+ ray start --head --node-ip-address "${MASTER_ADDR}" --num-gpus 8 --disable-usage-stats --dashboard-host=127.0.0.1 --dashboard-port=8265
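As a rough illustration of the SSH-tunneling option mentioned in the comment above, the sketch below builds a port-forwarding command for a dashboard that is bound to 127.0.0.1 on the head node. `REMOTE_USER`, `HEAD_NODE`, `DASHBOARD_PORT`, and `tunnel_cmd` are hypothetical names chosen for this example; they are not part of the script under review.

```shell
# Hypothetical defaults: replace with your own login and head-node host.
REMOTE_USER="${REMOTE_USER:-user}"
HEAD_NODE="${HEAD_NODE:-head-node.example.com}"
DASHBOARD_PORT="${DASHBOARD_PORT:-8265}"

# Print an ssh command that forwards a local port to the dashboard,
# which listens only on the head node's loopback interface.
tunnel_cmd() {
  printf 'ssh -N -L %s:127.0.0.1:%s %s@%s\n' \
    "$DASHBOARD_PORT" "$DASHBOARD_PORT" "$REMOTE_USER" "$HEAD_NODE"
}

# Show the command for inspection before running it by hand.
tunnel_cmd
```

Once the printed command is running on a workstation, the dashboard should be reachable at http://127.0.0.1:8265 locally while staying unexposed on the cluster network.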
}
}"

ray job submit --address="http://127.0.0.1:8265" \
The ray job submit command uses a hardcoded IP address 127.0.0.1. However, MASTER_ADDR is made configurable earlier in the script for ray start. If MASTER_ADDR is set to something other than 127.0.0.1, this command will fail to connect to the Ray head node. To ensure correctness, you should use the ${MASTER_ADDR} variable here.
- ray job submit --address="http://127.0.0.1:8265" \
+ ray job submit --address="http://${MASTER_ADDR}:8265" \
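One way to keep the start and submit steps consistent is to derive the job address from a single place. The sketch below is a minimal example under that assumption; `DASHBOARD_PORT` and `job_address` are hypothetical names, while `RAY_ADDRESS` is the environment variable the Ray Jobs CLI reads when `--address` is omitted. Note that if the dashboard is bound to 127.0.0.1, it is reachable only from the head node itself, so a non-loopback `MASTER_ADDR` would also need a matching `--dashboard-host` or a tunnel.

```shell
# Same default as the script under review.
MASTER_ADDR="${MASTER_ADDR:-127.0.0.1}"
# Hypothetical variable; the original script hardcodes 8265.
DASHBOARD_PORT="${DASHBOARD_PORT:-8265}"

# Build the HTTP address of the job submission endpoint.
job_address() {
  printf 'http://%s:%s\n' "$MASTER_ADDR" "$DASHBOARD_PORT"
}

# Export it so later `ray job submit` calls pick it up without --address.
export RAY_ADDRESS="$(job_address)"
echo "Submitting jobs to ${RAY_ADDRESS}"
```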
sleep 3
pkill -9 ray
pkill -9 python
set -ex

# will prevent ray from buffering stdout/stderr
export PYTHONBUFFERED=16
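The review summary mentions correcting the environment variable for Python output buffering. The variable that Python actually reads is `PYTHONUNBUFFERED`, and any non-empty value enables unbuffered stdout/stderr. A minimal sketch of what the corrected line could look like, assuming that is the intended fix:

```shell
# Python checks PYTHONUNBUFFERED (note the "UN"); PYTHONBUFFERED is ignored.
export PYTHONUNBUFFERED=1

# Optional sanity check, skipped when python3 is not on PATH.
if command -v python3 >/dev/null 2>&1; then
  python3 -c 'import sys; print("unbuffered flag:", sys.flags.unbuffered)'
fi
```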
--hf-checkpoint /root/Qwen3-14B
#--hf-checkpoint /root/Qwen3-4B-FP8
--ref-load /root/Qwen3-14B_torch_dist
--load /root/Qwen3-14B_miles/
--save /root/Qwen3-14B_miles/
--save-interval 20
The paths here are hardcoded, which makes the script less portable. It's a good practice to parameterize these paths using an environment variable for the root directory. You can define a variable like DATA_ROOT=${DATA_ROOT:-/root} at the beginning of the script and use it here. This principle applies to other hardcoded paths in the script as well.
- --hf-checkpoint /root/Qwen3-14B
- #--hf-checkpoint /root/Qwen3-4B-FP8
- --ref-load /root/Qwen3-14B_torch_dist
- --load /root/Qwen3-14B_miles/
- --save /root/Qwen3-14B_miles/
- --save-interval 20
+ --hf-checkpoint "${DATA_ROOT}/Qwen3-14B"
+ #--hf-checkpoint "${DATA_ROOT}/Qwen3-4B-FP8"
+ --ref-load "${DATA_ROOT}/Qwen3-14B_torch_dist"
+ --load "${DATA_ROOT}/Qwen3-14B_miles/"
+ --save "${DATA_ROOT}/Qwen3-14B_miles/"
+ --save-interval 20
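To go with the `DATA_ROOT` suggestion, the sketch below shows one possible way to define the root once and check that the input directories exist before launching anything. `HF_CHECKPOINT`, `REF_LOAD`, `SAVE_DIR`, and `check_inputs` are hypothetical names for this example; the original script passes the paths inline.

```shell
# Root directory for checkpoints; defaults to the path the script hardcodes.
DATA_ROOT="${DATA_ROOT:-/root}"

# Hypothetical helper variables derived from the single root.
HF_CHECKPOINT="${DATA_ROOT}/Qwen3-14B"
REF_LOAD="${DATA_ROOT}/Qwen3-14B_torch_dist"
SAVE_DIR="${DATA_ROOT}/Qwen3-14B_miles/"

# Return non-zero and report on stderr if a required input path is missing.
check_inputs() {
  missing=0
  for p in "$HF_CHECKPOINT" "$REF_LOAD"; do
    if [ ! -e "$p" ]; then
      echo "missing: $p" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A script could then call `check_inputs || exit 1` before `ray start`, so a wrong `DATA_ROOT` fails early instead of partway through the job.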
Following #530, I tried to reproduce the issue with scripts/run-qwen3-14B.sh. The script ran to completion without any failure.