diff --git a/.cursor/rules/docs-assist-merge-conflict-resolution.mdc b/.cursor/rules/docs-assist-merge-conflict-resolution.mdc new file mode 100644 index 00000000..3a5aebe0 --- /dev/null +++ b/.cursor/rules/docs-assist-merge-conflict-resolution.mdc @@ -0,0 +1,71 @@ +--- +description: Resolve the merge conflicts using a "manual rebase" approach +globs: +alwaysApply: false +--- +# LLM-Assisted Merge Conflict Resolution + +When documentation branches fall behind main, use this "smart rebase" approach with LLM assistance to resolve conflicts safely and accurately. + +## Prerequisites +- Identify the current branch name (`user/example-branch`) that contains your changes +- Ensure you have no uncommitted changes in your working directory + +## Resolution Steps + +1. Create a backup of your current branch: + ```bash + git branch backup/user-branch-$(date +%Y-%m-%d) user/example-branch + ``` + +2. Get the latest main branch: + ```bash + git checkout main + git pull origin main + ``` + +3. Create a new resolution branch: + ```bash + git checkout -b conflict-resolution/example-branch + ``` + +4. Bring in changes from your original branch: + ```bash + git merge --squash user/example-branch + ``` + +5. LLM-Assisted Conflict Resolution: + a. For each conflict marker (`<<<<<<<`), the LLM will: + - Analyze and explain the conflict context + - Show the semantic differences between versions + - Provide a recommended resolution with rationale + - Wait for user approval before proceeding + + b. After each resolution: + - LLM verifies the resolved content maintains technical accuracy + - LLM checks for documentation consistency + - LLM ensures cross-references remain valid + +6. Final Validation: + - LLM performs comprehensive review of all resolved files + - Verifies all conflict markers are removed + - Checks documentation structure remains intact + - Ensures all technical content is accurate + - Validates all internal references and links + +7. 
Commit the resolved changes: + ```bash + git add . + git commit -m "Resolve conflicts from user/example-branch + + - List major conflicts resolved + - Note any significant decisions made + - Reference relevant documentation updates" + ``` + +## Post-Resolution +You now have a clean branch based on latest main with your changes properly integrated. The backup branch can be deleted once you've verified everything is correct: +```bash +git branch -D backup/user-branch-YYYY-MM-DD +``` + diff --git a/.cursor/rules/docs-bump-version.mdc b/.cursor/rules/docs-bump-version.mdc new file mode 100644 index 00000000..ddef7bd5 --- /dev/null +++ b/.cursor/rules/docs-bump-version.mdc @@ -0,0 +1,13 @@ +--- +description: Version Bump Instructions for Docs Publishing +globs: +alwaysApply: false +--- + +1. Update the [versions1.json](mdc:docs/versions1.json) file with the user's provided version by adding a new entry at the top and updating preferred to false for the previous entry. +2. Update the [repo.toml](mdc:repo.toml) `version` to the latest version provided by the user. +3. Create a tag for the latest commit on the `main` branch in the format of `git tag docs-v{}.{}.{}`. +4. Push the tag. +5. Recap everything you did to prepare for the release. + +If a user asks you to bump the version but hasn't provided a full version number, ask for clarification on the version number. \ No newline at end of file diff --git a/.cursor/rules/docs-check-source.mdc b/.cursor/rules/docs-check-source.mdc new file mode 100644 index 00000000..9fa26658 --- /dev/null +++ b/.cursor/rules/docs-check-source.mdc @@ -0,0 +1,8 @@ +--- +description: Where to find source code for drafting details. +globs: +alwaysApply: false +--- +- You can find the source code used to draft docs in the project's main source directory. + +- Make sure the main branch is up to date when verifying this information, as that represents the current development state. 
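The `versions1.json` update described in the version-bump rule can be sketched in code. This is a hedged illustration only: the `version` and `preferred` field names below are assumptions, since the diff doesn't show the file's actual schema.

```python
import json

def bump_versions(versions, new_version):
    """Add a new entry at the top and demote every earlier entry to preferred=False.

    Assumes entries shaped like {"version": "1.2.3", "preferred": bool};
    the real versions1.json schema may differ.
    """
    demoted = [dict(entry, preferred=False) for entry in versions]
    return [{"version": new_version, "preferred": True}] + demoted

current = [
    {"version": "0.2.0", "preferred": True},
    {"version": "0.1.0", "preferred": False},
]
print(json.dumps(bump_versions(current, "0.3.0"), indent=2))
```

A check like this could also run before tagging, to confirm exactly one entry remains `preferred` after a bump.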
\ No newline at end of file diff --git a/.cursor/rules/docs-info-verification.mdc b/.cursor/rules/docs-info-verification.mdc new file mode 100644 index 00000000..67311fdd --- /dev/null +++ b/.cursor/rules/docs-info-verification.mdc @@ -0,0 +1,283 @@ +--- +description: Guidelines for verifying technical documentation claims against source code to ensure accuracy between documentation and implementation. +globs: +alwaysApply: false +--- +# NeMo Run Documentation Verification Guidelines + +These guidelines ensure technical documentation accurately reflects the actual NeMo Run project implementation and maintains the highest standards of factual accuracy. Follow these verification procedures when writing or updating documentation to prevent inaccuracies that could mislead users. + +## Editorial Accuracy Principles + +### No Speculation or Extrapolation +- **Document only verified functionality**: Never document presumed behavior or "likely" features without verifying against source code and `pyproject.toml` +- **Avoid assumptions**: Don't make assumptions about unreleased functionality or implementation details +- **Source everything**: If information cannot be verified against the codebase, acknowledge the gap rather than guessing + +### Handling Uncertainty and Missing Information +- **Use clear qualifiers**: When certainty is limited, use phrases like "as of version X..." or "in the current implementation..." +- **Mark experimental features**: Explicitly state when documentation covers experimental or beta features +- **Document gaps clearly**: When information is missing, state this clearly and direct readers to source code or examples +- **Use appropriate warnings**: Include admonitions for important caveats: + ```markdown + :::{warning} + This feature is experimental and may change in future releases without notice. 
+ ::: + ``` + +### Performance and Benchmarking Claims +- **Require evidence**: Back all performance statements with specific, reproducible benchmarks from `test/` or `examples/` +- **Include methodology**: Document test environment details, hardware specifications, and dataset characteristics +- **Use precise language**: Avoid vague terms like "fast" or "efficient" - provide quantifiable metrics +- **Comparative claims**: Ensure performance comparisons are fair, accurate, and based on equivalent test conditions + +## NeMo Run Project Configuration Verification + +### pyproject.toml Validation +**Critical Rule**: Always check `pyproject.toml` for project configuration, dependencies, and CLI commands before documenting. + +- **Console Scripts**: Verify all documented CLI commands exist in `[project.scripts]` section: + - `nemorun` = `nemo_run.__main__:app` + - `nemo` = `nemo_run.__main__:app` +- **Dependencies**: Check that documented dependencies match `[project.dependencies]` and optional dependencies: + - Core: `inquirerpy`, `catalogue`, `fabric`, `fiddle`, `torchx`, `typer`, `rich`, `jinja2`, `cryptography`, `networkx`, `omegaconf`, `leptonai`, `packaging`, `toml` + - Optional: `skypilot`, `ray` (with `kubernetes`) +- **Project Metadata**: Verify project name (`nemo_run`), version, and description match documentation +- **Entry Points**: Confirm all documented entry points are properly registered in `[project.entry-points."torchx.schedulers"]` + +### NeMo Run CLI Command Verification Process +1. **Check pyproject.toml**: Look in `[project.scripts]` for console script definitions (`nemorun` and `nemo`) +2. **Verify Script Paths**: Ensure the script paths point to `nemo_run.__main__:app` +3. **Test CLI Commands**: Run documented CLI commands to ensure they work with the Typer-based CLI system +4. **Validate Arguments**: Check script argument parsing matches documented examples in `nemo_run/cli/api.py` +5. 
**Cross-Reference**: Verify CLI examples work with the actual script implementations in `nemo_run/cli/` + +### NeMo Run CLI Validation Checks +- ✅ All documented CLI commands exist in `[project.scripts]` section of `pyproject.toml` +- ✅ Script paths in `pyproject.toml` point to `nemo_run.__main__:app` +- ✅ CLI argument examples match Typer argument parsing in `nemo_run/cli/api.py` +- ✅ CLI commands execute without errors with documented arguments +- ✅ Console script entry points are properly formatted +- ✅ Entrypoint decorators in `nemo_run/cli/api.py` match documented functionality + +## NeMo Run Code Example Verification + +### Source Code Validation +**Primary Rule**: All code examples must be validated against the actual implementation in the `nemo_run/` source directory before publication. + +- **Import Statements**: Verify all imports reference actual modules in the `nemo_run/` package +- **Class/Function Signatures**: Check that documented method signatures match source code exactly +- **Parameter Types**: Confirm all parameter types, defaults, and constraints match implementation +- **Return Values**: Verify documented return types and structures match actual code + +### NeMo Run Systematic Code Verification Process +1. **Locate Source Implementation**: Find the relevant class/function in the `nemo_run/` source directory +2. **Check pyproject.toml**: Verify any CLI commands, dependencies, or entry points +3. **Check Signatures**: Compare documented signatures with actual implementation +4. **Verify Parameters**: Ensure all required/optional parameters are correctly documented +5. **Test Imports**: Verify import paths work from documented context +6. 
**Execute Examples**: Run code examples to ensure they work with current codebase + +### NeMo Run Code Validation Checks +- ✅ All imports can be resolved from the `nemo_run` package +- ✅ Class constructors match documented parameters (especially in `nemo_run/config.py`) +- ✅ Method signatures include all required arguments +- ✅ Default parameter values match implementation +- ✅ Documented exceptions match those raised in source code +- ✅ Configuration classes match their actual schemas in `nemo_run/config.py` + +## NeMo Run Configuration Documentation + +### Configuration Schema Verification +- **Extract from Source**: Pull all configurable options directly from config classes in `nemo_run/config.py` +- **Required vs Optional**: Clearly distinguish required and optional configuration parameters +- **Default Values**: Verify defaults match implementation, not documentation assumptions +- **Valid Options**: Document valid ranges, enums, and interdependencies from source code + +### NeMo Run Configuration Validation Steps +1. **Locate Config Classes**: Find relevant configuration classes in `nemo_run/config.py` +2. **Extract Parameters**: List all configurable parameters from class definitions +3. **Check Defaults**: Verify default values match source code +4. **Test Examples**: Ensure configuration examples work with actual implementations +5. **Validate Constraints**: Check parameter validation logic in source code + +## NeMo Run Execution Backend Documentation + +### Execution Backend Verification +**Critical for NeMo Run**: Verify all execution backend documentation against actual implementations in `nemo_run/core/execution/` and `nemo_run/run/torchx_backend/schedulers/`. 
+ +- **Backend Implementations**: Check documented backends exist in source: + - `nemo_run/core/execution/`: `local.py`, `slurm.py`, `skypilot.py`, `docker.py`, `dgxcloud.py`, `lepton.py`, `kuberay.py` + - `nemo_run/run/torchx_backend/schedulers/`: Corresponding scheduler implementations +- **Entry Points**: Verify TorchX scheduler entry points in `pyproject.toml` match documented backends +- **Configuration Options**: Check backend-specific configuration options against actual implementations +- **Template Files**: Verify template files in `nemo_run/core/execution/templates/` match documented usage + +### Execution Backend Validation Steps +1. **Check Backend Files**: Verify documented backends exist in `nemo_run/core/execution/` +2. **Check Scheduler Files**: Verify corresponding TorchX schedulers in `nemo_run/run/torchx_backend/schedulers/` +3. **Check Entry Points**: Verify `[project.entry-points."torchx.schedulers"]` entries match documented backends +4. **Test Configuration**: Ensure backend configuration examples work with actual implementations +5. **Validate Templates**: Check template files match documented usage patterns + +## NeMo Run CLI Entrypoint Documentation + +### Entrypoint System Verification +**Unique to NeMo Run**: Verify all CLI entrypoint documentation against the entrypoint system in `nemo_run/cli/api.py`. + +- **Entrypoint Decorators**: Check documented `@entrypoint` decorators match actual implementation +- **Factory Functions**: Verify documented `@factory` decorators work with actual factory system +- **CLI Argument Parsing**: Check documented CLI argument syntax matches `nemo_run/cli/cli_parser.py` +- **Run Context**: Verify documented run context options match `RunContext` class in `nemo_run/cli/api.py` + +### Entrypoint Validation Steps +1. **Check Entrypoint Decorators**: Verify documented entrypoints match `@entrypoint` implementation +2. **Check Factory Decorators**: Verify documented factories match `@factory` implementation +3. 
**Test CLI Syntax**: Ensure documented CLI argument syntax works with actual parser +4. **Validate Run Options**: Check documented run options match `RunContext` class +5. **Test Examples**: Run entrypoint examples to ensure they work correctly + +## Tutorial and Example Verification + +### Executable Examples +- **Test All Tutorials**: Every tutorial in `docs/get-started/` should be executable against current codebase +- **Complete Examples**: Include all necessary imports, data setup, and dependencies +- **Data Requirements**: Verify example datasets and file formats are correctly specified +- **Environment Setup**: Ensure documented environment requirements are sufficient + +### NeMo Run Example Quality Standards +- **Realistic Data**: Use realistic dataset examples, not just placeholder text +- **Complete Workflows**: Examples should demonstrate end-to-end workflows with NeMo Run +- **Error Handling**: Document common error scenarios with actual error messages +- **Best Practices**: Examples should follow established patterns from `test/` directory + +## NeMo Run Project Component Documentation + +### Component Verification +Verify component documentation against the actual NeMo Run project structure: +- **Core Modules**: Check core functionality against `nemo_run/` source directory +- **CLI System**: Verify CLI documentation against `nemo_run/cli/` modules +- **Execution System**: Validate execution documentation against `nemo_run/core/execution/` and `nemo_run/run/` +- **Configuration System**: Ensure configuration docs match `nemo_run/config.py` +- **Extensions**: Validate extension or plugin documentation against `nemo_run/run/plugin.py` + +### NeMo Run Component Validation Steps +1. **Identify Components**: Map documented functionality to actual implementations in `nemo_run/` +2. **Check CLI Availability**: Verify CLI commands exist in `pyproject.toml` if documented +3. **Check Compatibility**: Verify component combinations work as documented +4. 
**Test Data Flow**: Ensure data formats between components are correct +5. **Validate Outputs**: Check that documented output formats match actual results + +## Automated Verification Integration + +### Code Validation Automation +- **Import Testing**: Automated tests that verify all documented imports work from `nemo_run` package +- **CLI Testing**: Automated tests that verify all documented CLI commands exist and work +- **Example Execution**: CI pipeline that runs tutorial and example code +- **Link Checking**: Verify all references to source code files are valid +- **Documentation Testing**: Include documentation examples in test suites + +### NeMo Run Verification Tooling +- **Pytest Integration**: Use pytest to validate code examples in documentation +- **CLI Validation**: Test that all documented CLI commands can be executed +- **Configuration Validation**: Test configuration examples against actual schemas +- **Import Resolution**: Check that all documented imports resolve correctly from `nemo_run` package +- **Backend Testing**: Test execution backend examples against actual implementations + +## Verification Workflow + +### During Documentation Creation +1. **Check pyproject.toml**: Verify CLI commands, dependencies, and project configuration +2. **Identify Source Components**: Locate relevant modules/classes in the `nemo_run/` source +3. **Extract Implementation Details**: Document actual parameters, types, and behavior +4. **Create Working Examples**: Build examples that execute successfully +5. **Test Against Source**: Validate examples against current codebase +6. **Document Dependencies**: List all required packages and versions + +### During Documentation Updates +1. **Check for Config Changes**: Compare current `pyproject.toml` against previously documented versions +2. **Check for Code Changes**: Compare current source against previously documented versions +3. **Update Examples**: Modify examples to reflect any implementation changes +4. 
**Re-validate**: Run full validation process on updated examples +5. **Test Breaking Changes**: Verify that source changes don't break existing examples + +### Quality Assurance Checklist + +Before publishing NeMo Run documentation: +- [ ] CLI commands verified against `[project.scripts]` in `pyproject.toml` +- [ ] Dependencies match those listed in `pyproject.toml` +- [ ] All code examples tested against current `nemo_run/` source code +- [ ] Import statements verified for correctness from `nemo_run` package +- [ ] Configuration examples tested with actual implementations in `nemo_run/config.py` +- [ ] Execution backend examples tested against actual implementations +- [ ] CLI entrypoint examples tested with actual entrypoint system +- [ ] Function/class signatures match source code exactly +- [ ] Example datasets and file formats validated +- [ ] Error scenarios documented with actual error messages +- [ ] Dependencies clearly specified in requirements +- [ ] TorchX scheduler entry points verified in `pyproject.toml` + +## When Discrepancies Are Found + +### Immediate Actions +1. **Document the Issue**: Create clear warning admonitions in affected documentation +2. **Track the Problem**: File GitHub issues with specific details about the discrepancy +3. **Determine Root Cause**: Investigate whether source code, `pyproject.toml`, or documentation needs correction +4. **Establish Timeline**: Set clear deadlines for resolution + +### Example Warning Format +```markdown +:::{warning} +**Implementation Verification Required**: This example has not been validated against the current NeMo Run project source code and configuration. Please verify imports, CLI commands, and parameters before use. 
+::: +``` + +## Verification Tracking + +### Documentation Metadata +- Maintain "last verified" timestamps on code example sections +- Track which project version was used for verification +- Document any known discrepancies and their resolution status +- Link to specific source files and `pyproject.toml` sections used for verification + +### Coverage Metrics +- Measure percentage of code examples that have been source-validated +- Track CLI command documentation coverage against `pyproject.toml` entries +- Track documentation verification coverage across different `nemo_run/` modules +- Prioritize verification for high-impact or frequently-used components +- Monitor source code change frequency and documentation update lag + +## NeMo Run Component-Specific Guidelines + +### For CLI Commands +- **Always Check pyproject.toml First**: Before documenting any CLI command, verify it exists in `[project.scripts]` +- **Verify Script Paths**: Ensure the console script paths point to `nemo_run.__main__:app` +- **Test Command Execution**: Run CLI commands with documented examples to ensure they work +- **Check Argument Parsing**: Verify documented arguments match actual Typer argument parsing in `nemo_run/cli/api.py` + +### For NeMo Run Modules +Adapt these guidelines based on the specific NeMo Run project structure: +- **Core Functionality**: Always validate against main `nemo_run/` modules +- **CLI System**: Verify examples work with CLI modules in `nemo_run/cli/` +- **Configuration**: Check configuration examples against actual config classes in `nemo_run/config.py` +- **Execution Backends**: Verify execution examples against implementations in `nemo_run/core/execution/` +- **TorchX Integration**: Check TorchX examples against `nemo_run/run/torchx_backend/` + +### For Configuration Examples +- **Configuration Files**: Validate YAML/JSON configs against actual implementations +- **Environment Configs**: Check environment-specific examples work with documented parameters +- 
**Deployment Configs**: Verify deployment examples against actual deployment implementations + +### For Dependencies and Installation +- **Package Dependencies**: Verify all documented dependencies exist in `pyproject.toml` +- **Optional Dependencies**: Check optional dependency groups (`skypilot`, `ray`) match documentation +- **Version Requirements**: Ensure documented version constraints match `pyproject.toml` +- **Installation Instructions**: Test installation commands work with current project configuration + +### For Execution Backends +- **Backend Availability**: Verify documented backends exist in `nemo_run/core/execution/` +- **Scheduler Integration**: Check TorchX scheduler entry points in `pyproject.toml` +- **Configuration Options**: Validate backend-specific configuration against actual implementations +- **Template Files**: Verify template usage matches files in `nemo_run/core/execution/templates/` + +Use these verification guidelines to maintain the highest level of accuracy between NeMo Run source code implementation, project configuration, and documentation, ensuring users can successfully execute workflows and use the project as documented. diff --git a/.cursor/rules/docs-os.mdc b/.cursor/rules/docs-os.mdc new file mode 100644 index 00000000..356c6fe8 --- /dev/null +++ b/.cursor/rules/docs-os.mdc @@ -0,0 +1,13 @@ +--- +description: +globs: +alwaysApply: true +--- +When a user provides one of the following commands, perform its associated task. 
+ +- **Commands**: + - "::a": check the article against the source code in the project's main source directory using the [docs-info-verification.mdc](mdc:.cursor/rules/docs-info-verification.mdc) + - "::r": revise the article using the [docs-style-guide.mdc](mdc:.cursor/rules/docs-style-guide.mdc) + - "::f": format the article using the [docs-tooling.mdc](mdc:.cursor/rules/docs-tooling.mdc) + - "::bv": bump project version using [docs-bump-version.mdc](mdc:.cursor/rules/docs-bump-version.mdc) + - "::h-mr": help resolve a merge conflict using [docs-assist-merge-conflict-resolution.mdc](mdc:.cursor/rules/docs-assist-merge-conflict-resolution.mdc) \ No newline at end of file diff --git a/.cursor/rules/docs-personas.mdc b/.cursor/rules/docs-personas.mdc new file mode 100644 index 00000000..67fb9e1e --- /dev/null +++ b/.cursor/rules/docs-personas.mdc @@ -0,0 +1,57 @@ +--- +description: Documentation guidelines for addressing specific audience personas including data scientists, MLEs, cluster administrators, and DevOps professionals. +globs: +alwaysApply: false +--- +# Documentation Personas + +Our documentation is for enterprise AI products and features consumed by personas like data scientists, machine learning engineers, cluster administrators, and DevOps professionals. Common tools include Kubernetes, notebooks, Python, and AI models. 
+ +## Persona-Specific Guidelines + +### For Data Scientists +- Include details on data formats and schemas +- Document model validation methodologies clearly +- Specify computational requirements for processing +- Provide notebook examples with clear annotations +- Include metrics for model performance evaluation +- Document data preprocessing and feature engineering steps +- Explain hyperparameter tuning approaches + +### For Machine Learning Engineers +- Document model training pipelines completely +- Include details on environment setup and dependencies +- Provide testing and validation methodologies +- Explain model deployment workflows +- Document CI/CD integration for ML pipelines +- Include performance optimization techniques +- Provide examples of model serving configurations + +### For Cluster Administrators +- Include complete deployment architectures +- Document security considerations thoroughly +- Provide detailed monitoring and maintenance procedures +- Include troubleshooting guides for common issues +- Document resource allocation recommendations +- Provide scaling guidelines for different workloads +- Include backup and disaster recovery procedures + +### For DevOps Professionals +- Document infrastructure-as-code implementations +- Include automation workflows and templates +- Provide observability and monitoring setup +- Document CI/CD pipeline integrations +- Include security best practices for deployments +- Explain high availability configurations +- Document canary deployment and rollback procedures + +## Writing Style by Persona + +When writing for different personas, adjust your style: + +- **Data Scientists**: Focus on analytical concepts and model behavior +- **ML Engineers**: Emphasize implementation details and pipeline architecture +- **Cluster Admins**: Prioritize operational stability and system interactions +- **DevOps**: Focus on automation, reliability, and maintainability + +Keep these personas in mind when drafting and 
evaluating documentation. \ No newline at end of file diff --git a/.cursor/rules/docs-style-guide.mdc b/.cursor/rules/docs-style-guide.mdc new file mode 100644 index 00000000..97d34f0e --- /dev/null +++ b/.cursor/rules/docs-style-guide.mdc @@ -0,0 +1,17 @@ +--- +description: The documentation style guide. +globs: +alwaysApply: false +--- +Adhere to the NVIDIA Style Guide when drafting or editing content. Ensure the following: + +1. Brand Name: Ensure "NVIDIA" is always in all caps. +2. Voice & Tone: Use PACE (Professional, Active, Conversational, Engaging). Maintain a consistent, natural voice. Use active voice and contractions. +3. Abbreviations & Acronyms: Spell out on first use, then use abbreviations. Common acronyms like PC don't need to be spelled out. +4. Capitalization: Use title case for headings and proper nouns. Avoid capitalizing conjunctions and prepositions of three letters or fewer. +5. Punctuation: Use Oxford commas. Avoid ampersands unless in titles. Use em dashes for emphasis, en dashes for ranges. +6. Dates & Times: Use the 12-hour format with periods (e.g., 12:45 p.m.). Abbreviate months in tables. +7. Units of Measurement: Be consistent. Use a space between the number and unit (e.g., 40 GB). +8. Inclusivity: Avoid gender-specific pronouns and Latinisms. Use plain English. +9. Formatting: Use italics for publication titles and games. Use bold for UI elements. +Please review the text for these guidelines and suggest any necessary edits to ensure compliance with the NVIDIA Style Guide. \ No newline at end of file diff --git a/.cursor/rules/docs-tooling.mdc b/.cursor/rules/docs-tooling.mdc new file mode 100644 index 00000000..e5f1ac63 --- /dev/null +++ b/.cursor/rules/docs-tooling.mdc @@ -0,0 +1,132 @@ +--- +description: Specialized MyST/Sphinx markdown directives for documentation files. Apply when drafting or editing documentation that uses directives, grids, tabs, admonitions, or other Sphinx extensions. 
+globs: +alwaysApply: false +--- +# Sphinx/MyST Documentation Tooling Rule + +Our documentation uses Sphinx with MyST markdown and sphinx-design. Follow these rules for all files in `docs/`: + +- **Headers:** + - Every header (including subheaders) must have a unique relref label above it in the format `(section-article-header)=`. + - Example: `(evaluate-jobs-create)=` + - Use lowercase, hyphens, and abbreviate as needed. + - Avoid duplicate labels across the project. + - **Example:** + ```markdown + (evaluate-jobs-create)= + # Create Evaluation Job + + (evaluate-jobs-create-prereqs)= + ## Prerequisites + ``` + +- **Code Samples:** + - Use dropdowns for code blocks longer than 10 lines or for any code that may distract from the main flow. + - Title dropdowns meaningfully (e.g., “Python Example”, “Full YAML Config”). + - **Example:** + ```markdown + :::{dropdown} Python Example + :icon: code-square + + ```python + def hello(): + print("Hello, world!") + # ...more code... + ``` + ::: + ``` + +- **Tabs:** + - Use tab-sets for alternative options (e.g., curl vs. python, CLI vs. UI). + - Each tab should have a clear, descriptive title. + - Use the `:sync:` attribute for tab-sets to synchronize tab selection across the page. + - **Example:** + ```markdown + :::: {tab-set} + + ::: {tab-item} Python + :sync: sync-pyth + `requests.get("https://api.example.com")` + ::: + ::: {tab-item} curl + :sync: sync-cli + `curl https://api.example.com` + ::: + :::: + ``` + +- **Cards:** + - Use grid card links for parent/overview pages with multiple links to subtopics or guides. + - Each card should have a short title and a 1–2 sentence description. + - **Example:** + ```markdown + :::: {grid} 1 2 2 2 + :gutter: 1 1 1 2 + + ::: {grid-item-card} Getting Started + :link: get-started-index + :link-type: ref + + + + Learn how to get started with the platform. + ::: + :::: + ``` + +- **Lists/Tables:** + - Use lists or tables for short, related items, steps, or comparisons. 
- Prefer tables for side-by-side feature or parameter comparisons. + - **Example (Table):** + ```markdown + | Feature | Supported | + |---------|-----------| + | Tabs | Yes | + | Cards | Yes | + ``` + - **Example (List):** + ```markdown + - Step 1: Do this + - Step 2: Do that + ``` + - Prefer MyST list tables over Markdown tables for advanced tables + +- **Admonitions:** + - Use admonitions sparingly because they interrupt the flow of information. + - Use a note to identify surprising or unexpected behavior. + - Use a warning to identify risk of physical injury or data loss. + - Use a tip to reveal a positive behavior in the software that someone might not discover on their own. + - Example: + - `note` for general info + - `warning` for risks + - `tip` for best practices + - **Example:** + ```markdown + :::{note} + You must have an API key before proceeding. + ::: + ``` + +- **References:** + - Use `{ref}` for internal links, referencing the relref label. + - Check for existing relref labels before creating new ones. + - **Example:** + ```markdown + Refer to the {ref}`customization job ` guide. + ``` + +- **Accessibility:** + - Add alt text to images. + - Use descriptive tab and dropdown titles. + - Ensure tables and lists are readable by screen readers. + - Avoid ableist language like "see", "above", "below". + +- **Style & Consistency:** + - Follow the style guide for tone, terminology, and formatting. + - Use consistent abbreviations and naming conventions. + +- **If Unsure:** + - Check existing docs for examples. + - Ask for clarification if a rule or pattern is unclear. 
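The relref-label convention in the tooling rule (`(section-article-header)=` in lowercase with hyphens, placed directly above each header) can be checked mechanically. The following is an illustrative sketch, not existing repo tooling; the regex simply encodes the convention as stated.

```python
import re

# Matches a MyST target label line such as "(evaluate-jobs-create)=":
# lowercase words separated by single hyphens, wrapped in parentheses, ending in "=".
LABEL_RE = re.compile(r"^\([a-z0-9]+(?:-[a-z0-9]+)*\)=$")

def check_labels(markdown):
    """Return labels that sit above a heading but violate the naming convention."""
    bad = []
    lines = markdown.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.endswith(")=") and lines[i + 1].lstrip().startswith("#"):
            if not LABEL_RE.match(line.strip()):
                bad.append(line.strip())
    return bad

doc = "(evaluate-jobs-create)=\n# Create Evaluation Job\n(Bad_Label)=\n## Prerequisites\n"
print(check_labels(doc))  # the second label breaks the lowercase-hyphen rule
```

A check like this doesn't catch duplicate labels across files; extending it with a project-wide set of seen labels would cover that rule too.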
+ diff --git a/.vscode/settings.json b/.vscode/settings.json index 49ae8905..b4459671 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -20,5 +20,6 @@ "test" ], "python.testing.unittestEnabled": false, - "python.testing.pytestEnabled": true + "python.testing.pytestEnabled": true, + "iis.configDir": "" } diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 74c2d656..bec1baf5 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -30,10 +30,10 @@ Not all steps are necessary for some contributions, so read the linked sections We use [uv](https://docs.astral.sh/uv/) to develop NeMo Run. The following steps should get you started with the dev environment: 1. Install [uv](https://docs.astral.sh/uv/getting-started/installation/) -2. Clone NeMo-Run +2. Clone NeMo Run 3. Sanity check with `uv sync --extra skypilot && uv run -- pytest test/` (This will create a venv and run all unit tests) -If all tests passed, then you should be good to get started with the development of NeMo-Run. +If all tests passed, then you should be good to get started with the development of NeMo Run. ## Code Structure diff --git a/README.md b/README.md index 9ef19030..276aa100 100644 --- a/README.md +++ b/README.md @@ -1,109 +1,171 @@ # NeMo Run +[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) +[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE) +[![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg)](https://docs.nvidia.com/nemo-run) +[![PyPI](https://img.shields.io/badge/pypi-nemo--run-blue.svg)](https://pypi.org/project/nemo-run/) + > [!IMPORTANT] -> NeMo Run is still in active development and this is a pre-release. The API is subject to change without notice while in pre-release. First official release will be 0.1.0 and will be included in NeMo FW 24.09 as well. +> NeMo Run is currently in active development and this is a pre-release. 
The API is subject to change without notice while in pre-release. The first official release will be 0.1.0 and will be included in NeMo FW 24.09. + +**NeMo Run** is a comprehensive Python framework for configuring, executing, and managing machine learning experiments across diverse computing environments. Built with a focus on reproducibility, flexibility, and scalability, NeMo Run decouples experiment configuration from execution, enabling researchers and ML engineers to seamlessly transition between local development, cloud platforms, and high-performance computing clusters. + +## 🚀 Key Features + +- **🔧 Type-Safe Configuration**: Python-based configuration using Fiddle with automatic validation and type safety +- **🌐 Multi-Environment Execution**: Support for local, Docker, Slurm, Kubernetes, and cloud platforms (AWS, GCP, Azure, DGX Cloud) +- **📊 Experiment Management**: Comprehensive experiment tracking with metadata preservation and reproducibility +- **🎯 Modular Architecture**: Clean separation between configuration, execution, and management layers +- **⚡ Ray Integration**: Native support for distributed computing with Ray +- **🔍 Rich CLI**: Intelligent command-line interface with type-safe argument parsing -NeMo Run is a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments. NeMo Run has three core responsibilities: +## 🏗️ Core Architecture -1. [Configuration](./docs/source/guides/configuration.md) -2. [Execution](./docs/source/guides/execution.md) -3. [Management](./docs/source/guides/management.md) +NeMo Run is built around three core pillars: -To learn more, click on each link. This represents the typical order that NeMo Run users follow for setting up and launching experiments. 
+### Configuration -- [NeMo Run](#nemo-run) - - [Why Use NeMo Run?](#why-use-nemo-run) - - [Install NeMo Run](#install-nemo-run) - - [Get Started](#get-started) - - [Design Philosophy and Inspiration](#design-philosophy-and-inspiration) - - [Pythonic](#pythonic) - - [Modular](#modular) - - [Opinionated but Flexible](#opinionated-but-flexible) - - [Set Up Once and Scale Easily](#set-up-once-and-scale-easily) - - [Tutorials](#tutorials) - - [Hello world](#hello-world) - - [Contribute to NeMo Run](#contribute-to-nemo-run) - - [FAQs](#faqs) +Python-based configuration using Fiddle, supporting complex nested structures and type safety. See our [Configuration Guide](docs/guides/configuration.md) for detailed information. +### Execution -## Why Use NeMo Run? -Please see this [detailed guide](./docs/source/guides/why-use-nemo-run.md) for reasons to use NeMo Run. +Multi-environment execution with executors for local, Docker, Slurm, Kubernetes, and cloud platforms. Learn more in our [Execution Guide](docs/guides/execution.md). -## Install NeMo Run -To install the project, use the following command: +### Management + +Experiment lifecycle management with metadata tracking, logging, and reproducibility. Explore our [Management Guide](docs/guides/management.md) for comprehensive details. + +## 📦 Installation ```bash pip install git+https://github.com/NVIDIA-NeMo/Run.git ``` -Make sure you have `pip` installed and configured properly. +### Requirements + +- Python 3.8+ +- pip (for package installation) +- Access to computing resources (local, cloud, or cluster) + +## 🚀 Quick Start -## Get Started -To get started with NeMo Run, follow these three steps based on the core responsibilities mentioned above. For this example, we’ll showcase a pre-training example in Nemo 2.0 using Llama3. +Get started with NeMo Run in three simple steps: + +### 1. Configure Your Function -1. 
Configure your function: ```python from nemo.collections import llm -partial_func = llm.llama3_8b.pretrain_recipe(name="llama3-8b", ckpt_dir="/path/to/store/checkpoints", num_nodes=1, num_gpus_per_node=8) + +# Configure a pre-training recipe +partial_func = llm.llama3_8b.pretrain_recipe( + name="llama3-8b", + ckpt_dir="/path/to/store/checkpoints", + num_nodes=1, + num_gpus_per_node=8 +) ``` -2. Define your Executor: +### 2. Define Your Executor + ```python import nemo_run as run -# Local executor -local_executor = run.LocalExecutor() + +# Choose your execution environment +local_executor = run.LocalExecutor() # Local execution +# docker_executor = run.DockerExecutor() # Docker execution +# slurm_executor = run.SlurmExecutor() # Slurm cluster +# ray_executor = run.RayExecutor() # Ray distributed ``` -3. Run your experiment: +### 3. Run Your Experiment + ```python +# Execute your experiment run.run(partial_func, executor=local_executor, name="llama3_8b_pretraining") ``` -## Design Philosophy and Inspiration -In building NeMo Run, we drew inspiration from and relied on the following primary libraries. We would like to extend our gratitude for their work. +## 🎯 Why Use NeMo Run? 
+ +NeMo Run addresses critical challenges in ML experiment management: + +- **🔧 Configuration Flexibility**: Type-safe, composable configurations with Python's type system +- **🚀 Execution Modularity**: True environment independence with executor abstraction +- **📊 Experiment Management**: Comprehensive tracking with full metadata preservation +- **🔄 Reproducibility**: One-command experiment reconstruction from metadata +- **⚡ Scalability**: Seamless transition from local development to distributed clusters -- [Fiddle](https://github.com/google/fiddle) -- [TorchX](https://github.com/pytorch/torchx/) -- [Skypilot](https://github.com/skypilot-org/skypilot/) -- [XManager](https://github.com/google-deepmind/xmanager/tree/main) -- [Fabric](https://github.com/fabric/fabric) and [Paramiko](https://github.com/paramiko/paramiko) -- [Rich](https://github.com/Textualize/rich) -- [Jinja](https://github.com/pallets/jinja/) +For detailed information, see our [Why Use NeMo Run guide](docs/about/why-nemo-run.md). -Apart from these, we also build on other libraries. A full list of dependencies can be found in [pyproject.toml](pyproject.toml). 
+## 📚 Documentation -NeMo Run was designed keeping the following principles in mind: +- **[Getting Started](docs/get-started/index.md)** - Quick setup and tutorials +- **[Configuration Guide](docs/guides/configuration.md)** - Type-safe configuration management +- **[Execution Guide](docs/guides/execution.md)** - Multi-environment execution +- **[Management Guide](docs/guides/management.md)** - Experiment lifecycle management +- **[CLI Reference](docs/reference/cli.md)** - Command-line interface documentation +- **[FAQs](docs/reference/faqs.md)** - Frequently asked questions + +## 🎓 Tutorials + +### Hello World Series + +The `hello_world` tutorial series provides a comprehensive introduction to NeMo Run: + +- **[Part 1](https://github.com/NVIDIA-NeMo/Run/blob/main/examples/hello-world/hello_world.ipynb)** - Basic configuration and execution +- **[Part 2](https://github.com/NVIDIA-NeMo/Run/blob/main/examples/hello-world/hello_experiments.ipynb)** - Experiment management +- **[Part 3](https://github.com/NVIDIA-NeMo/Run/blob/main/examples/hello-world/hello_scripts.py)** - Script-based execution + +## 🏛️ Design Philosophy + +NeMo Run was designed with these core principles: ### Pythonic -In NeMo Run, you can build and configure everything using Python, eliminating the need for multiple combinations of tools to manage your experiments. The only exception is when setting up the environment for remote execution, where we rely on Docker. + +Build and configure everything using Python, eliminating the need for multiple tools to manage experiments. ### Modular -The decoupling of task and executor allows you to form different combinations of execution units with relative ease. You configure different remote environments once, and you can reuse it across a variety of tasks in a Pythonic way. + +Decoupled task and executor design allows easy combination of different execution environments. 
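The task/executor decoupling described under "Modular" can be illustrated with a minimal, self-contained sketch. The `Task`, `LocalExecutor`, and `DryRunExecutor` classes below are hypothetical stand-ins for the pattern, not NeMo Run's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical illustration of the task/executor decoupling pattern --
# these are NOT NeMo Run's real classes.

@dataclass
class Task:
    """A configured unit of work, independent of where it runs."""
    fn: Callable[..., Any]
    kwargs: dict = field(default_factory=dict)

class LocalExecutor:
    """Runs the task in the current process."""
    def submit(self, task: Task) -> Any:
        return task.fn(**task.kwargs)

class DryRunExecutor:
    """Describes what would run without executing anything."""
    def submit(self, task: Task) -> str:
        return f"would run {task.fn.__name__} with {task.kwargs}"

def pretrain(steps: int) -> str:
    return f"trained for {steps} steps"

# One task definition, reused across execution environments:
task = Task(pretrain, {"steps": 10})
print(LocalExecutor().submit(task))
print(DryRunExecutor().submit(task))
```

Because the task carries no knowledge of its executor, swapping local execution for a cluster backend is a one-line change at the call site, which is the design choice the "Modular" principle describes.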
### Opinionated but Flexible -NeMo Run is opinionated in some places, like storing of metadata information for experiments in a particular manner. However, it remains flexible enough to accommodate most user experiments. + +Opinionated in metadata storage and experiment structure, but flexible enough for most use cases. ### Set Up Once and Scale Easily -While it may take some time initially for users to become familiar with NeMo Run concepts, the tool is designed to scale experimentation in a fluid and easy manner. -## Tutorials +Initial learning curve pays off with fluid and easy experimentation scaling. + +## 🤝 Contributing + +We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details on: + +- Code of Conduct +- Development Setup +- Pull Request Process +- Issue Reporting + +## 📄 License + +This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details. + +## 🙏 Acknowledgments -#### Hello world +NeMo Run builds upon the excellent work of these open-source projects: -The `hello_world` tutorial series provides a comprehensive introduction to NeMo Run, demonstrating its capabilities through a simple example. The tutorial covers: +- [Fiddle](https://github.com/google/fiddle) - Configuration framework +- [TorchX](https://github.com/pytorch/torchx/) - Job submission framework +- [Skypilot](https://github.com/skypilot-org/skypilot/) - Multi-cloud execution +- [XManager](https://github.com/google-deepmind/xmanager) - Experiment management +- [Ray](https://github.com/ray-project/ray) - Distributed computing +- [Rich](https://github.com/Textualize/rich) - Rich terminal output +- [Typer](https://github.com/tiangolo/typer) - CLI framework -- Configuring Python functions using `Partial` and `Config` classes. -- Executing configured functions locally and on remote clusters. -- Visualizing configurations with `graphviz`. -- Creating and managing experiments using `run.Experiment`. 
+## 📞 Support -You can find the tutorial series below: -- [Part 1](examples/hello-world/hello_world.ipynb). -- [Part 2](examples/hello-world/hello_experiments.ipynb). -- [Part 3](examples/hello-world/hello_scripts.py). +- **Documentation**: [docs.nvidia.com/nemo-run](https://docs.nvidia.com/nemo-run) +- **Issues**: [GitHub Issues](https://github.com/NVIDIA-NeMo/Run/issues) +- **Discussions**: [GitHub Discussions](https://github.com/NVIDIA-NeMo/Run/discussions) -## Contribute to NeMo Run -Please see the [contribution guide](./CONTRIBUTING.md) to contribute to NeMo Run. +--- -## FAQs -Please find a list of frequently asked questions [here](./docs/source/faqs.md). +**NeMo Run** is developed by [NVIDIA](https://www.nvidia.com/) as part of the NeMo framework for large language models and AI research. diff --git a/archive/docs-improvements-analysis.md b/archive/docs-improvements-analysis.md new file mode 100644 index 00000000..388189bc --- /dev/null +++ b/archive/docs-improvements-analysis.md @@ -0,0 +1,263 @@ +# NeMo Run Documentation Improvements Analysis + +## Executive Summary + +The NeMo Run documentation has undergone a significant transformation from the archive version (`archive/docs/source`) to the current version (`/docs`). This analysis documents the major improvements in structure, content organization, user experience, and technical capabilities that have been implemented to create a more comprehensive, accessible, and professional documentation system. 
+ +## Directory Structure Comparison + +### Archive Structure (archive/docs/source) + +``` +archive/docs/source/ +├── conf.py (2.3KB, basic Sphinx config) +├── index.rst (2.4KB, simple landing page) +├── faqs.md (6.1KB, standalone FAQ) +└── guides/ + ├── index.rst (146B, minimal) + ├── cli.md (20KB) + ├── configuration.md (7.0KB) + ├── execution.md (16KB) + ├── management.md (5.9KB) + ├── ray.md (9.8KB) + └── why-use-nemo-run.md (4.6KB) +``` + +### Current Structure (/docs) + +``` +docs/ +├── conf.py (9.3KB, advanced configuration) +├── nemo-run-index.md (5.6KB, comprehensive landing page) +├── README.md (16KB, detailed overview) +├── BUILD_INSTRUCTIONS.md (9.5KB, build documentation) +├── project.json (54B, metadata) +├── versions1.json (102B, version tracking) +├── test_json_output.py (7.0KB, testing) +├── _extensions/ (custom Sphinx extensions) +├── _build/ (build artifacts) +├── about/ +│ ├── index.md (6.2KB, overview) +│ ├── key-features.md (29KB, comprehensive features) +│ └── why-nemo-run.md (6.6KB, value proposition) +├── get-started/ +│ ├── index.md (1.3KB, getting started overview) +│ ├── install.md (9.5KB, detailed installation) +│ ├── quickstart.md (11KB, quick start guide) +│ └── tutorials.md (10KB, learning resources) +├── guides/ +│ ├── index.md (2.9KB, guides overview) +│ ├── configuration.md (18KB, expanded configuration) +│ ├── execution.md (22KB, expanded execution) +│ ├── management.md (18KB, expanded management) +│ ├── packaging.md (15KB, new packaging guide) +│ └── ray.md (27KB, expanded Ray documentation) +└── reference/ + ├── index.md (2.2KB, reference overview) + ├── cli.md (23KB, expanded CLI reference) + ├── faqs.md (13KB, expanded FAQs) + ├── troubleshooting.md (11KB, new troubleshooting guide) + └── glossary.md (7.3KB, new glossary) +``` + +## Major Improvements + +### 1. 
Information Architecture & Organization + +#### **Before (Archive)** + +- Flat structure with minimal organization +- Single landing page with basic navigation +- No clear user journey or information hierarchy +- Limited content categorization + +#### **After (Current)** + +- **Hierarchical Information Architecture**: Clear separation into About, Get Started, Guides, and Reference sections +- **User-Centric Organization**: Content organized by user needs and experience levels +- **Progressive Disclosure**: Information presented in logical progression from overview to detailed implementation +- **Clear Navigation Paths**: Multiple entry points for different user types + +### 2. Content Expansion & Depth + +#### **Content Volume Growth** + +- **Total Content**: Increased from ~70KB to ~200KB+ of documentation +- **Configuration Guide**: 7.0KB → 18KB (157% increase) +- **Execution Guide**: 16KB → 22KB (38% increase) +- **Management Guide**: 5.9KB → 18KB (205% increase) +- **Ray Documentation**: 9.8KB → 27KB (176% increase) +- **CLI Reference**: 20KB → 23KB (15% increase) + +#### **New Content Areas** + +- **Packaging Strategies**: 15KB comprehensive guide (completely new) +- **Troubleshooting Guide**: 11KB detailed troubleshooting (completely new) +- **Technical Glossary**: 7.3KB terminology reference (completely new) +- **Installation Guide**: 9.5KB detailed setup instructions (completely new) +- **Quickstart Guide**: 11KB hands-on tutorial (completely new) +- **Key Features**: 29KB comprehensive technical overview (completely new) + +### 3. 
Technical Documentation Quality + +#### **Before (Archive)** + +- Basic Sphinx configuration with minimal extensions +- Simple RST-based structure +- Limited metadata and SEO optimization +- No advanced documentation features + +#### **After (Current)** + +- **Advanced Sphinx Configuration**: 9.3KB vs 2.3KB configuration file +- **Custom Extensions**: Multiple custom Sphinx extensions for enhanced functionality +- **Rich Metadata**: Comprehensive frontmatter with descriptions, tags, and categories +- **SEO Optimization**: Structured metadata for better discoverability +- **Interactive Elements**: Grid layouts, tabs, dropdowns, and admonitions +- **Code Examples**: Extensive code examples with syntax highlighting + +### 4. User Experience Enhancements + +#### **Visual Design & Layout** + +- **Grid-Based Layouts**: Modern card-based navigation using Sphinx Design +- **Interactive Elements**: Tabbed content, collapsible sections, and dropdowns +- **Icon Integration**: Octicons for visual hierarchy and navigation +- **Responsive Design**: Mobile-friendly layouts and navigation + +#### **Content Presentation** + +- **Progressive Disclosure**: Information presented in logical layers +- **Multiple Entry Points**: Different paths for different user types +- **Clear Call-to-Actions**: Explicit next steps and navigation guidance +- **Consistent Formatting**: Standardized structure across all documents + +### 5. 
Technical Capabilities + +#### **Advanced Configuration System** + +- **Type-Safe Configuration**: Detailed documentation of `run.Config`, `run.Partial`, and `run.Script` +- **Configuration Examples**: Extensive code examples showing real-world usage +- **Validation Rules**: Documentation of custom validation and transformation capabilities +- **CLI Integration**: Comprehensive coverage of command-line parameter handling + +#### **Multi-Environment Execution** + +- **Comprehensive Backend Coverage**: Detailed documentation for all execution environments +- **Environment-Specific Guidance**: Tailored instructions for local, Docker, Slurm, cloud, and Kubernetes +- **Resource Management**: Advanced resource allocation and optimization strategies +- **Cost Optimization**: Cloud cost management and optimization techniques + +#### **Packaging Strategies** + +- **Multiple Packaging Options**: GitArchive, Pattern, and Hybrid packagers +- **Deployment Best Practices**: Guidelines for different deployment scenarios +- **Code Reproducibility**: Strategies for ensuring reproducible experiments +- **Performance Optimization**: Packaging strategies for optimal performance + +### 6. Developer Experience + +#### **Installation & Setup** + +- **Comprehensive Installation Guide**: Detailed setup instructions for all environments +- **Optional Dependencies**: Clear guidance on when and how to install optional components +- **Environment-Specific Instructions**: Tailored setup for different platforms +- **Verification Steps**: Clear instructions for verifying successful installation + +#### **Quickstart & Tutorials** + +- **Hands-On Learning**: Step-by-step tutorials with working examples +- **Progressive Complexity**: Tutorials that build from simple to complex scenarios +- **Real-World Examples**: Practical examples that demonstrate real usage patterns +- **Troubleshooting Integration**: Built-in troubleshooting guidance + +### 7. 
Reference Documentation + +#### **CLI Reference** + +- **Comprehensive Coverage**: Complete command-line interface documentation +- **Usage Examples**: Practical examples for all commands and options +- **Parameter Documentation**: Detailed explanation of all parameters and flags +- **Integration Examples**: Examples showing CLI integration with other tools + +#### **Troubleshooting & Support** + +- **Common Issues**: Comprehensive coverage of frequently encountered problems +- **Error Messages**: Detailed explanation of error messages and resolution steps +- **Debugging Techniques**: Advanced debugging and diagnostic techniques +- **Support Resources**: Clear guidance on getting additional help + +### 8. Content Quality Improvements + +#### **Technical Accuracy** + +- **Source Code Verification**: Documentation aligned with actual implementation +- **API Consistency**: Consistent documentation of APIs and interfaces +- **Version Compatibility**: Clear version requirements and compatibility information +- **Best Practices**: Industry-standard best practices and recommendations + +#### **Writing Quality** + +- **Clear Language**: Technical concepts explained in accessible language +- **Consistent Terminology**: Standardized terminology throughout documentation +- **Logical Flow**: Information presented in logical, progressive order +- **Actionable Content**: Clear, actionable instructions and guidance + +## Technical Infrastructure Improvements + +### 1. Sphinx Configuration + +- **Advanced Extensions**: Custom extensions for enhanced functionality +- **MyST Parser**: Markdown support with advanced features +- **Theme Customization**: NVIDIA Sphinx theme with custom styling +- **Build Optimization**: Optimized build process and output + +### 2. 
Custom Extensions + +- **Content Gating**: Conditional content based on build parameters +- **JSON Output**: Structured data output for search and indexing +- **AI Assistant**: Intelligent search and response capabilities +- **Enhanced Search**: Advanced search functionality with better results + +### 3. Metadata & SEO + +- **Structured Metadata**: Comprehensive frontmatter with descriptions and tags +- **Search Optimization**: Optimized content for search engines +- **Cross-References**: Internal linking and cross-referencing +- **Version Tracking**: Automated version management and tracking + +## Recommendations for Technical Writers + +### 1. Content Strategy + +- **User Personas**: Continue developing content for specific user types (data scientists, MLEs, DevOps) +- **Progressive Disclosure**: Maintain the layered approach to information presentation +- **Consistent Structure**: Maintain the established content structure and formatting standards +- **Regular Updates**: Establish processes for keeping documentation current with code changes + +### 2. Quality Assurance + +- **Source Code Verification**: Implement regular verification against source code +- **User Testing**: Conduct regular user testing to validate documentation effectiveness +- **Peer Review**: Establish peer review processes for technical accuracy +- **Feedback Integration**: Create mechanisms for collecting and integrating user feedback + +### 3. Maintenance + +- **Version Synchronization**: Ensure documentation stays synchronized with code releases +- **Link Validation**: Regular validation of internal and external links +- **Content Audits**: Periodic audits to identify outdated or missing information +- **Performance Monitoring**: Monitor documentation performance and user engagement + +### 4. 
Future Enhancements + +- **Interactive Examples**: Consider adding interactive code examples +- **Video Content**: Explore video tutorials for complex concepts +- **Community Contributions**: Establish processes for community documentation contributions +- **Localization**: Consider internationalization for global user base + +## Conclusion + +The transformation of the NeMo Run documentation represents a significant improvement in both quality and comprehensiveness. The new structure provides a much better user experience with clear navigation, comprehensive content, and modern presentation. The technical depth and accuracy have been substantially enhanced, making the documentation a valuable resource for users at all levels. + +The improvements demonstrate best practices in technical documentation, including user-centric design, progressive disclosure, comprehensive coverage, and modern presentation techniques. The foundation established provides an excellent base for continued documentation development and maintenance. diff --git a/docs/BUILD_INSTRUCTIONS.md b/docs/BUILD_INSTRUCTIONS.md new file mode 100644 index 00000000..92fe1e44 --- /dev/null +++ b/docs/BUILD_INSTRUCTIONS.md @@ -0,0 +1,410 @@ +# Documentation Build Instructions + +Complete guide for building the nemo-run documentation. + +## **Prerequisites & Requirements** + +### **1. System Requirements** + +- **Python** (version specified in `.python-version`) +- **uv** package manager (fast Python package installer) +- **Windows PowerShell** (for Windows users) + +### **2. 
Required Dependencies** (from `requirements-docs.txt`) + +``` +sphinx +myst-parser +sphinx-autodoc2 +sphinx-copybutton +nvidia-sphinx-theme +sphinx-autobuild +sphinx-design +pinecone +openai +python-dotenv +sphinxcontrib-mermaid +swagger-plugin-for-sphinx +``` + +## **Setup Steps** + +### **Step 1: Install uv (if not already installed)** + +```bash +# Windows PowerShell +powershell -c "irm https://astral.sh/uv/install.ps1 | iex" + +# Or via pip +pip install uv +``` + +### **Step 2: Set up the documentation environment** + +```bash +# From project root directory +make docs-env +``` + +This command will: + +- Check if `uv` is installed +- Create virtual environment `.venv-docs` +- Install all dependencies from `requirements-docs.txt` + +## **Build Commands** + +### **Basic Build Commands** + +```bash +# Navigate to docs directory +cd docs + +# Build HTML documentation +uv run --active python -m sphinx -b html . _build/html +``` + +### **Alternative Build Commands** + +```bash +# Using Makefile (from project root) +make docs-html + +# Strict build (fails on warnings) +make docs-publish + +# With environment tags +uv run --active python -m sphinx -b html -t internal . _build/html +uv run --active python -m sphinx -b html -t ga . _build/html +uv run --active python -m sphinx -b html -t ea . _build/html +uv run --active python -m sphinx -b html -t draft . _build/html +``` + +### **Development Commands** + +```bash +# Start live-reload server +make docs-live + +# Clean built documentation +make docs-clean + +# Clean and rebuild +make docs-clean && make docs-html +``` + +## **Complete Setup & Build Process** + +### **One-time Setup:** + +```bash +# 1. Install uv (if needed) +pip install uv + +# 2. Set up environment +make docs-env +``` + +### **Regular Build Process:** + +```bash +# Option 1: Using Makefile (recommended) +make docs-html + +# Option 2: Direct Sphinx command +cd docs +uv run --active python -m sphinx -b html . 
_build/html
+```
+
+### **Development Workflow:**
+
+```bash
+# Start live server for development
+make docs-live
+
+# In another terminal, edit documentation files
+# Changes will automatically rebuild and refresh browser
+```
+
+## **Output Location**
+
+After successful build:
+
+- **HTML files**: `docs/_build/html/`
+- **Main index**: `docs/_build/html/nemo-run-index.html`
+- **Search page**: `docs/_build/html/search.html`
+
+## **Troubleshooting**
+
+### **If uv is not found:**
+
+```bash
+# Restart terminal after installation
+# Or manually add uv to PATH
+```
+
+### **If the virtual environment is broken:**
+
+```bash
+# Recreate environment
+rm -rf .venv-docs
+make docs-env
+```
+
+### **If build fails:**
+
+```bash
+# Clean and rebuild
+make docs-clean
+make docs-html
+```
+
+## **Environment-Specific Builds**
+
+```bash
+# Internal use
+make docs-html-internal
+
+# General Availability
+make docs-html-ga
+
+# Early Access
+make docs-html-ea
+
+# Draft
+make docs-html-draft
+```
+
+## **All Available Makefile Commands**
+
+### **Basic Commands**
+
+```bash
+docs-html     # Build HTML documentation
+docs-publish  # Build HTML documentation for publication (fail on warnings)
+docs-clean    # Clean built documentation
+docs-live     # Start live-reload server (sphinx-autobuild)
+docs-env      # Set up docs virtual environment with uv
+```
+
+### **Environment-Specific Commands**
+
+```bash
+# Internal environment builds
+docs-html-internal
+docs-publish-internal
+docs-live-internal
+
+# GA (General Availability) environment builds
+docs-html-ga
+docs-publish-ga
+docs-live-ga
+
+# EA (Early Access) environment builds
+docs-html-ea
+docs-publish-ea
+docs-live-ea
+
+# Draft environment builds
+docs-html-draft
+docs-publish-draft
+docs-live-draft
+```
+
+### **Pinecone Integration Commands**
+
+```bash
+docs-pinecone-test        # Test Pinecone connection
+docs-pinecone-upload-dry  # Upload documentation to Pinecone (dry run)
+docs-pinecone-upload      # Upload documentation to Pinecone
+docs-pinecone-update # Build docs and update Pinecone index +``` + +## **Cross-Platform Compatibility** + +The Makefile automatically detects your OS and uses the appropriate commands: + +- **Windows**: Uses `.venv-docs\Scripts\` paths +- **Unix/Linux/macOS**: Uses `.venv-docs/bin/` paths + +## **Usage Examples** + +```bash +# Quick build for development +make docs-html + +# Production build (strict) +make docs-publish + +# Development with live reload +make docs-live + +# Build with specific environment tag +make docs-html DOCS_ENV=ga + +# Clean and rebuild +make docs-clean && make docs-html +``` + +## **Sphinx Command Reference** + +### **Basic Sphinx Command** + +```bash +uv run --active python -m sphinx -b html . _build/html +``` + +### **Command Breakdown:** + +- `uv run --active` - Uses the active virtual environment (`.venv-docs`) +- `python -m sphinx` - Runs Sphinx as a Python module +- `-b html` - Specifies the HTML builder +- `.` - Source directory (current directory) +- `_build/html` - Output directory for the built HTML files + +### **Strict Build (Fails on Warnings):** + +```bash +uv run --active python -m sphinx --fail-on-warning --builder html . _build/html +``` + +The documentation build system is designed to be cross-platform and handles Windows PowerShell automatically through the Makefile configuration. 
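The OS detection described above is typically a small conditional in the Makefile itself. A sketch under the assumption that the virtual environment lives in `.venv-docs`; the variable name `VENV_BIN` and the recipe are illustrative, not necessarily the repository's actual Makefile:

```makefile
# Pick the virtualenv binary directory for the current platform.
# GNU Make inherits the OS variable, which Windows sets to "Windows_NT".
ifeq ($(OS),Windows_NT)
    VENV_BIN := .venv-docs\Scripts
else
    VENV_BIN := .venv-docs/bin
endif

docs-html:
	"$(VENV_BIN)/python" -m sphinx -b html docs docs/_build/html
```

Targets written this way need no per-platform duplication: the same `docs-html` recipe resolves to the correct interpreter path on Windows, Linux, and macOS.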
+ +## **Setting Up Documentation in a New Repository** + +### **Copying docs-example-project-setup to Your GitHub Repo** + +If you want to set up documentation in a new GitHub repository using the NVIDIA docs-example-project-setup as a template: + +#### **Step 1: Clone and Fork the Example Project** +```bash +# Clone the NVIDIA docs example project +git clone https://gitlab-master.nvidia.com/llane/docs-example-project-setup.git + +# Navigate to the cloned directory +cd docs-example-project-setup + +# Create a new branch for staging/sandbox testing +git checkout -b staging +``` + +#### **Step 1.5: Archive Source Repository Files** +```bash +# From your new GitHub repository root +# Create an archive directory to store the original source files +mkdir archive + +# Copy all files from the source repository to archive +cp -r docs-example-project-setup/* ./archive/ + +# Or if you want to preserve the git history in archive +cd archive +git clone https://gitlab-master.nvidia.com/llane/docs-example-project-setup.git +cd .. 
+ +# This archive directory serves as your starting point reference +# You can always go back to see the original structure and content +``` + +#### **Step 2: Copy Documentation Files to Your New Repo** +```bash +# From your new GitHub repository root +# Copy the essential documentation files and directories: + +# Copy the docs directory structure +cp -r docs-example-project-setup/docs/ ./docs/ + +# Copy the Makefile (contains docs build targets) +cp docs-example-project-setup/Makefile ./ + +# Copy requirements file +cp docs-example-project-setup/requirements-docs.txt ./ + +# Copy Sphinx configuration +cp docs-example-project-setup/docs/conf.py ./docs/ + +# Copy any custom extensions +cp -r docs-example-project-setup/docs/_extensions/ ./docs/_extensions/ +``` + +#### **Step 3: Customize for Your Project** +```bash +# Edit the Sphinx configuration +# Update docs/conf.py with your project details: +# - project name +# - version +# - author +# - theme settings +# - extensions + +# Update the Makefile if needed for your project structure + +# Modify documentation content in docs/ directory +# - Update index files +# - Customize content for your project +# - Add your own documentation pages +``` + +#### **Step 4: Set Up the Documentation Environment** +```bash +# Install uv (if not already installed) +pip install uv + +# Set up the documentation environment +make docs-env + +# Build the documentation +make docs-html +``` + +#### **Step 5: Test and Iterate** +```bash +# Start live server for development +make docs-live + +# Make changes to documentation files +# Changes will automatically rebuild and refresh browser + +# Test different build environments +make docs-html-internal +make docs-html-ga +make docs-html-ea +make docs-html-draft +``` + +#### **Step 6: Commit and Push to Your Repository** +```bash +# Add all documentation files +git add . 
+ +# Commit the changes +git commit -m "Add documentation setup from NVIDIA docs-example-project-setup" + +# Push to your staging branch +git push origin staging +``` + +### **Important Notes for Sandbox Testing** + +- **Environment Variables**: You may need to set up environment variables for features like Pinecone integration +- **Custom Extensions**: Some extensions may require additional configuration or API keys +- **Theme Customization**: The NVIDIA theme can be customized for your project branding +- **Content Structure**: Modify the documentation structure to match your project's needs +- **Build Testing**: Test all build environments (internal, ga, ea, draft) to ensure they work correctly + +### **Troubleshooting New Setup** + +```bash +# If build fails due to missing dependencies +make docs-clean +make docs-env +make docs-html + +# If extensions don't work +# Check docs/conf.py for proper extension configuration +# Verify all required packages are in requirements-docs.txt + +# If theme issues occur +# Check theme configuration in docs/conf.py +# Verify nvidia-sphinx-theme is properly installed +``` diff --git a/docs/README.md b/docs/README.md index 58ccb387..75190fc8 100644 --- a/docs/README.md +++ b/docs/README.md @@ -11,28 +11,35 @@ This is a comprehensive Sphinx documentation template designed for technical wri This template showcases advanced Sphinx documentation patterns and features: (️-complex-structure)= + ### 🏗️ **Complex Structure** + - Multi-level navigation with toctrees - Product-based content organization (Product A, B, C) - Hierarchical information architecture ### 🎨 **Modern Design** + - Grid-based layouts with responsive cards - Rich visual elements (icons, badges, images) - Professional styling with the Furo theme ### 🔗 **Advanced Navigation** + - Cross-references and internal linking - Conditional content rendering - Multi-section organization (️-sphinx-extensions)= + ### 🛠️ **Sphinx Extensions** + - MyST Markdown with advanced features 
- Sphinx Design for grid layouts - Custom extensions for specialized functionality ### 📊 **Content Patterns** + - Concept documentation with detailed explanations - Tutorial and how-to guide structures - Reference documentation organization @@ -61,7 +68,7 @@ docs/ │ ├── process-data/ │ └── save-export/ ├── product-c-analytics/ # Product C documentation -├── admin/ # Administrative guides +├── deployment/ # Administrative guides │ ├── deployment/ │ └── integrations/ └── reference/ # Reference documentation @@ -86,6 +93,7 @@ docs/ ## Features Demonstrated ### Grid Layouts + The template uses `sphinx-design` for responsive grid layouts: ```markdown @@ -104,6 +112,7 @@ Description text ``` ### Conditional Content + Content can be conditionally included based on build configuration: ```markdown @@ -114,6 +123,7 @@ This content only appears in non-GA builds ``` ### Cross-References + Comprehensive linking system with labeled references: ```markdown @@ -123,7 +133,7 @@ Comprehensive linking system with labeled references: Link to this section: {ref}`my-reference-label` ``` -## Building the Documentation +## Build the Documentation ```bash # Install dependencies @@ -174,14 +184,14 @@ This template provides a solid foundation for creating professional, maintainabl - [Grid Layouts](#grid-layouts) - [Conditional Content](#conditional-content) - [Cross-References](#cross-references) - - [Building the Documentation](#building-the-documentation) + - [Build the Documentation](#build-the-documentation) - [Customization Tips](#customization-tips) - [Requirements](#requirements) - [Documentation Development](#documentation-development) - [Set Up the Documentation Environment](#set-up-the-documentation-environment) - [Build the Documentation](#build-the-documentation) - [Build Variants](#build-variants) - - [Live Building](#live-building) + - [Build Live](#build-live) - [Conditional Content for Different Build Types](#conditional-content-for-different-build-types) - [1. 
File-Level Exclusion (Recommended for Entire Sections)](#1-file-level-exclusion-recommended-for-entire-sections) - [2. Grid Card Conditional Rendering](#2-grid-card-conditional-rendering) @@ -203,7 +213,6 @@ This template provides a solid foundation for creating professional, maintainabl - [Ansible with Mixed Syntax](#ansible-with-mixed-syntax) - [Benefits](#benefits) - ## Set Up the Documentation Environment Before building or serving the documentation, set up the docs environment using the Makefile: @@ -223,8 +232,8 @@ To build the NeMo Curator documentation, run: make docs-html ``` -* The resulting HTML files are generated in a `_build/html` folder under the project `docs/` folder. -* The generated Python API docs are placed in `apidocs` under the `docs/` folder. +- The resulting HTML files are generated in a `_build/html` folder under the project `docs/` folder. +- The generated Python API docs are placed in `apidocs` under the `docs/` folder. ### Build Variants @@ -234,7 +243,7 @@ The documentation supports different build variants: - `make docs-html-ga` - GA (General Availability) build (excludes EA-only content) - `make docs-html-ea` - EA (Early Access) build (includes all content) -## Live Building +## Build Live To serve the documentation with live updates as you edit, run: @@ -261,8 +270,9 @@ only: not ga ``` **Supported conditions:** + - `only: not ga` - Exclude from GA builds (EA-only content) -- `only: ga` - Include only in GA builds +- `only: ga` - Include only in GA builds - `only: not ea` - Exclude from EA builds - `only: internal` - Include only in internal builds @@ -274,7 +284,7 @@ Hide specific grid cards from certain builds: ```markdown :::{grid-item-card} Video Curation Features -:link: video-overview +:link: video-overview :link-type: ref :only: not ga Content for EA-only features @@ -295,7 +305,7 @@ Control navigation entries conditionally: :only: not ga ea-feature1.md -ea-feature2.md +ea-feature2.md :::: # Inline entry conditions (hides 
individual entries) @@ -322,7 +332,7 @@ another-standard-doc.md # Test default build (includes all content) make docs-html -# Test GA build (excludes EA-only content) +# Test GA build (excludes EA-only content) make docs-html-ga # Verify content is properly hidden/shown in each build @@ -384,7 +394,7 @@ extensions = [ # Define reusable variables myst_substitutions = { "product_name": "NeMo Curator", - "product_name_short": "Curator", + "product_name_short": "Curator", "company": "NVIDIA", "version": release, # Uses the release variable from conf.py "current_year": "2025", @@ -399,6 +409,7 @@ myst_substitutions = { ### Usage #### Basic MyST Substitutions in Text + Use `{{ variable }}` syntax in regular markdown text: ```markdown @@ -427,6 +438,7 @@ The extension intelligently protects template languages from unwanted substituti #### Protected Languages These languages are treated carefully to preserve their native `{{ }}` syntax: + - `yaml`, `yml` (Kubernetes, Docker Compose) - `helm`, `gotmpl`, `go-template` (Helm charts) - `jinja`, `jinja2`, `j2` (Ansible, Python templates) @@ -437,6 +449,7 @@ These languages are treated carefully to preserve their native `{{ }}` syntax: #### Pattern Protection The extension automatically detects and preserves common template patterns: + - `{{ .Values.something }}` (Helm values) - `{{ ansible_variable }}` (Ansible variables) - `{{ item.property }}` (Template loops) @@ -453,7 +466,7 @@ image: repository: nvcr.io/nvidia/nemo-curator tag: {{ .Values.image.tag | default "latest" }} # ← Helm template (preserved) -# Documentation URLs using MyST substitutions +# Documentation URLs using MyST substitutions downloads: releaseUrl: "https://github.com/NVIDIA/NeMo-Curator/releases/download/v{{ version }}/nemo-curator.tar.gz" # ← MyST substitution docsUrl: "{{ docs_url }}" # ← MyST substitution @@ -461,22 +474,22 @@ downloads: service: name: {{ include "nemo-curator.fullname" . 
}} # ← Helm template (preserved) - + env: - name: CURATOR_VERSION value: "{{ .Chart.AppVersion }}" # ← Helm template (preserved) - - name: DOCS_VERSION + - name: DOCS_VERSION value: "{{ version }}" # ← MyST substitution ``` -#### Ansible with Mixed Syntax +#### Ansible with Mixed Syntax ```yaml # MyST substitutions for documentation - name: "Install {{ product_name }} version {{ version }}" # ← MyST substitution shell: | wget {{ github_repo }}/releases/download/v{{ version }}/nemo-curator.tar.gz # ← MyST substitution - + # Ansible templates preserved when: "{{ ansible_distribution }} == 'Ubuntu'" # ← Ansible template (preserved) notify: "{{ handlers.restart_service }}" # ← Ansible template (preserved) @@ -490,4 +503,4 @@ env: 4. **Cross-Format Support**: Works in bash, python, dockerfile, and other code blocks 5. **Maintainability**: Reduces copy-paste errors and keeps documentation in sync with releases -The extension automatically handles the complexity of mixed template syntax, so you can focus on writing great documentation without worrying about breaking existing templates. \ No newline at end of file +The extension automatically handles the complexity of mixed template syntax, so you can focus on writing great documentation without worrying about breaking existing templates. diff --git a/docs/about/concepts/feature-set-a/bar.md b/docs/about/concepts/feature-set-a/bar.md deleted file mode 100644 index ca96e695..00000000 --- a/docs/about/concepts/feature-set-a/bar.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -description: "Learn about Bar concepts and their role in Feature Set A workflows and data processing." 
-tags: ["concepts", "bar", "workflows"] -categories: ["concepts"] ---- - -# Bar Concepts \ No newline at end of file diff --git a/docs/about/concepts/feature-set-a/bazz.md b/docs/about/concepts/feature-set-a/bazz.md deleted file mode 100644 index 728fe0ad..00000000 --- a/docs/about/concepts/feature-set-a/bazz.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -description: "Discover Bazz concepts and how they enhance Feature Set A capabilities for advanced data operations." -tags: ["concepts", "bazz", "advanced"] -categories: ["concepts"] ---- - -# Bazz Concepts \ No newline at end of file diff --git a/docs/about/concepts/feature-set-a/foo.md b/docs/about/concepts/feature-set-a/foo.md deleted file mode 100644 index 7af594a2..00000000 --- a/docs/about/concepts/feature-set-a/foo.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -description: "Understand core Foo concepts and how they integrate with Feature Set A functionality." -tags: ["concepts", "foo", "fundamentals"] -categories: ["concepts"] ---- - -# Foo Concepts \ No newline at end of file diff --git a/docs/about/concepts/feature-set-a/index.md b/docs/about/concepts/feature-set-a/index.md deleted file mode 100644 index c261c820..00000000 --- a/docs/about/concepts/feature-set-a/index.md +++ /dev/null @@ -1,44 +0,0 @@ -(about-concepts-product-a)= -# [Feature Set A] Concepts - -This page demonstrates how to document product concepts in your documentation system. - -(about-concepts-product-a-overview)= -## Overview - -This is an example concepts page that shows how you might organize conceptual information for a product feature set. In a real documentation system, this would contain: - -- Core terminology and definitions -- Key workflow concepts -- Architecture overviews -- Integration patterns - -(about-concepts-product-a-getting-started)= -## Getting Started with Concepts - -When documenting concepts for your product: - -1. **Start with fundamentals** - Define key terms and basic concepts -2. 
**Build complexity gradually** - Layer more advanced concepts on the basics -3. **Use visual aids** - Diagrams, flowcharts, and examples help clarify complex ideas -4. **Cross-reference extensively** - Link concepts to related documentation sections - -(about-concepts-product-a-organization)= -## Organizing Conceptual Content - -Consider organizing your concepts by: - -- **User journey stages** - Concepts users need at different points -- **Complexity levels** - Basic, intermediate, and advanced concepts -- **Product areas** - Separate concepts by feature or component -- **Use cases** - Organize around common user scenarios - -```{toctree} -:maxdepth: 2 -:titlesonly: -:hidden: - -bar -bazz -foo -``` diff --git a/docs/about/concepts/index.md b/docs/about/concepts/index.md deleted file mode 100644 index 47685ddd..00000000 --- a/docs/about/concepts/index.md +++ /dev/null @@ -1,23 +0,0 @@ -(about-concepts)= -# Core Concepts - -Learn about organizing documentation for specific product areas and workflows. - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Feature Set A Concepts -:link: about-concepts-product-a -:link-type: ref - -Example concepts page showing how to document product workflows, including data loading, processing, and output generation patterns. -::: - -:::: - -```{toctree} -:hidden: -:maxdepth: 2 - -[Feature Set A] Concepts -``` diff --git a/docs/about/index.md b/docs/about/index.md index 316e1f16..a46f4278 100644 --- a/docs/about/index.md +++ b/docs/about/index.md @@ -1,70 +1,168 @@ --- -description: "Learn about our platform's core concepts, key features, and fundamental architecture to understand how it works." -tags: ["overview", "concepts", "architecture", "features"] -categories: ["concepts"] +description: "Learn about NeMo Run's core concepts, key features, and fundamental architecture for ML experiment management and distributed computing." 
+tags: ["overview", "concepts", "architecture", "features", "ml", "distributed-computing"] +categories: ["about"] --- (about-overview)= -# About {{ product_name_short }} -This Documentation Template is an open-source, comprehensive showcase for scalable, modern documentation structures across multiple product areas and content types. -This template helps you create high-quality, well-structured documentation for complex software products and enterprise platforms. Whether you work with web documentation, internal knowledge bases, or public-facing product docs, this template supports your workflow. +# About NeMo Run -(about-overview-target-users)= -## Target Users +NeMo Run is a comprehensive Python framework for configuring, executing, and managing machine learning experiments across diverse computing environments. Built with a focus on reproducibility, flexibility, and scalability, NeMo Run decouples experiment configuration from execution, enabling researchers and ML engineers to seamlessly transition between local development, cloud platforms, and high-performance computing clusters. + +## What is NeMo Run? 
+ +NeMo Run provides a unified interface for ML experiment lifecycle management, addressing the common challenges of: + +- **Configuration Management**: Complex, nested configurations for models, data, and training parameters +- **Execution Orchestration**: Running experiments across different environments (local, Docker, Slurm, Kubernetes, cloud) +- **Experiment Tracking**: Managing, monitoring, and reproducing experiments with full metadata preservation + +The framework is built on three core pillars: + +::::{grid} 1 1 1 3 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration +:link: ../guides/configuration +:link-type: doc +:link-alt: Configuration guide + +Python-based configuration using Fiddle, supporting complex nested structures and type safety +::: + +:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Execution +:link: ../guides/execution +:link-type: doc +:link-alt: Execution guide + +Multi-environment execution with executors for local, Docker, Slurm, Kubernetes, and cloud platforms +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Management +:link: ../guides/management +:link-type: doc +:link-alt: Management guide + +Experiment lifecycle management with metadata tracking, logging, and reproducibility +::: + +:::: + +## Why Use NeMo Run? 
+ +NeMo Run addresses critical issues in ML experiment management through its unique approach: + +### 🔧 **Configuration Flexibility** + +NeMo Run's Python-based configuration system provides unprecedented flexibility: + +- **Type-Safe Configurations**: Automatic validation using Python's type annotations +- **Nested Configuration Support**: Intuitive dot notation for complex parameter hierarchies +- **Fiddle Integration**: Built on Google's Fiddle framework for robust configuration management +- **YAML Interoperability**: Support for external configuration files with seamless Python integration -- **Technical writers and documentation engineers**: Build and maintain comprehensive documentation systems for complex products. -- **Documentation managers and information architects**: Deploy and scale documentation projects across teams and product lines. -- **Open source maintainers**: Create professional documentation structures for community projects and developer tools. -- **Enterprise teams**: Ensure documentation consistency, accessibility, and quality for production software systems. +### 🚀 **Execution Modularity** -(about-overview-how-it-works)= -## How It Works +The framework's execution system enables true environment independence: -This template accelerates documentation development by using modern Sphinx extensions and proven content architecture patterns. You can structure content efficiently—from a single product to multi-product ecosystems. With modular layouts, advanced navigation, and seamless integration with modern documentation tools, this template is trusted by technical writing teams. 
+- **Executor Abstraction**: Mix and match tasks with different execution environments +- **Multi-Platform Support**: Local, Docker, Slurm, Kubernetes, and cloud platforms +- **Code Packaging**: Intelligent packaging strategies (Git archive, pattern-based, hybrid) +- **Launcher Integration**: Support for torchrun, fault tolerance, and custom launchers -- **Product A Workflows**: Content flows through structured sections (loading, processing, reporting), organized with clear navigation hierarchies and cross-references. -- **Product B Integration**: Uses grid layouts, card-based navigation, and modular content organization for complex integration scenarios. -- **Product C Analytics**: Built with advanced Sphinx features, conditional content rendering, and scalable information architecture patterns. +### 📊 **Experiment Management** -For more details, see the [Core Concepts](about-concepts) and [Key Features](about-key-features) sections below. +Comprehensive experiment tracking and management capabilities: -(about-overview-key-technologies)= -### Key Technologies +- **Metadata Preservation**: Automatic capture of configurations, logs, and artifacts +- **Reproducibility**: One-command experiment reconstruction from metadata +- **Status Monitoring**: Real-time experiment status and log access +- **Dependency Management**: Complex workflow orchestration with task dependencies -- **Sphinx Documentation**: Modern documentation generation with powerful extensions and themes. -- **MyST Markdown**: Advanced markdown parsing with rich directive support and cross-referencing. -- **Grid Layouts**: Responsive, card-based content organization for complex product documentation. -- **Conditional Content**: Dynamic content rendering based on build configuration and target audiences. 
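The metadata-preservation and reproducibility points above come down to one discipline: persist the full configuration next to every run so the run can be replayed later. NeMo Run handles this automatically; the sketch below only illustrates the underlying idea using the standard library (the function names are invented for illustration and are not NeMo Run's API):

```python
import hashlib
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def save_run_metadata(root: Path, name: str, config: dict) -> Path:
    """Persist a run's configuration so the run can be reconstructed later."""
    run_dir = root / name
    run_dir.mkdir(parents=True, exist_ok=True)
    payload = json.dumps(config, sort_keys=True, indent=2)
    (run_dir / "config.json").write_text(payload)
    # A content hash makes config drift between runs easy to detect.
    (run_dir / "config.sha256").write_text(hashlib.sha256(payload.encode()).hexdigest())
    return run_dir


def load_run_metadata(run_dir: Path) -> dict:
    """Reconstruct the exact configuration a past run used."""
    return json.loads((run_dir / "config.json").read_text())


with TemporaryDirectory() as tmp:
    cfg = {"model": {"hidden_size": 1024}, "training": {"learning_rate": 2e-4}}
    run_dir = save_run_metadata(Path(tmp), "exp-001", cfg)
    assert load_run_metadata(run_dir) == cfg  # round-trips exactly
```

Sorting keys before serializing keeps the hash stable across runs that build the same configuration in a different order.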
+## Target Users + +NeMo Run is designed for ML practitioners who need robust experiment management: + +- **ML Researchers**: Conducting experiments across multiple environments with full reproducibility +- **ML Engineers**: Building production ML pipelines with consistent configuration management +- **DevOps Engineers**: Managing ML infrastructure across diverse computing platforms +- **Data Scientists**: Prototyping and scaling ML experiments with minimal infrastructure overhead + +## Key Technologies + +NeMo Run leverages modern Python ecosystem technologies: + +- **Fiddle**: Google's configuration framework for type-safe, composable configurations +- **TorchX**: PyTorch's job submission framework for distributed execution +- **Docker**: Container-based execution for consistent environments +- **Ray**: Distributed computing framework integration for scalable ML workloads +- **Typer**: Modern CLI framework for rich command-line interfaces -(about-overview-core-concepts)= -## Core Concepts +## Core Architecture -Explore the foundational concepts and organizational patterns used across this documentation template. +NeMo Run's architecture follows a clean separation of concerns: ::::{grid} 1 1 1 2 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Product A Concepts -:link: about-concepts-product-a -:link-type: ref +:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` Configuration Layer +:link: ../guides/configuration +:link-type: doc +:link-alt: Configuration guide -Explore key concepts for Product A workflows, including scalable data loading, processing (transformation, validation, filtering), and report generation. 
+Fiddle-based configuration system with type safety and validation ::: +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Execution Layer +:link: ../guides/execution +:link-type: doc +:link-alt: Execution guide +Executor abstraction with multi-platform support and intelligent packaging +::: + +:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Management Layer +:link: ../guides/management +:link-type: doc +:link-alt: Management guide + +Experiment lifecycle management with metadata tracking and reproducibility +::: + +:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` Interface Layer +:link: ../reference/faqs +:link-type: doc +:link-alt: Reference + +Rich CLI interface with type-safe argument parsing and configuration overrides +::: :::: -(about-overview-about-template)= -## About This Template +## Getting Started + +Ready to start using NeMo Run? Begin with these essential guides: + +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 -This template demonstrates advanced Sphinx documentation patterns including: +:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Quick Start +:link: ../get-started/index +:link-type: doc +:link-alt: Get started guide -- **Complex Navigation**: Multi-level toctrees with conditional content -- **Rich Content Layout**: Grid systems, cards, and responsive design -- **Cross-Reference Systems**: Comprehensive linking and reference management -- **Extension Integration**: Custom Sphinx extensions and advanced features -- **Scalable Architecture**: Patterns that work from small projects to enterprise-scale documentation +Set up your first NeMo Run experiment in minutes +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Key Features +:link: key-features +:link-type: doc +:link-alt: Key features + +Explore the technical capabilities and implementation details +::: + +:::: -Perfect for teams who need to create sophisticated, maintainable documentation systems. 
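As a conceptual preview of the quick-start flow: `run.Partial` captures configuration now and defers execution until launch time. The stdlib analogy below uses `functools.partial`, which lacks NeMo Run's type checking and CLI exposure, but shows the deferred-configuration idea without assuming NeMo Run is installed (this is not the actual API):

```python
from functools import partial


def train(hidden_size: int = 512, learning_rate: float = 1e-4) -> dict:
    """Stand-in training entry point; it just echoes its resolved settings."""
    return {"hidden_size": hidden_size, "learning_rate": learning_rate}


# Capture overrides now, execute later; conceptually what run.Partial does.
configured = partial(train, hidden_size=1024)

# At launch time, remaining parameters can still be overridden.
result = configured(learning_rate=2e-4)
assert result == {"hidden_size": 1024, "learning_rate": 2e-4}
```

The same separation is what lets NeMo Run expose unset parameters on the command line and hand the configured task to any executor.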
+For detailed information about specific features, explore the [Configuration](../guides/configuration), [Execution](../guides/execution), and [Management](../guides/management) guides. diff --git a/docs/about/key-features.md b/docs/about/key-features.md index b9b61216..346dd47c 100644 --- a/docs/about/key-features.md +++ b/docs/about/key-features.md @@ -1,5 +1,864 @@ -(about-key-features)= -# Key Features +--- +description: "Comprehensive technical overview of NeMo Run's advanced capabilities for AI research and ML experiment management, including distributed computing, configuration systems, and experiment orchestration." +tags: ["features", "capabilities", "technical", "implementation", "ml", "experiment-management", "ai-research", "distributed-computing"] +categories: ["about"] +--- -This page showcases the key features and capabilities of the product for buyers, technical decision makers, and evaluators assessing applicability and fit for their use cases. +(key-features)= +# Technical Capabilities and Features + +NeMo Run provides a comprehensive framework designed specifically for AI researchers and ML practitioners, offering advanced capabilities for experiment management, distributed computing, and reproducible research workflows. This document provides a detailed technical overview of the core systems and implementation features that power NeMo Run's functionality. 
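Before diving into the individual systems, the central design idea, configuration as plain data handed to interchangeable executors, can be sketched in a few lines (all names here are illustrative, not NeMo Run's API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TrainConfig:
    hidden_size: int
    epochs: int


def run_local(task: Callable[[], str]) -> str:
    # A real executor would isolate the process; here we just call the task.
    return task()


def run_logged(task: Callable[[], str]) -> str:
    # A different "environment": same task, different execution wrapper.
    return "[cluster] " + task()


def launch(cfg: TrainConfig, executor: Callable[[Callable[[], str]], str]) -> str:
    # The task closes over the config; the executor decides where it runs.
    return executor(lambda: f"trained {cfg.hidden_size} for {cfg.epochs} epochs")


cfg = TrainConfig(hidden_size=1024, epochs=3)
assert launch(cfg, run_local) == "trained 1024 for 3 epochs"
assert launch(cfg, run_logged).startswith("[cluster] ")
```

Because the task never inspects its executor, swapping local execution for a cluster backend changes one argument, not the experiment code; that is the separation of concerns the sections below describe in detail.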
+ +## Core Architecture Overview + +NeMo Run's architecture is built around several interconnected systems that provide a unified interface for ML experiment management: + +- **Configuration System**: Type-safe, composable configuration management +- **Execution Framework**: Multi-backend execution across diverse computing environments +- **Experiment Orchestration**: Advanced experiment lifecycle management +- **Distributed Computing**: Ray integration for scalable distributed training +- **Packaging System**: Reproducible code packaging and deployment +- **CLI Framework**: Intelligent command-line interface with type safety + +## Advanced Configuration System + +### Type-Safe Configuration Management + +NeMo Run's configuration system provides compile-time type safety and runtime validation, ensuring configuration correctness and enabling advanced IDE support. + +#### Core Configuration Classes + +**`run.Config`** - Direct configuration objects with type validation + +- **Type Safety**: Compile-time type checking with Python's type system +- **Validation**: Runtime validation with custom validation rules +- **Nested Configuration**: Hierarchical configuration with dot notation access +- **Data Class Integration**: Seamless integration with Python data classes +- **Transformation**: Configuration broadcasting and functional transformations + +**`run.Partial`** - Lazy configuration with CLI integration + +- **Lazy Evaluation**: Deferred configuration resolution +- **CLI Integration**: Automatic parameter exposure to command line +- **Factory Support**: Complex object creation through factory functions +- **Composition**: Configuration inheritance and composition patterns +- **Type Inference**: Automatic type inference from function signatures + +**`run.Script`** - Script-based execution configurations + +- **External Scripts**: Execution of external scripts with parameter passing +- **Environment Management**: Comprehensive environment variable control +- **Path 
Configuration**: Working directory and path management +- **Validation**: Script validation and preprocessing capabilities + +#### Advanced Configuration Features + +- **Configuration Walking**: Functional transformation of configuration trees +- **Configuration Diffing**: Visual comparison of configuration changes +- **Multi-Format Export**: Export to YAML, TOML, JSON, or Python code +- **Configuration Broadcasting**: Apply changes across nested structures +- **Validation Rules**: Custom validation logic for complex constraints + +::::{dropdown} Advanced Configuration Example +:icon: code-square + +```python +from dataclasses import dataclass +from typing import Optional, List +import nemo_run as run + +@dataclass +class ModelConfig: + architecture: str = "transformer" + hidden_size: int = 512 + num_layers: int = 12 + num_heads: int = 8 + dropout: float = 0.1 + activation: str = "gelu" + + def __post_init__(self): + assert self.hidden_size % self.num_heads == 0, "hidden_size must be divisible by num_heads" + +@dataclass +class TrainingConfig: + learning_rate: float = 1e-4 + batch_size: int = 32 + epochs: int = 100 + optimizer: str = "adam" + scheduler: Optional[str] = "cosine" + warmup_steps: int = 1000 + gradient_clipping: float = 1.0 + +@dataclass +class DataConfig: + dataset_path: str + tokenizer_path: str + max_length: int = 512 + num_workers: int = 4 + +# Direct configuration with validation +config = run.Config( + ModelConfig, + hidden_size=1024, + num_layers=24, + num_heads=16 +) +model = config.build() # Returns validated ModelConfig instance + +# Partial configuration with CLI integration +@run.partial +def train_model( + model: ModelConfig, + training: TrainingConfig, + data: DataConfig, + seed: int = 42 +): + """Train a machine learning model with given configuration.""" + # Training implementation + pass + +# CLI usage with type safety +# python train.py model.hidden_size=1024 training.learning_rate=2e-4 data.max_length=1024 +``` + +:::: + +## 
Multi-Environment Execution Framework + +### Execution Backend Architecture + +NeMo Run provides a unified execution interface across diverse computing environments, from local development to large-scale distributed clusters. + +#### Local Execution Environment + +**`run.LocalExecutor`** - Local process execution with resource management + +- **Process Isolation**: Isolated execution environments +- **Resource Management**: CPU and memory allocation control +- **Environment Variables**: Comprehensive environment configuration +- **Working Directory**: Path and working directory management +- **Log Capture**: Centralized log collection and redirection + +#### Containerized Execution + +**`run.DockerExecutor`** - Docker container execution with GPU support + +- **Custom Images**: Support for custom container images +- **GPU Acceleration**: Native GPU support with CUDA integration +- **Volume Management**: Flexible volume mounting and file sharing +- **Network Configuration**: Port forwarding and network setup +- **Resource Constraints**: CPU, memory, and GPU limits + +#### High-Performance Computing + +**`run.SlurmExecutor`** - HPC cluster execution via Slurm + +- **Job Submission**: Native Slurm job submission and management +- **Multi-Node Support**: Distributed execution across multiple nodes +- **GPU Allocation**: Multi-GPU and multi-node GPU support +- **Resource Scheduling**: Advanced resource allocation and scheduling +- **SSH Tunneling**: Secure remote access to cluster resources + +#### Cloud Computing Platforms + +**`run.SkypilotExecutor`** - Multi-cloud execution with cost optimization + +- **Multi-Cloud Support**: AWS, GCP, Azure, and Lambda Cloud +- **Automatic Provisioning**: On-demand resource provisioning +- **Cost Optimization**: Spot instance and cost-aware scheduling +- **Cloud Optimizations**: Platform-specific performance optimizations + +**`run.DGXCloudExecutor`** - NVIDIA DGX Cloud execution + +- **DGX Integration**: Native DGX Cloud cluster 
management
+- **Lepton Integration**: Seamless Lepton service integration
+- **GPU Allocation**: Optimized GPU resource allocation
+- **Cloud-Native Features**: Leverage cloud-native capabilities
+
+**`run.LeptonExecutor`** - Lepton cloud execution
+
+- **Lepton Deployment**: Automated Lepton cluster deployment
+- **Auto-Scaling**: Dynamic resource scaling based on workload
+- **Cost Tracking**: Real-time cost monitoring and optimization
+- **Service Integration**: Integration with Lepton ecosystem services
+
+::::{dropdown} Multi-Environment Execution Example
+:icon: code-square
+
+```python
+import nemo_run as run
+from nemo_run.core.execution.docker import DockerExecutor
+from nemo_run.core.execution.slurm import SlurmExecutor
+from nemo_run.core.execution.skypilot import SkypilotExecutor
+
+# Local execution for development
+local_exec = run.LocalExecutor(
+    env_vars={
+        "CUDA_VISIBLE_DEVICES": "0,1",
+        "PYTHONPATH": "/path/to/project",
+        "WANDB_PROJECT": "ml-experiment"
+    },
+    working_dir="/path/to/project"
+)
+
+# Docker execution for reproducible environments
+docker_exec = DockerExecutor(
+    container_image="nvidia/pytorch:24.05-py3",
+    num_gpus=4,
+    volumes=[
+        "/data:/data",
+        "/models:/models",
+        "/cache:/cache"
+    ],
+    ports=["8080:8080"],
+    env_vars={
+        "NCCL_DEBUG": "INFO",
+        "CUDA_VISIBLE_DEVICES": "0,1,2,3"
+    }
+)
+
+# Slurm execution for HPC clusters
+slurm_exec = SlurmExecutor(
+    nodes=4,
+    gpus_per_node=8,
+    time="12:00:00",
+    account="gpu-dept",
+    partition="a100",
+    qos="high",
+    env_vars={
+        "NCCL_IB_DISABLE": "0",
+        "NCCL_DEBUG": "INFO"
+    }
+)
+
+# SkyPilot execution for cloud computing
+skypilot_exec = SkypilotExecutor(
+    cloud="aws",
+    instance_type="g4dn.12xlarge",
+    region="us-west-2",
+    spot=True,  # Use spot instances for cost optimization
+    env_vars={
+        "WANDB_API_KEY": "your-api-key"
+    }
+)
+
+# Unified execution interface
+@run.partial
+def distributed_training(config, num_epochs=100):
+    """Distributed training function."""
+    # Training
implementation + pass + +# Execute on different environments +result_local = distributed_training.with_executor(local_exec)(config, num_epochs=10) +result_docker = distributed_training.with_executor(docker_exec)(config, num_epochs=100) +result_slurm = distributed_training.with_executor(slurm_exec)(config, num_epochs=1000) +result_cloud = distributed_training.with_executor(skypilot_exec)(config, num_epochs=500) +``` + +:::: + +## Advanced Experiment Management + +### Experiment Lifecycle Orchestration + +NeMo Run's experiment management system provides comprehensive lifecycle management for complex ML experiments, enabling reproducible research and systematic experimentation. + +#### Experiment Lifecycle Management + +**`run.Experiment`** - Advanced experiment orchestration + +- **Experiment Creation**: Structured experiment initialization +- **Task Orchestration**: Complex task dependency management +- **Execution Monitoring**: Real-time execution monitoring +- **Metadata Capture**: Comprehensive metadata collection +- **Reproducibility**: Full experiment reproduction capabilities + +#### Advanced Task Management + +- **Task Dependencies**: Complex dependency graphs and workflows +- **Parallel Execution**: Concurrent execution of independent tasks +- **Resource Allocation**: Intelligent resource allocation and scheduling +- **Status Tracking**: Real-time task status and progress monitoring +- **Error Handling**: Robust error handling and recovery mechanisms + +#### Comprehensive Metadata Management + +- **Configuration Serialization**: Automatic configuration capture and serialization +- **Log Aggregation**: Centralized log collection and analysis +- **Artifact Tracking**: Automatic artifact collection and versioning +- **Experiment Reconstruction**: Full experiment reproduction from metadata +- **Performance Metrics**: Comprehensive performance monitoring and analysis + +::::{dropdown} Advanced Experiment Management Example +:icon: code-square + +```python +import 
nemo_run as run +from typing import Dict, Any + +# Create comprehensive experiment +with run.Experiment( + name="transformer-architecture-study", + description="Comprehensive study of transformer architectures", + tags=["transformer", "nlp", "research"] +) as exp: + + # Define hyperparameter space + model_configs = [ + {"hidden_size": 512, "num_layers": 12, "num_heads": 8}, + {"hidden_size": 768, "num_layers": 12, "num_heads": 12}, + {"hidden_size": 1024, "num_layers": 24, "num_heads": 16}, + {"hidden_size": 1536, "num_layers": 36, "num_heads": 24} + ] + + training_configs = [ + {"learning_rate": 1e-4, "batch_size": 32}, + {"learning_rate": 2e-4, "batch_size": 64}, + {"learning_rate": 5e-4, "batch_size": 128} + ] + + # Create training tasks + training_tasks = [] + for model_config in model_configs: + for training_config in training_configs: + task = exp.add( + train_transformer, + model_config=model_config, + training_config=training_config, + executor=slurm_exec, + name=f"train-{model_config['hidden_size']}-{training_config['learning_rate']}" + ) + training_tasks.append(task) + + # Add evaluation task with dependencies + exp.add( + evaluate_models, + dependencies=training_tasks, + executor=local_exec, + name="comprehensive-evaluation" + ) + + # Add analysis task + exp.add( + analyze_results, + dependencies=training_tasks, + executor=local_exec, + name="results-analysis" + ) + + # Launch experiment with monitoring + exp.run( + tail_logs=True, + sequential=False, + max_concurrent=4 + ) + +# Later experiment reconstruction and analysis +exp = run.Experiment.from_id("transformer-architecture-study_20241201_123456") + +# Access experiment metadata +metadata = exp.metadata() +configs = exp.configurations() +results = exp.results() + +# Analyze specific task +task_logs = exp.logs("train-1024-2e-4") +task_metrics = exp.metrics("train-1024-2e-4") +``` + +:::: + +## Distributed Computing with Ray + +### Ray Integration Architecture + +NeMo Run provides comprehensive Ray 
integration for distributed computing, enabling scalable ML training and inference across diverse computing environments. + +#### Ray Cluster Management + +**`run.ray.cluster.RayCluster`** - Advanced Ray cluster lifecycle management + +- **Cluster Creation**: Automated cluster creation and initialization +- **Resource Allocation**: Dynamic resource allocation and configuration +- **Port Forwarding**: Secure dashboard access and monitoring +- **Health Monitoring**: Cluster health monitoring and auto-recovery +- **Resource Cleanup**: Automatic resource cleanup and management + +#### Ray Job Management + +**`run.ray.job.RayJob`** - Ray job submission and monitoring + +- **Job Submission**: Advanced job submission to Ray clusters +- **Runtime Environment**: Comprehensive runtime environment configuration +- **Log Streaming**: Real-time log streaming and monitoring +- **Status Tracking**: Detailed job status tracking and management +- **Error Recovery**: Automatic error recovery and retry mechanisms + +#### Multi-Environment Ray Support + +- **KubeRay**: Kubernetes-based Ray cluster management +- **Slurm Ray**: Ray clusters on HPC systems via Slurm +- **Local Ray**: Local Ray cluster for development and testing +- **Cloud Ray**: Cloud-based Ray clusters with auto-scaling + +::::{dropdown} Advanced Ray Integration Example +:icon: code-square + +```python +from nemo_run.run.ray.cluster import RayCluster +from nemo_run.run.ray.job import RayJob +from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup + +# Configure advanced KubeRay executor +executor = KubeRayExecutor( + namespace="ml-research", + ray_version="2.43.0", + image="anyscale/ray:2.43.0-py312-cu125", + head_cpu="8", + head_memory="32Gi", + worker_groups=[ + KubeRayWorkerGroup( + group_name="gpu-workers", + replicas=4, + gpus_per_worker=8, + cpu_per_worker="32", + memory_per_worker="128Gi", + min_replicas=2, + max_replicas=8 + ), + KubeRayWorkerGroup( + group_name="cpu-workers", + 
replicas=2, + cpu_per_worker="16", + memory_per_worker="64Gi" + ) + ], + volume_mounts=[ + {"name": "workspace", "mountPath": "/workspace"}, + {"name": "datasets", "mountPath": "/datasets"} + ], + volumes=[ + { + "name": "workspace", + "persistentVolumeClaim": {"claimName": "ml-workspace-pvc"} + }, + { + "name": "datasets", + "persistentVolumeClaim": {"claimName": "datasets-pvc"} + } + ], + env_vars={ + "NCCL_DEBUG": "INFO", + "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7" + } +) + +# Deploy Ray cluster +cluster = RayCluster(name="research-cluster", executor=executor) +cluster.start( + timeout=1800, + pre_ray_start_commands=[ + "pip install -r requirements.txt", + "mkdir -p /workspace/cache" + ] +) + +# Set up dashboard access +cluster.port_forward(port=8265, target_port=8265) +print("Ray dashboard available at: http://localhost:8265") + +# Submit distributed training job +job = RayJob(name="distributed-training", executor=executor) +job.start( + command="python train.py --config configs/distributed.yaml", + workdir="/workspace/project", + runtime_env_yaml="/workspace/project/runtime_env.yaml", + pre_ray_start_commands=[ + "pip install -r requirements.txt" + ] +) + +# Monitor job execution +job.logs(follow=True) + +# Clean up resources +cluster.stop() +``` + +:::: + +## Intelligent CLI Framework + +### Advanced Command-Line Interface + +NeMo Run's CLI system provides intelligent command-line interaction with type safety, automatic parameter discovery, and advanced configuration management. 
+ +#### CLI Architecture + +**`run.cli.entrypoint`** - Advanced CLI entry point decorator + +- **Parameter Discovery**: Automatic parameter exposure from function signatures +- **Type Safety**: Type-safe argument parsing and validation +- **Nested Configuration**: Dot notation for nested configuration overrides +- **Error Correction**: Intelligent error correction and suggestions +- **Auto-Completion**: Advanced auto-completion for parameters and values + +#### Advanced CLI Features + +- **Factory Functions**: Complex object creation via CLI with factory patterns +- **Configuration Files**: Dynamic configuration loading with `@` syntax +- **Dry Run Mode**: Execution preview without actual execution +- **Configuration Export**: Multi-format configuration export capabilities +- **Rich Output**: Formatted tables, syntax highlighting, and progress indicators + +::::{dropdown} Advanced CLI Example +:icon: code-square + +```python +import nemo_run as run +from dataclasses import dataclass + +@dataclass +class ModelConfig: + architecture: str = "transformer" + hidden_size: int = 512 + num_layers: int = 12 + dropout: float = 0.1 + +@dataclass +class TrainingConfig: + learning_rate: float = 1e-4 + batch_size: int = 32 + epochs: int = 100 + optimizer: str = "adam" + +@run.cli.entrypoint +def train_model( + model: ModelConfig, + training: TrainingConfig, + data_path: str, + output_dir: str, + seed: int = 42, + debug: bool = False +): + """Train a machine learning model with comprehensive configuration.""" + # Training implementation + pass + +# CLI usage examples: +# Basic parameter overrides +# python train.py model.hidden_size=1024 training.learning_rate=2e-4 + +# Configuration file loading +# python train.py --factory @configs/base.yaml model.num_layers=24 + +# Nested configuration with operations +# python train.py model.hidden_size*=2 training.batch_size+=16 + +# Factory function usage +# python train.py --factory executor=@executors/slurm.yaml + 
+# Dry run to preview execution +# python train.py --dryrun model.hidden_size=512 + +# Export configuration +# python train.py --to-yaml config.yaml model.hidden_size=512 + +# Advanced parameter validation +# python train.py model.hidden_size=1024 training.learning_rate=2e-4 data_path=/path/to/data +``` + +:::: + +## Advanced Packaging System + +### Reproducible Code Packaging + +NeMo Run's packaging system ensures reproducible execution across different environments by providing flexible and efficient code packaging strategies. + +#### Packaging Strategies + +**`run.GitArchivePackager`** - Git-based packaging for version control + +- **Git Integration**: Package committed code using git archive +- **Version Control**: Automatic version tracking and reproducibility +- **Dependency Resolution**: Automatic dependency resolution from git +- **Clean Packages**: Reproducible, clean package generation + +**`run.PatternPackager`** - Pattern-based selective packaging + +- **Selective Inclusion**: File inclusion with glob patterns +- **Custom Rules**: Advanced inclusion/exclusion rules +- **File Filtering**: Custom file filtering and transformation +- **Flexible Strategies**: Adaptable packaging strategies + +**`run.HybridPackager`** - Combined packaging strategies + +- **Strategy Combination**: Multiple packaging strategy integration +- **Custom Logic**: Custom packaging logic and rules +- **Conditional Packaging**: Conditional packaging based on context +- **Advanced Workflows**: Complex packaging workflows + +::::{dropdown} Advanced Packaging Example +:icon: code-square + +```python +import nemo_run as run +import os + +# Git archive packaging for version control +git_packager = run.GitArchivePackager( + subpath="src",  # Package only src directory + exclude_patterns=["*.pyc", "__pycache__"] +) + +# Pattern-based packaging for selective inclusion +pattern_packager = run.PatternPackager( + include=[ + "*.py", + "*.yaml", + "*.json", + "configs/**/*", + "models/**/*" + ], + exclude=[ + 
"__pycache__", + "*.pyc", + "tests/", + "docs/", + ".git/", + "*.log" + ], + relative_path=os.getcwd() +) + +# Hybrid packaging combining multiple strategies +hybrid_packager = run.HybridPackager([ + git_packager, + pattern_packager +]) + +# Use with executor for reproducible execution +executor = run.SlurmExecutor( + packager=hybrid_packager, + nodes=2, + gpus_per_node=8 +) + +# Execute with packaged code +@run.partial(executor=executor) +def distributed_training(config): + """Distributed training with packaged code.""" + pass +``` + +:::: + +## Extensible Plugin System + +### Plugin Architecture for Customization + +NeMo Run's plugin system enables advanced customization and extension of functionality through a flexible plugin architecture. + +#### Plugin Framework + +**`run.Plugin`** - Base plugin class for extensibility + +- **Task Modification**: Modify tasks before execution +- **Executor Enhancement**: Enhance executor capabilities +- **Configuration Injection**: Inject configuration parameters +- **Environment Setup**: Custom environment setup and cleanup +- **Custom Logic**: Implement custom experiment logic + +#### Advanced Plugin Features + +- **Setup Hooks**: Pre-execution task and executor modification +- **Configuration Injection**: Dynamic configuration parameter injection +- **Environment Management**: Custom execution environment setup +- **Custom Logic**: Implementation of custom experiment logic +- **Plugin Composition**: Plugin composition and chaining + +::::{dropdown} Advanced Plugin Example +:icon: code-square + +```python +import nemo_run as run +from typing import Dict, Any + +class AdvancedLoggingPlugin(run.Plugin): + """Advanced logging plugin with custom configuration.""" + + def __init__(self, log_level: str = "INFO", log_file: str = None): + self.log_level = log_level + self.log_file = log_file + + def setup(self, task, executor): + """Setup logging configuration for task and executor.""" + # Configure executor environment + if 
hasattr(executor, 'env_vars'): + executor.env_vars.update({ + 'LOG_LEVEL': self.log_level, + 'LOG_FILE': self.log_file or '/tmp/nemo_run.log', + 'WANDB_PROJECT': 'nemo-run-experiments' + }) + + # Modify task configuration + if hasattr(task, 'config'): + task.config.logging = { + 'level': self.log_level, + 'file': self.log_file, + 'wandb': True + } + +class ResourceMonitoringPlugin(run.Plugin): + """Resource monitoring plugin for performance tracking.""" + + def setup(self, task, executor): + """Setup resource monitoring.""" + if hasattr(executor, 'env_vars'): + executor.env_vars.update({ + 'ENABLE_MONITORING': 'true', + 'MONITOR_INTERVAL': '60' + }) + +class CustomValidationPlugin(run.Plugin): + """Custom validation plugin for configuration validation.""" + + def setup(self, task, executor): + """Validate configuration before execution.""" + if hasattr(task, 'config'): + # Custom validation logic + if task.config.get('learning_rate', 0) <= 0: + raise ValueError("Learning rate must be positive") + +# Use multiple plugins +plugins = [ + AdvancedLoggingPlugin(log_level="DEBUG"), + ResourceMonitoringPlugin(), + CustomValidationPlugin() +] + +# Execute with plugins +run.run(config, executor=executor, plugins=plugins) +``` + +:::: + +## Secure Tunneling System + +### Network Security and Remote Access + +NeMo Run's tunneling system provides secure remote access to computing resources with comprehensive network security features. 
+ +#### SSH Tunneling + +**`run.SSHTunnel`** - Advanced SSH tunnel management + +- **Secure Access**: Encrypted remote access to clusters +- **Port Forwarding**: Dynamic port forwarding and connection management +- **Authentication**: Multi-factor authentication and key management +- **Connection Monitoring**: Real-time connection health monitoring +- **Auto-Reconnection**: Automatic reconnection on connection loss + +#### Local Tunneling + +**`run.LocalTunnel`** - Local tunnel management + +- **Local Forwarding**: Local port forwarding for service access +- **Service Discovery**: Automatic service discovery and connection +- **Network Configuration**: Advanced network configuration options +- **Tunnel Lifecycle**: Comprehensive tunnel lifecycle management + +::::{dropdown} Advanced Tunneling Example +:icon: code-square + +```python +import nemo_run as run +from pathlib import Path + +# Advanced SSH tunnel configuration +ssh_tunnel = run.SSHTunnel( + host="login.cluster.com", + user="researcher", + identity="~/.ssh/id_ed25519", + job_dir="/scratch/researcher/runs", + port=22, + timeout=30, + compression=True, + keepalive_interval=60 +) + +# Use with Slurm executor for secure remote access +executor = run.SlurmExecutor( + tunnel=ssh_tunnel, + nodes=4, + gpus_per_node=8, + account="gpu-dept", + partition="a100" +) + +# Execute with secure tunnel +@run.partial(executor=executor) +def secure_training(config): + """Secure training on remote cluster.""" + pass + +result = secure_training(config) +``` + +:::: + +## Performance Optimization Features + +### Advanced Performance Capabilities + +NeMo Run includes sophisticated performance optimization features designed for high-performance ML workloads. 
+ +#### Resource Optimization + +- **Intelligent Scheduling**: Advanced resource scheduling algorithms +- **Load Balancing**: Automatic load balancing across resources +- **Resource Monitoring**: Real-time resource utilization monitoring +- **Performance Profiling**: Built-in performance profiling capabilities +- **Optimization Recommendations**: AI-driven optimization suggestions + +#### Scalability Features + +- **Auto-Scaling**: Automatic scaling based on workload demands +- **Horizontal Scaling**: Seamless horizontal scaling across nodes +- **Vertical Scaling**: Dynamic vertical scaling of resources +- **Cost Optimization**: Intelligent cost optimization strategies +- **Performance Tuning**: Automatic performance tuning and optimization + +## Research and Development Features + +### Advanced Research Capabilities + +NeMo Run provides specialized features for AI research and development workflows. + +#### Experimentation Support + +- **Hyperparameter Optimization**: Built-in hyperparameter optimization +- **A/B Testing**: Comprehensive A/B testing framework +- **Reproducibility**: Full experiment reproducibility guarantees +- **Version Control**: Integration with version control systems +- **Collaboration**: Multi-user collaboration features + +#### Research Workflow Integration + +- **Jupyter Integration**: Seamless Jupyter notebook integration +- **Paper Trail**: Automatic generation of experiment documentation +- **Citation Support**: Built-in citation and attribution tracking +- **Publication Ready**: Publication-ready result formatting +- **Open Science**: Support for open science practices + +These advanced technical capabilities provide AI researchers with a comprehensive framework for conducting sophisticated, reproducible, and scalable ML experiments across diverse computing environments. NeMo Run's architecture is designed to support the complex requirements of modern AI research while maintaining simplicity and usability. 
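The hyperparameter sweeps listed under Experimentation Support follow the same shape as the experiment example earlier: expand a parameter grid, then register one task per combination. The grid expansion itself is plain Python; the `exp.add` wiring (commented out below) reuses the task and executor names from that earlier example and is illustrative only:

```python
from itertools import product

model_configs = [
    {"hidden_size": 512, "num_layers": 12},
    {"hidden_size": 1024, "num_layers": 24},
]
training_configs = [
    {"learning_rate": 1e-4, "batch_size": 32},
    {"learning_rate": 2e-4, "batch_size": 64},
]

# One flat dict per (model, training) combination, replacing nested loops
sweep = [{**m, **t} for m, t in product(model_configs, training_configs)]
assert len(sweep) == len(model_configs) * len(training_configs)

# Each combination then becomes one experiment task, for example:
# for params in sweep:
#     exp.add(train_transformer, **params, executor=slurm_exec)
```

`itertools.product` yields combinations in row-major order, so sweep names derived from the parameters stay stable and reproducible across runs.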
+ +--- + +:::{note} +**For AI Researchers**: NeMo Run's architecture is specifically designed to support the complex workflows of AI research, providing both the flexibility needed for experimentation and the rigor required for reproducible science. The system's modular design allows researchers to focus on their core research while leveraging advanced infrastructure capabilities. +::: diff --git a/docs/about/release-notes/index.md b/docs/about/release-notes/index.md deleted file mode 100644 index 349a0cc9..00000000 --- a/docs/about/release-notes/index.md +++ /dev/null @@ -1,2 +0,0 @@ -(about-release-notes)= -# Release Notes \ No newline at end of file diff --git a/docs/about/why-nemo-run.md b/docs/about/why-nemo-run.md new file mode 100644 index 00000000..6ec37bd5 --- /dev/null +++ b/docs/about/why-nemo-run.md @@ -0,0 +1,150 @@ +--- +description: "Discover why NeMo Run is the preferred choice for ML experiment management, featuring configuration flexibility, execution modularity, and comprehensive experiment tracking." +tags: ["benefits", "advantages", "features", "ml", "experiment-management", "why-choose"] +categories: ["about"] +--- + +(why-nemo-run)= + +# Why Choose NeMo Run? + +NeMo Run is designed to solve the most critical challenges in machine learning experiment management. Here's why researchers, ML engineers, and data scientists choose NeMo Run for their workflows. 
+ +## Key Benefits + +### 🔧 **Configuration Flexibility** + +NeMo Run's Python-based configuration system provides unprecedented flexibility and type safety: + +- **Type-Safe Configurations**: Automatic validation using Python's type annotations prevents configuration errors +- **Nested Configuration Support**: Intuitive dot notation for complex parameter hierarchies +- **Fiddle Integration**: Built on Google's Fiddle framework for robust configuration management +- **YAML Interoperability**: Support for external configuration files with seamless Python integration +- **Dynamic Configuration**: Runtime configuration updates and overrides without code changes + +### 🚀 **Execution Modularity** + +The framework's execution system enables true environment independence: + +- **Executor Abstraction**: Mix and match tasks with different execution environments +- **Multi-Platform Support**: Local, Docker, Slurm, Kubernetes, and cloud platforms +- **Code Packaging**: Intelligent packaging strategies (Git archive, pattern-based, hybrid) +- **Launcher Integration**: Support for torchrun, fault tolerance, and custom launchers +- **Resource Management**: Automatic resource allocation and cleanup + +### 📊 **Experiment Management** + +Comprehensive experiment tracking and management capabilities: + +- **Metadata Preservation**: Automatic capture of configurations, logs, and artifacts +- **Reproducibility**: One-command experiment reconstruction from metadata +- **Status Monitoring**: Real-time experiment status and log access +- **Dependency Management**: Complex workflow orchestration with task dependencies +- **Artifact Management**: Comprehensive artifact collection and storage + +## Use Cases + +### **ML Research & Development** + +NeMo Run excels in research environments where experimentation and reproducibility are crucial: + +- **Hyperparameter Tuning**: Easy configuration management for large parameter sweeps +- **A/B Testing**: Compare different model configurations and 
architectures +- **Reproducible Research**: Ensure experiments can be exactly reproduced +- **Collaborative Research**: Share configurations and results across teams + +### **Production ML Pipelines** + +For ML engineers building production systems: + +- **Configuration Management**: Version-controlled, type-safe configurations +- **Environment Consistency**: Same code runs across development, staging, and production +- **Scalability**: Scale from local development to distributed clusters +- **Monitoring**: Built-in experiment tracking and monitoring + +### **DevOps & Infrastructure** + +For teams managing ML infrastructure: + +- **Multi-Environment Support**: Seamless transitions between environments +- **Resource Optimization**: Intelligent resource allocation and cleanup +- **Integration**: Works with existing CI/CD pipelines and infrastructure +- **Cost Management**: Efficient resource utilization across platforms + +## Competitive Advantages + +### **vs. Traditional Scripts** + +| Traditional Approach | NeMo Run | +|---------------------|----------| +| Hard-coded parameters | Type-safe, versioned configurations | +| Environment-specific code | Environment-agnostic execution | +| Manual experiment tracking | Automatic metadata capture | +| Difficult reproducibility | One-command reproduction | +| Limited scalability | Built-in scaling capabilities | + +### **vs. 
Other ML Frameworks** + +**Configuration Management** +- **NeMo Run**: Python-based with type safety and validation +- **Others**: Often YAML/JSON with limited validation + +**Execution Flexibility** +- **NeMo Run**: Multiple backends with unified API +- **Others**: Usually tied to specific execution environments + +**Experiment Tracking** +- **NeMo Run**: Built-in tracking with full reproducibility +- **Others**: Often requires external tracking systems + +## Technical Advantages + +### **Architecture Benefits** + +- **Separation of Concerns**: Clean separation between configuration, execution, and management +- **Extensibility**: Plugin architecture for custom functionality +- **Type Safety**: Leverages Python's type system for validation +- **IDE Support**: Full autocomplete and type checking support + +### **Performance Benefits** + +- **Efficient Packaging**: Intelligent code packaging strategies +- **Resource Optimization**: Automatic resource allocation and cleanup +- **Parallel Execution**: Support for concurrent task execution +- **Caching**: Built-in caching for improved performance + +### **Developer Experience** + +- **Rich CLI**: Type-safe command-line interface with autocomplete +- **Visualization**: Built-in configuration visualization with graphviz +- **Debugging**: Comprehensive logging and debugging capabilities +- **Documentation**: Automatic documentation generation from configurations + +## Real-World Impact + +### **Research Productivity** + +- **Faster Experimentation**: Reduced time from idea to results +- **Better Collaboration**: Shared configurations and reproducible results +- **Reduced Errors**: Type safety and validation prevent configuration mistakes +- **Improved Insights**: Better tracking and analysis of experiments + +### **Operational Efficiency** + +- **Reduced Infrastructure Overhead**: Unified management across environments +- **Lower Costs**: Efficient resource utilization and automatic cleanup +- **Faster Deployment**: 
Streamlined deployment processes +- **Better Monitoring**: Comprehensive experiment tracking and status monitoring + +### **Team Collaboration** + +- **Shared Standards**: Consistent configuration and execution patterns +- **Knowledge Transfer**: Easy sharing of experiments and configurations +- **Code Reuse**: Reusable configuration components and patterns +- **Documentation**: Automatic documentation from configurations + +## Getting Started + +Ready to experience the benefits of NeMo Run? Start with our [installation guide](../get-started/install) and [tutorials](../get-started/tutorials) to see how NeMo Run can transform your ML workflows. + +For more detailed information about specific features, explore our [Configuration](../guides/configuration), [Execution](../guides/execution), and [Management](../guides/management) guides. diff --git a/docs/admin/cicd/gitlab-runner.md b/docs/admin/cicd/gitlab-runner.md deleted file mode 100644 index 74ae48c2..00000000 --- a/docs/admin/cicd/gitlab-runner.md +++ /dev/null @@ -1,36 +0,0 @@ -(admin-cicd-gitlab-runner)= -# GitLab Runner - -1. Click on your project and select Settings. - -2. Navigate to Settings and click on CI/CD inside this click on Expand of Runners section - -3. Click New project runner. - -4. Assign tags to control which tagged jobs will run on this runner. - - tags: pages - -5. Enter pdx-tme-002.nvidia.com in the Runner description. - -6. Leave everything else blank, click Create runner. - -7. Copy the command provided on the GitLab page. - -8. SSH to nvidia@pdx-tme-002.nvidia.com. - -9. Slack Andrew Schilling for SSH password. - -10. Run that command from the step above with sudo. - -11. Use the default GitLab instance URL and name for runner (Hit enter twice). - -12. Select Docker for executor. - -13. Select docker:23.0.6 for the default Docker image. - -14. Confirm the new runner is assigned under Assigned project runners. - -15. Deselect Instance runners. - -16. 
Run a test build with the updated runner. \ No newline at end of file diff --git a/docs/admin/cicd/index.md b/docs/admin/cicd/index.md deleted file mode 100644 index 502584bd..00000000 --- a/docs/admin/cicd/index.md +++ /dev/null @@ -1,10 +0,0 @@ -(admin-cicd)= -# CI/CD - -```{toctree} -:maxdepth: 4 -:titlesonly: -:hidden: - -gitlab-runner -``` diff --git a/docs/admin/deployment/index.md b/docs/admin/deployment/index.md deleted file mode 100644 index 0662f580..00000000 --- a/docs/admin/deployment/index.md +++ /dev/null @@ -1,11 +0,0 @@ -(admin-deployment)= -# Deployment Options - - -```{toctree} -:maxdepth: 4 -:titlesonly: -:hidden: - -Requirements -``` diff --git a/docs/admin/deployment/requirements.md b/docs/admin/deployment/requirements.md deleted file mode 100644 index 7c5aca09..00000000 --- a/docs/admin/deployment/requirements.md +++ /dev/null @@ -1,30 +0,0 @@ -(admin-deployment-requirements)= -# Production Deployment Requirements - -This page details the comprehensive system, hardware, and software requirements for deploying NeMo Curator in production environments. - -## System Requirements - -- one -- two -- three - -## Hardware Requirements - -### CPU Requirements - -- one -- two -- three - -### GPU Requirements (Optional but Recommended) - -- one -- two -- three - -## Software Dependencies - -- one -- two -- three \ No newline at end of file diff --git a/docs/admin/index.md b/docs/admin/index.md deleted file mode 100644 index a100274c..00000000 --- a/docs/admin/index.md +++ /dev/null @@ -1,44 +0,0 @@ ---- -description: "Configure deployment options and integrate with external systems using our comprehensive administration guides." -tags: ["admin", "deployment", "integration", "configuration"] -categories: ["administration"] ---- - -(admin-overview)= -# About Admin - -Intro text. 
- ---- - -## Deployment Options - - - -## Integration Options - - \ No newline at end of file diff --git a/docs/admin/integrations/index.md b/docs/admin/integrations/index.md deleted file mode 100644 index aa62bd50..00000000 --- a/docs/admin/integrations/index.md +++ /dev/null @@ -1,47 +0,0 @@ -(admin-integrations)= -# Integrations - -Use the following Admin guides to set up integrations for NeMo Curator in a production environment. - -## Prerequisites - -- TBD - ---- - -## Integration Options - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`database;1.5em;sd-mr-1` Spark -:link: admin-integrations-spark -:link-type: ref -Integrate NeMo Curator with Apache Spark for distributed processing -+++ -{bdg-secondary}`batch-processing` -{bdg-secondary}`performance` -{bdg-secondary}`optimization` -::: - -:::{grid-item-card} {octicon}`search;1.5em;sd-mr-1` Pinecone -:link: admin-integrations-pinecone -:link-type: ref -Enable semantic search for your documentation using Pinecone's hosted embeddings -+++ -{bdg-secondary}`semantic-search` -{bdg-secondary}`embeddings` -{bdg-secondary}`documentation` -::: - -:::: - -```{toctree} -:maxdepth: 4 -:titlesonly: -:hidden: - -Spark -Pinecone - -``` diff --git a/docs/admin/integrations/pinecone.md b/docs/admin/integrations/pinecone.md deleted file mode 100644 index 7e9fca46..00000000 --- a/docs/admin/integrations/pinecone.md +++ /dev/null @@ -1,511 +0,0 @@ -(admin-integrations-pinecone)= -# Pinecone Search Integration - -Upload your Sphinx documentation to Pinecone for semantic search capabilities using hosted embeddings. - -## Background - -This integration uses **Pinecone's hosted embeddings** with the `llama-text-embed-v2` model to automatically generate high-quality 1024-dimensional embeddings from your documentation content. The system processes your documentation's `index.json` file (generated during the build process) and uploads structured content to Pinecone for semantic search. 
- -## Prerequisites - -Before setting up the Pinecone integration, ensure you have: - -1. **Pinecone Account**: Active account at [pinecone.io](https://pinecone.io) -2. **Pinecone Index**: Configured with the following specifications: - - **Dimensions**: 1024 - - **Metric**: cosine - - **Model**: llama-text-embed-v2 (hosted) -3. **Environment Variable**: `PINECONE_API_KEY` set in your environment -4. **Documentation Build**: Your Sphinx documentation built with the index extension - -## Quick Start - -### 1. Test Your Setup - -First, validate your Pinecone connection and configuration: - -```bash -python scripts/test_pinecone_setup.py -``` - -This script verifies: -- Pinecone API connection -- Index configuration (dimensions, hosted embeddings) -- Documentation index file availability -- Overall setup readiness - -### 2. Preview Upload (Dry Run) - -Before uploading, preview what will be sent to Pinecone: - -```bash -python scripts/send_to_pinecone_simple.py --dry-run --namespace docs-content -``` - -### 3. Upload Documentation - -Upload your documentation to Pinecone: - -```bash -python scripts/send_to_pinecone_simple.py --namespace docs-content -``` - -### 4. Test Search Functionality - -Once uploaded, test semantic search: - -```bash -python scripts/query_pinecone_example.py -``` - -### 5. Using Make Commands - -For streamlined workflows, use the provided Make targets: - -```bash -# Test Pinecone connection -make docs-pinecone-test - -# Build documentation and upload to Pinecone -make docs-pinecone-update PINECONE_ARGS="--namespace docs-content" - -# Upload only (without rebuilding docs) -make docs-pinecone-upload PINECONE_ARGS="--namespace docs-content" - -# Preview mode -make docs-pinecone-upload-dry PINECONE_ARGS="--namespace docs-content" -``` - -## How It Works - -### Document Processing Pipeline - -1. **Documentation Build**: Sphinx generates comprehensive `index.json` with content and metadata -2. 
**Content Extraction**: Script processes the JSON structure and extracts text content -3. **Pinecone Upload**: Documents sent to Pinecone with metadata using `upsert_records()` -4. **Hosted Embeddings**: Pinecone automatically generates 1024-dimensional embeddings using `llama-text-embed-v2` -5. **Storage**: Vectors stored in your specified namespace for semantic search - -### Key Features - -- **Hosted Embeddings**: No local model downloads or GPU requirements. Pinecone handles embedding generation automatically. -- **Optimal Dimensions**: Perfect 1024-dimensional vectors match your index configuration without compatibility issues. -- **Fast Processing**: Batch uploads and efficient processing handle large documentation sets quickly. -- **Namespace Organization**: Documents organized by namespace for logical separation and management. - -## Searching Your Documentation - -### Basic Search Example - -```python -import os -from pinecone import Pinecone - -# Initialize Pinecone -pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY')) -index = pc.Index("docs-site-demo-starter-kit") - -# Search using hosted embeddings -results = index.search( - namespace="docs-content", - query={ - "inputs": {"text": "How do I integrate with Spark?"}, - "top_k": 5 - } -) - -# Process results -for hit in results.result.hits: - print(f"Score: {hit['_score']:.4f}") - print(f"Title: {hit['fields']['title']}") - print(f"URL: {hit['fields']['url']}") - print(f"Summary: {hit['fields'].get('summary', '')[:200]}...") - print() -``` - -### Search Response Format - -The hosted embeddings API returns results in this format: - -```python -{ - 'result': { - 'hits': [ - { - '_id': 'document-identifier', - '_score': 0.8234, # Relevance score (0-1) - 'fields': { - 'title': 'Document Title', - 'url': 'path/to/document.html', - 'format': 'text', - 'content': 'Full document content...', - 'summary': 'Document summary...', - 'headings': 'Section headings...' 
- } - } - ] - }, - 'usage': { - 'embed_total_tokens': 5, - 'read_units': 6 - } -} -``` - -### Advanced Search Function - -```python -def search_docs(query: str, top_k: int = 5, namespace: str = "docs-content"): - """ - Search documentation with error handling and clean output. - """ - pc = Pinecone(api_key=os.getenv('PINECONE_API_KEY')) - index = pc.Index("docs-site-demo-starter-kit") - - try: - results = index.search( - namespace=namespace, - query={ - "inputs": {"text": query}, - "top_k": top_k - } - ) - - # Extract and format results - search_results = [] - for hit in results.result.hits: - fields = hit.get('fields', {}) - search_results.append({ - 'score': hit.get('_score', 0), - 'title': fields.get('title', 'N/A'), - 'url': fields.get('url', 'N/A'), - 'summary': fields.get('summary', '')[:200], - 'content_preview': fields.get('content', '')[:150] - }) - - return search_results - - except Exception as e: - print(f"Search error: {e}") - return [] - -# Usage -results = search_docs("Apache Spark integration", top_k=3) -for result in results: - print(f"📄 {result['title']} (Score: {result['score']:.4f})") - print(f" {result['summary']}") - print() -``` - -## Configuration - -### Pinecone Index Setup - -Ensure your Pinecone index meets these requirements: - -| Setting | Value | -|---------|-------| -| Dimensions | 1024 (matches llama-text-embed-v2) | -| Metric | cosine | -| Model | llama-text-embed-v2 (hosted) | -| Type | Serverless (recommended) | - -### Environment Configuration - -Set your Pinecone API key as an environment variable: - -```bash -export PINECONE_API_KEY="your-api-key-here" -``` - -For persistent configuration, add to your shell profile (`.bashrc`, `.zshrc`, etc.): - -```bash -echo 'export PINECONE_API_KEY="your-api-key-here"' >> ~/.zshrc -source ~/.zshrc -``` - -### Document Metadata Structure - -Each document uploaded to Pinecone includes structured metadata accessible in search results: - -```json -{ - "title": "Document Title", - "url": 
"path/to/document.html", - "format": "text", - "content": "Full document text content...", - "summary": "Brief document summary...", - "headings": "Section | Subsection | Topic", - "last_modified": "2024-01-15", - "author": "Author Name", - "tags": "tag1, tag2", - "categories": "category", - "description": "Document description" -} -``` - -## Available Scripts - -| Script | Purpose | Usage | -|--------|---------|-------| -| `test_pinecone_setup.py` | Validate connection and configuration | `python scripts/test_pinecone_setup.py` | -| `send_to_pinecone_simple.py` | Upload documentation to Pinecone | `python scripts/send_to_pinecone_simple.py --namespace docs-content` | -| `query_pinecone_example.py` | Test search functionality | `python scripts/query_pinecone_example.py` | - -## Command Reference - -### Test Script Options - -```bash -python scripts/test_pinecone_setup.py [OPTIONS] -``` - -```{list-table} Test Script Options -:widths: 25 25 50 -:header-rows: 1 - -* - Option - - Default - - Description -* - `--index-name` - - `docs-site-demo-starter-kit` - - Pinecone index name -* - `--index-file` - - `docs/_build/html/index.json` - - Path to documentation index file -``` - -### Upload Script Options - -```bash -python scripts/send_to_pinecone_simple.py [OPTIONS] -``` - -```{list-table} Upload Script Options -:widths: 25 25 50 -:header-rows: 1 - -* - Option - - Default - - Description -* - `--index-file` - - `docs/_build/html/index.json` - - Path to documentation index file -* - `--index-name` - - `docs-site-demo-starter-kit` - - Pinecone index name -* - `--namespace` - - Required - - Pinecone namespace for documents -* - `--dry-run` - - `false` - - Preview without uploading -* - `--batch-size` - - `50` - - Documents per batch upload -``` - -### Make Targets - -```{list-table} Available Make Targets -:widths: 35 65 -:header-rows: 1 - -* - Target - - Description -* - `docs-pinecone-test` - - Test Pinecone connection and configuration -* - `docs-pinecone-update` - - 
Build documentation and upload to Pinecone -* - `docs-pinecone-upload` - - Upload to Pinecone (no documentation build) -* - `docs-pinecone-upload-dry` - - Preview upload without sending to Pinecone -``` - -## Troubleshooting - -### Common Issues - -**Index Not Found** - -Error: `Index 'your-index-name' not found` - -Solutions: -- Verify index name in script matches Pinecone console -- Check `PINECONE_API_KEY` environment variable -- Confirm index exists in your Pinecone project - -**Dimension Mismatch** - -Error: `Dimension mismatch: expected 1024, got XXX` - -Solutions: -- Ensure index configured for 1024 dimensions -- Verify hosted embeddings (llama-text-embed-v2) enabled -- Recreate index with correct dimensions if needed - -**API Key Invalid** - -Error: `Authentication failed` or `Invalid API key` - -Solutions: -- Verify `PINECONE_API_KEY` environment variable set -- Check API key in Pinecone console -- Ensure key has appropriate permissions - -**Index File Missing** - -Error: `Index file not found: docs/_build/html/index.json` - -Solutions: -- Build documentation first: `make docs-html` -- Verify index extension enabled in Sphinx configuration -- Check file path and permissions - -**No Search Results** - -Error: Search returns empty results despite having uploaded documents - -Solutions: -- Verify correct namespace in search query -- Check that documents were uploaded successfully -- Try broader search terms -- Confirm namespace exists: `python scripts/test_pinecone_setup.py` - -### Debug Commands - -For detailed troubleshooting, use these diagnostic commands: - -```bash -# Full connection and configuration test -python scripts/test_pinecone_setup.py --index-name your-index-name - -# Preview upload with custom settings -python scripts/send_to_pinecone_simple.py --dry-run --namespace test --batch-size 10 - -# Test search functionality -python scripts/query_pinecone_example.py - -# Check environment variables -echo $PINECONE_API_KEY - -# Verify index file exists 
-ls -la docs/_build/html/index.json -``` - -## Performance and Monitoring - -### Upload Performance - -Typical performance metrics for the integration: - -- **Upload Speed**: 30-50 documents per second -- **Batch Processing**: Efficient handling of large documentation sets -- **No Local Compute**: All embedding generation happens on Pinecone servers -- **Automatic Retries**: Built-in error handling and retry logic - -### Search Performance - -- **Query Speed**: Sub-second response times for most queries -- **Hosted Embeddings**: No local embedding computation required -- **Scalable**: Handles concurrent searches efficiently -- **Usage Tracking**: Monitor embedding tokens and read units - -### Success Indicators - -When the integration works correctly, you should see: - -- ✅ Connection test passes without errors -- ✅ Index dimensions confirmed as 1024 -- ✅ Documents upload with zero failures -- ✅ Namespace appears in Pinecone console -- ✅ Vector count increases in index statistics -- ✅ Search queries return relevant results with good scores (>0.2) - -### Monitoring Uploads - -The upload script provides detailed progress information: - -``` -🔗 Testing Pinecone connection... -✅ Connected to index: docs-site-demo-starter-kit -📊 Index stats: - - Total vectors: 37 - - Dimension: 1024 - - Namespaces: - - docs-content: 37 vectors - -🚀 Starting upload to Pinecone... -📄 Found 37 documents to upload -⏳ Processing batch 1/1 (37 documents) -✅ Successfully uploaded 37 documents (0 failures) -🎉 Upload completed successfully! 
-``` - -## Integration with Documentation Workflow - -### CI/CD Integration - -Add Pinecone upload to your documentation deployment pipeline: - -```yaml -# Example GitHub Actions workflow -- name: Build Documentation - run: make docs-html - -- name: Upload to Pinecone - run: make docs-pinecone-upload PINECONE_ARGS="--namespace docs-content" - env: - PINECONE_API_KEY: ${{ secrets.PINECONE_API_KEY }} -``` - -### Website Integration - -Add search to your documentation website: - -```html - -
-<!-- Illustrative sketch: assumes a hypothetical server-side /api/search endpoint
-     that proxies Pinecone, so PINECONE_API_KEY never reaches the browser. -->
-<input id="docs-search" type="search" placeholder="Search the docs..." />
-<ul id="docs-search-results"></ul>
-<script>
-  const box = document.getElementById("docs-search");
-  box.addEventListener("change", async () => {
-    const hits = await (await fetch("/api/search?q=" + encodeURIComponent(box.value))).json();
-    const list = document.getElementById("docs-search-results");
-    list.innerHTML = "";
-    for (const hit of hits) {
-      const item = document.createElement("li");
-      item.textContent = `${hit.title}: ${hit.url}`;
-      list.appendChild(item);
-    }
-  });
-</script>
- - -``` - -### Automated Updates - -For regularly updated documentation, set up automated uploads: - -```bash -# Example cron job - update search index every 6 hours -0 */6 * * * cd /path/to/docs && make docs-pinecone-update PINECONE_ARGS="--namespace docs-content" -``` - -This integration provides a seamless way to enable semantic search capabilities for your documentation, leveraging Pinecone's advanced embedding models and vector search infrastructure. \ No newline at end of file diff --git a/docs/admin/integrations/spark.md b/docs/admin/integrations/spark.md deleted file mode 100644 index ac9719cc..00000000 --- a/docs/admin/integrations/spark.md +++ /dev/null @@ -1,85 +0,0 @@ -(admin-integrations-spark)= -# Reading and Writing Datasets with NeMo Curator and Apache Spark - -## Background - -NeMo Curator uses the `DocumentDataset` class to read and write JSONL and Parquet files. It's a wrapper around a [Dask (or Dask-cuDF) DataFrame](https://docs.dask.org/en/stable/dataframe.html). Apache Spark can read and write JSONL and Parquet files generated by NeMo Curator, and similarly, NeMo Curator can work with the outputs generated by Spark. - -## Usage - -To demonstrate how this would work, consider the following example: - -```python -import dask.dataframe as dd -import pandas as pd -from nemo_curator.datasets import DocumentDataset - -# Create sample data -data = { - "id": [1, 2, 3], - "text": [ - "This is a tiny story.", - "Another tiny story appears here.", - "Yet another tiny story for you." 
- ] -} - -# Convert to a pandas DataFrame first -df = pd.DataFrame(data) - -# Convert pandas DataFrame to DocumentDataset -stories_ds = DocumentDataset(dd.from_pandas(df, npartitions=2)) - -# Write the dataset to JSONL files -stories_ds.to_json("tiny_stories/", write_to_filename=False) -``` - -This will create two JSONL files in the directory `tiny_stories/`: - -``` -tiny_stories/ - 0.part - 1.part -``` - -Apache Spark can read these files using standard application programming interfaces (APIs). Let's first create a Spark session called `NeMoCuratorExample`, then we can read files in the directory using: - -```python -from pyspark.sql import SparkSession - -spark = SparkSession.builder.appName("NeMoCuratorExample").getOrCreate() - -# Reading JSONL file -stories_df = spark.read.json("tiny_stories") -stories_df.show() -``` - -Let's go ahead and add a couple of columns to the Spark DataFrame: - -```python -from pyspark.sql.functions import size, split, length - -# Calculate Word Count -stories_df = stories_df.withColumn("WordCount", size(split(stories_df["text"], r"\s+"))) - -# Calculate Character Count -stories_df = stories_df.withColumn("CharacterCount", length(stories_df["text"])) - -stories_df.write.mode("overwrite").parquet("tiny_stories_transformed") -``` - -To connect between NeMo Curator `DocumentDataset` and Spark DataFrames, we recommend using Parquet files for data exchange. 
The following code snippet demonstrates how to read output from a Spark DataFrame into a NeMo Curator `DocumentDataset`: - -```python -from nemo_curator.utils.file_utils import get_all_files_paths_under - -# Ignores checksum and marker files created by Spark -processed_files = [ - filename for filename in get_all_files_paths_under("tiny_stories_transformed") - if not (filename.endswith(".crc") or filename.endswith("_SUCCESS")) -] - -stories_dataset = DocumentDataset.read_parquet(processed_files, backend="pandas") -``` - -It's worth noting that Spark typically tends to create checksum and other marker files which can vary by Spark distribution, so it's advisable to ignore them when reading data into a NeMo Curator `DocumentDataset`. \ No newline at end of file diff --git a/docs/conf.py b/docs/conf.py index 4d14b4ac..68e0bd15 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -26,7 +26,7 @@ # Add custom extensions directory to Python path sys.path.insert(0, os.path.abspath('_extensions')) -project = "NVIDIA-Docs-Template" +project = "NeMo Run" copyright = "2025, NVIDIA Corporation" author = "NVIDIA Corporation" release = "0.0.1" @@ -34,6 +34,9 @@ # -- General configuration --------------------------------------------------- # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration +# Set the master document to our markdown index file +master_doc = "nemo-run-index" + extensions = [ "myst_parser", # For our markdown docs # "autodoc2" - Added conditionally below based on package availability @@ -43,7 +46,7 @@ "sphinx_copybutton", # For copy button in code blocks, "sphinx_design", # For grid layout "sphinx.ext.ifconfig", # For conditional content - "content_gating", # Unified content gating extension + "content_gating", # Unified content gating extension "myst_codeblock_substitutions", # Our custom MyST substitutions in code blocks "json_output", # Generate JSON output for each page "search_assets", # Enhanced search assets extension @@ -54,8 
+57,8 @@ templates_path = ["_templates"] exclude_patterns = [ - "_build", - "Thumbs.db", + "_build", + "Thumbs.db", ".DS_Store", "_extensions/*/README.md", # Exclude README files in extension directories "_extensions/README.md", # Exclude main extensions README @@ -111,8 +114,8 @@ # MyST substitutions for reusable variables across documentation myst_substitutions = { - "product_name": "NVIDIA Docs Starter Kit", - "product_name_short": "Docs Starter Kit", + "product_name": "NVIDIA NeMo Run", + "product_name_short": "NeMo Run", "company": "NVIDIA", "version": release, "current_year": "2025", @@ -164,7 +167,7 @@ if autodoc2_packages: if "autodoc2" not in extensions: extensions.append("autodoc2") - + autodoc2_render_plugin = "myst" # Use MyST for rendering docstrings autodoc2_output_dir = "apidocs" # Output directory for autodoc2 (relative to docs/) # This is a workaround that uses the parser located in autodoc2_docstrings_parser.py to allow autodoc2 to @@ -215,10 +218,10 @@ }, } -# Add our static files directory +# Add our static files directory # html_static_path = ["_static"] html_extra_path = ["project.json", "versions1.json"] -# Note: JSON output configuration has been moved to the consolidated -# json_output_settings dictionary above for better organization and new features! \ No newline at end of file +# Note: JSON output configuration has been moved to the consolidated +# json_output_settings dictionary above for better organization and new features! diff --git a/docs/feature-set-a/category-a/advanced-patterns.md b/docs/feature-set-a/category-a/advanced-patterns.md deleted file mode 100644 index c4b247a0..00000000 --- a/docs/feature-set-a/category-a/advanced-patterns.md +++ /dev/null @@ -1,525 +0,0 @@ -(feature-set-a-advanced-patterns)= -# Advanced MyST Patterns - -This document showcases sophisticated MyST markdown patterns and features for creating rich, professional documentation. 
- -(feature-set-a-advanced-patterns-figures)= -## Figures & Media - -Demonstrate advanced figure handling, captions, and cross-references. - -### Basic Figure with Caption - -```{figure} https://placehold.co/600x400/png -:alt: System architecture showing microservices communication -:width: 600px -:align: center -:name: fig-architecture - -System Architecture Overview - This diagram illustrates the communication flow between microservices in our distributed system. -``` - -As shown in {numref}`fig-architecture`, the system uses an event-driven architecture. - -### Responsive Figure Grid - -::::{grid} 1 1 2 2 -:gutter: 2 - -:::{grid-item} -```{figure} https://placehold.co/300x400/28a745/ffffff?text=Mobile+UI -:alt: Mobile dashboard interface -:width: 100% -:name: fig-mobile - -Mobile Interface - Optimized for touch interactions -``` -::: - -:::{grid-item} -```{figure} https://placehold.co/400x300/007bff/ffffff?text=Desktop+UI -:alt: Desktop dashboard interface -:width: 100% -:name: fig-desktop - -Desktop Interface - Full-featured admin panel -``` -::: -:::: - -Compare the mobile interface ({numref}`fig-mobile`) with the desktop version ({numref}`fig-desktop`) to see responsive design principles in action. - -(feature-set-a-advanced-patterns-math)= -## Mathematical Expressions - -Showcase mathematical notation using MyST's math support. - -### Inline Mathematics - -The algorithm achieves {math}`O(n \log n)` time complexity, where {math}`n` represents the input size. - -### Block Mathematics - -```{math} -:label: eq-performance - -\text{Response Time} = \frac{\text{Queue Length}}{\text{Service Rate}} + \text{Processing Time} -``` - -The performance equation {eq}`eq-performance` helps predict system behavior under load. 
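As a quick sanity check of the equation above, the relationship can be evaluated directly (the numbers below are purely illustrative assumptions, not measurements):

```python
# Worked example of: Response Time = Queue Length / Service Rate + Processing Time
# Illustrative inputs: 20 queued requests, 50 requests/s service rate,
# 0.1 s of per-request processing time.

def response_time(queue_length: float, service_rate: float, processing_time: float) -> float:
    """Predicted response time from the queueing relationship above."""
    return queue_length / service_rate + processing_time

print(f"{response_time(20, 50, 0.1):.2f} s")  # about 0.5 s
```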
- -### Complex Mathematical Notation - -```{math} -:label: eq-machine-learning - -J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 -``` - -The cost function {eq}`eq-machine-learning` includes L2 regularization to prevent overfitting. - -### Mathematical Derivation Steps - -::::{dropdown} Mathematical Proof -:icon: book -:color: info - -**Theorem:** The gradient descent update rule minimizes the cost function. - -**Proof:** - -Starting with the cost function: -```{math} -J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2 -``` - -Taking the partial derivative with respect to {math}`\theta_j`: -```{math} -\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} -``` - -The gradient descent update becomes: -```{math} -\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) -``` - -Therefore: -```{math} -\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} -``` - -This update rule moves {math}`\theta_j` in the direction of steepest descent. □ -:::: - -(feature-set-a-advanced-patterns-glossary)= -## Glossary & Definitions - -Create searchable glossaries and definition lists. - -```{glossary} -API - Application Programming Interface - a set of protocols and tools for building software applications. - -Microservice - An architectural approach where a single application is composed of many loosely coupled services. - -JWT - JSON Web Token - a compact, URL-safe means of representing claims between two parties. - -Load Balancer - A device or software that distributes network traffic across multiple servers. - -Container - A lightweight, portable execution environment that includes everything needed to run an application. - -Kubernetes - An open-source container orchestration platform for automating deployment and management. 
-``` - -### Using Glossary Terms - -When building modern applications, you'll often use {term}`API`s to enable communication between {term}`Microservice`s. Authentication is typically handled using {term}`JWT` tokens, while a {term}`Load Balancer` distributes traffic across multiple {term}`Container` instances managed by {term}`Kubernetes`. - -### Definition Lists - -**Authentication Methods** -: Various approaches to verify user identity - -**Authorization Levels** -: Different permission tiers for system access - -**Rate Limiting** -: Controlling the frequency of API requests - -**Circuit Breaker** -: Pattern to prevent cascading failures in distributed systems - -(feature-set-a-advanced-patterns-footnotes)= -## Footnotes & Citations - -Demonstrate scholarly citation patterns. - -### Research Citations - -Modern distributed systems rely on eventual consistency models[^1] to achieve high availability across geographically distributed data centers. The CAP theorem[^2] proves that it's impossible to simultaneously guarantee consistency, availability, and partition tolerance. - -Performance optimization in microservices architectures often involves implementing circuit breaker patterns[^3] and bulkhead isolation[^4] to prevent cascading failures. - -[^1]: Werner Vogels, "Eventually Consistent," Communications of the ACM, vol. 52, no. 1, pp. 40-44, 2009. - -[^2]: Eric Brewer, "CAP Twelve Years Later: How the Rules Have Changed," Computer, vol. 45, no. 2, pp. 23-29, 2012. - -[^3]: Michael Nygard, "Release It! Design and Deploy Production-Ready Software," Pragmatic Bookshelf, 2018. - -[^4]: Netflix Technology Blog, "Fault Tolerance in a High Volume, Distributed System," https://netflixtechblog.com/fault-tolerance-in-a-high-volume-distributed-system-91ab4faae74a - -### Inline Citations - -According to the latest performance benchmarks[^benchmark], our optimization improvements resulted in a 300% increase in throughput. 
- -[^benchmark]: Internal Performance Report Q3 2024, Engineering Team Analysis - -(feature-set-a-advanced-patterns-code-execution)= -## Executable Code Blocks - -Show code examples with execution results and interactive elements. - -### Python Code with Output - -```{code-block} python -:linenos: -:emphasize-lines: 3,7 - -def fibonacci(n): - """Calculate nth Fibonacci number using dynamic programming.""" - if n <= 1: - return n - - a, b = 0, 1 - for i in range(2, n + 1): - a, b = b, a + b - - return b - -# Example usage -result = fibonacci(10) -print(f"The 10th Fibonacci number is: {result}") -``` - -**Output:** -``` -The 10th Fibonacci number is: 55 -``` - -### Multi-Language Code Comparison - -::::{tab-set} - -:::{tab-item} Python -:sync: lang-compare - -```{code-block} python -:caption: Binary search implementation - -def binary_search(arr, target): - left, right = 0, len(arr) - 1 - - while left <= right: - mid = (left + right) // 2 - if arr[mid] == target: - return mid - elif arr[mid] < target: - left = mid + 1 - else: - right = mid - 1 - - return -1 - -# Time complexity: O(log n) -# Space complexity: O(1) -``` -::: - -:::{tab-item} JavaScript -:sync: lang-compare - -```{code-block} javascript -:caption: Binary search implementation - -function binarySearch(arr, target) { - let left = 0; - let right = arr.length - 1; - - while (left <= right) { - const mid = Math.floor((left + right) / 2); - if (arr[mid] === target) { - return mid; - } else if (arr[mid] < target) { - left = mid + 1; - } else { - right = mid - 1; - } - } - - return -1; -} - -// Time complexity: O(log n) -// Space complexity: O(1) -``` -::: - -:::{tab-item} Go -:sync: lang-compare - -```{code-block} go -:caption: Binary search implementation - -func binarySearch(arr []int, target int) int { - left, right := 0, len(arr)-1 - - for left <= right { - mid := (left + right) / 2 - if arr[mid] == target { - return mid - } else if arr[mid] < target { - left = mid + 1 - } else { - right = mid - 1 - } - } 
- - return -1 -} - -// Time complexity: O(log n) -// Space complexity: O(1) -``` -::: -:::: - -(feature-set-a-advanced-patterns-cross-refs)= -## Advanced Cross References - -Demonstrate sophisticated internal linking and reference systems. - -### Section References - -- Architecture overview: {ref}`feature-set-a-advanced-patterns-figures` -- Mathematical foundations: {ref}`feature-set-a-advanced-patterns-math` -- Code implementations: {ref}`feature-set-a-advanced-patterns-code-execution` - -### Document References - -See our related documentation: -- {doc}`topic-a/index` - Core concepts and comparisons -- {doc}`topic-a/subtopic-a` - Tab-based examples and API usage -- {doc}`topic-a/subtopic-b` - Comprehensive MyST pattern showcase - -### External References - -```{seealso} -**Related Documentation:** -- {ref}`section-category-topic` - Topic overview with comparison tables -- {ref}`feature-set-a-tuts-series-a` - Interactive tutorial series -- {ref}`feature-set-a-tutorials-beginner` - Getting started guide - -**External Resources:** -- [MyST Parser Syntax Guide](https://myst-parser.readthedocs.io/en/latest/syntax/syntax.html) -- [Sphinx Design Documentation](https://sphinx-design.readthedocs.io/) -``` - -(feature-set-a-advanced-patterns-directives)= -## Custom Directives & Roles - -Showcase specialized MyST directives for enhanced content presentation. - -### Version Information - -```{versionadded} 2.1 -Support for advanced mathematical notation in code blocks. -``` - -```{versionchanged} 2.0 -The authentication system now supports OAuth 2.0 and JWT tokens. -``` - -```{deprecated} 1.8 -The legacy authentication method will be removed in version 3.0. Use JWT tokens instead. -``` - -### Todo Items - -:::{note} -**Development Task:** Add performance benchmarks for the new caching layer implementation. -::: - -:::{note} -**Documentation Task:** Update the API documentation with new endpoint descriptions. 
-::: - -### Code Documentation - -```{function} calculate_performance_metrics(requests, duration) -Calculate key performance indicators for API endpoints. - -:param requests: List of HTTP request objects -:type requests: List[Request] -:param duration: Time window for analysis in seconds -:type duration: int -:returns: Dictionary containing performance metrics -:rtype: Dict[str, float] - -**Example usage:** - -.. code-block:: python - - metrics = calculate_performance_metrics(request_log, 3600) - print(f"Average response time: {metrics['avg_response_time']}ms") -``` - -### Content Blocks - -```{contents} Page Contents -:local: -:depth: 2 -``` - -### Index Entries - -```{index} single: Performance; Optimization -``` - -```{index} pair: API; Authentication -``` - -Performance optimization {index}`techniques ` are essential for scalable applications. - -(feature-set-a-advanced-patterns-accessibility)= -## Accessibility Features - -Demonstrate inclusive design patterns in documentation. - -### Screen Reader Friendly Tables - -```{list-table} API Endpoint Performance Metrics -:header-rows: 1 -:stub-columns: 1 -:widths: 25 20 20 20 15 -:name: table-performance - -* - Endpoint - - Avg Response (ms) - - 95th Percentile (ms) - - Requests/sec - - Error Rate (%) -* - **GET /users** - - 45 - - 120 - - 1,200 - - 0.1 -* - **POST /users** - - 85 - - 200 - - 800 - - 0.3 -* - **GET /analytics** - - 150 - - 400 - - 500 - - 0.2 -* - **POST /data-export** - - 2,500 - - 5,000 - - 50 - - 1.2 -``` - -Reference {numref}`table-performance` for detailed performance characteristics of each endpoint. - -### Alt Text for Visual Elements - -::::{grid} 1 1 2 2 -:gutter: 2 - -:::{grid-item-card} {octicon}`accessibility;1.5em;sd-mr-1` Accessibility First -:class-header: sd-bg-success sd-text-white - -All visual elements include descriptive alternative text for screen readers. 
-+++ -{bdg-success}`WCAG 2.1 AA` {bdg-secondary}`Screen Reader Tested` -::: - -:::{grid-item-card} {octicon}`device-mobile;1.5em;sd-mr-1` Mobile Optimized -:class-header: sd-bg-primary sd-text-white - -Responsive design ensures content accessibility across all device sizes. -+++ -{bdg-primary}`Touch Friendly` {bdg-secondary}`High Contrast` -::: -:::: - -### Keyboard Navigation - -:::{note} -All interactive elements in this documentation support keyboard navigation: -- **Tab**: Move to next interactive element -- **Shift+Tab**: Move to previous interactive element -- **Enter/Space**: Activate buttons and links -- **Arrow keys**: Navigate within tab sets and dropdowns -::: - -### Color and Contrast - -::::{grid} 1 2 2 2 -:gutter: 2 - -:::{grid-item} -:class: sd-bg-success sd-text-white sd-p-3 sd-text-center - -**Success State** -High contrast ratio: 7.2:1 -::: - -:::{grid-item} -:class: sd-bg-warning sd-text-dark sd-p-3 sd-text-center - -**Warning State** -High contrast ratio: 8.1:1 -::: - -:::{grid-item} -:class: sd-bg-danger sd-text-white sd-p-3 sd-text-center - -**Error State** -High contrast ratio: 12.3:1 -::: - -:::{grid-item} -:class: sd-bg-info sd-text-white sd-p-3 sd-text-center - -**Info State** -High contrast ratio: 6.8:1 -::: -:::: - -All color combinations meet WCAG 2.1 AA contrast requirements (minimum 4.5:1 ratio). 
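The ratios quoted above follow the WCAG 2.1 definitions of relative luminance and contrast ratio; a minimal sketch of that computation (not the site's actual tooling):

```python
# Sketch of the WCAG 2.1 contrast-ratio formula: (L1 + 0.05) / (L2 + 0.05),
# where L1 and L2 are the relative luminances of the lighter and darker colors.

def _linearize(c8: int) -> float:
    """Linearize one 8-bit sRGB channel per the WCAG 2.1 definition."""
    c = c8 / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def contrast_ratio(rgb1: tuple, rgb2: tuple) -> float:
    """Contrast ratio between two sRGB colors, always >= 1.0."""
    def luminance(rgb):
        r, g, b = (_linearize(c) for c in rgb)
        return 0.2126 * r + 0.7152 * g + 0.0722 * b
    l1, l2 = sorted((luminance(rgb1), luminance(rgb2)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

print(round(contrast_ratio((255, 255, 255), (0, 0, 0)), 1))  # 21.0, the maximum
```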
- ---- - -:::{seealso} -**Pattern Summary:** -This document demonstrates advanced MyST markdown patterns including: -- Mathematical notation and equations with cross-references -- Sophisticated figure handling and responsive layouts -- Glossary integration and definition lists -- Scholarly citations and footnotes -- Multi-language code examples with syntax highlighting -- Accessibility-focused design patterns -- Custom directives for version control and documentation - -**Next Steps:** -- {ref}`section-category-topic-subtopic-b` - Additional MyST pattern examples -- {ref}`feature-set-a-tuts-series-a` - Interactive learning experiences -- {doc}`../tutorials/index` - Hands-on tutorial collection -::: diff --git a/docs/feature-set-a/category-a/index.md b/docs/feature-set-a/category-a/index.md deleted file mode 100644 index 64e18d54..00000000 --- a/docs/feature-set-a/category-a/index.md +++ /dev/null @@ -1,73 +0,0 @@ -(feature-set-a-category-a)= -# Category A - - Introductory text for this category. - -(feature-set-a-category-a-how-it-works)= -## How it Works - -High-level overview pertaining to this subsection's contents. - ---- - -(feature-set-a-category-a-topic-a)= -## Topic A - -Short description about the articles found in Topic A. - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Subtopic A -:link: section-category-topic-subtopic-a -:link-type: ref -:link-alt: screen reader description for this link. -description of this article. -+++ -{bdg-secondary}`tag` -{bdg-secondary}`tag` -{bdg-secondary}`tag` -::: - -:::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Subtopic B -:link: section-category-topic-subtopic-b -:link-type: ref -:link-alt: screen reader description for this link. -description of this article. 
-+++ -{bdg-secondary}`tag` -{bdg-secondary}`tag` -{bdg-secondary}`tag` -::: - -:::: - ---- - -(feature-set-a-category-a-advanced-patterns)= -## Advanced MyST Patterns - -Comprehensive showcase of sophisticated MyST markdown techniques for technical writers. - -::::{grid} 1 1 1 1 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`telescope;1.5em;sd-mr-1` Advanced Patterns Reference -:link: feature-set-a-advanced-patterns -:link-type: ref -:link-alt: Comprehensive MyST markdown patterns and techniques reference -Complete guide to sophisticated MyST features including mathematical notation, figures, glossaries, and accessibility patterns. -+++ -{bdg-info}`Advanced` {bdg-secondary}`MyST` {bdg-secondary}`Reference` -::: - -:::: - -```{toctree} -:maxdepth: 4 -:titlesonly: -:hidden: - -Topic A -Advanced MyST Patterns -``` diff --git a/docs/feature-set-a/category-a/topic-a/index.md b/docs/feature-set-a/category-a/topic-a/index.md deleted file mode 100644 index 16b6f6b6..00000000 --- a/docs/feature-set-a/category-a/topic-a/index.md +++ /dev/null @@ -1,77 +0,0 @@ -(section-category-topic)= -# Topic A - -Short active-voice introduction for this subsection. - -(section-category-topic-how-it-works)= -## How it Works - -High-level overview pertaining to this subsection's contents. 
- - -### Comparison - -```{list-table} Subtopic A vs Subtopic B Comparison -:header-rows: 1 -:widths: 30 35 35 - -* - Feature - - Subtopic A - - Subtopic B -* - **Primary Use Case** - - Real-time data processing and filtering - - Batch data analysis and transformation -* - **Performance** - - Optimized for low latency operations - - Optimized for high throughput processing -* - **Resource Requirements** - - Moderate CPU, high memory usage - - High CPU, moderate memory usage -* - **Scalability** - - Horizontal scaling with auto-balancing - - Vertical scaling with manual configuration -* - **Best For** - - Streaming data pipelines - - Large dataset processing workflows -* - **Integration Complexity** - - Simple API integration - - Requires additional setup and configuration -``` - -## Options - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Subtopic A -:link: section-category-topic-subtopic-a -:link-type: ref -:link-alt: screen reader description for this link. -description of this article. -+++ -{bdg-secondary}`tag` -{bdg-secondary}`tag` -{bdg-secondary}`tag` -::: - -:::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Subtopic B -:link: section-category-topic-subtopic-b -:link-type: ref -:link-alt: screen reader description for this link. -description of this article. 
-+++ -{bdg-secondary}`tag` -{bdg-secondary}`tag` -{bdg-secondary}`tag` -::: - -:::: - -```{toctree} -:maxdepth: 4 -:titlesonly: -:hidden: - -Subtopic A -Subtopic B -``` \ No newline at end of file diff --git a/docs/feature-set-a/category-a/topic-a/subtopic-a.md b/docs/feature-set-a/category-a/topic-a/subtopic-a.md deleted file mode 100644 index fbef972c..00000000 --- a/docs/feature-set-a/category-a/topic-a/subtopic-a.md +++ /dev/null @@ -1,178 +0,0 @@ -(section-category-topic-subtopic-a)= -# Subtopic A - - -## Tabs - -::::{tab-set} - -:::{tab-item} cURL -:sync: s-curl - -```bash -# Basic API request example -curl -X POST \ - https://api.example.com/v1/process \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer YOUR_API_KEY" \ - -d '{ - "batch_size": 1000, - "timeout": "30s", - "enable_logging": true, - "output_format": "json", - "data": [ - {"id": 1, "content": "Sample data"}, - {"id": 2, "content": "More sample data"} - ] - }' -``` - -::: - - -:::{tab-item} Python -:sync: s-python - -```python -import requests -import json - -# Configuration -api_url = "https://api.example.com/v1/process" -api_key = "YOUR_API_KEY" - -# Request headers -headers = { - "Content-Type": "application/json", - "Authorization": f"Bearer {api_key}" -} - -# Request payload -payload = { - "batch_size": 1000, - "timeout": "30s", - "enable_logging": True, - "output_format": "json", - "data": [ - {"id": 1, "content": "Sample data"}, - {"id": 2, "content": "More sample data"} - ] -} - -# Make the request -response = requests.post(api_url, headers=headers, json=payload) - -# Handle the response -if response.status_code == 200: - result = response.json() - print("Success:", result) -else: - print(f"Error {response.status_code}: {response.text}") -``` - -::: -:::: - - -### Synced Tabs with Variables - -Adding `:sync:` with a matching value enables syncing. 
- -::::{tab-set} - -:::{tab-item} cURL -:sync: s-curl - -```bash -# Basic API request example -curl -X POST \ - https://api.example.com/v1/process \ - -H "Content-Type: application/json" \ - -H "Authorization: Bearer YOUR_API_KEY" \ - -d '{ - "batch_size": 1000, - "timeout": "30s", - "enable_logging": true, - "output_format": "json", - "data": [ - {"id": 1, "content": "{{ product_name }}"}, - {"id": 2, "content": "{{ version }}"} - ] - }' -``` - -::: - - -:::{tab-item} Python -:sync: s-python - -```python -import requests -import json - -# Configuration -api_url = "https://api.example.com/v1/process" -api_key = "YOUR_API_KEY" - -# Request headers -headers = { - "Content-Type": "application/json", - "Authorization": f"Bearer {api_key}" -} - -# Request payload -payload = { - "batch_size": 1000, - "timeout": "30s", - "enable_logging": True, - "output_format": "json", - "data": [ - {"id": 1, "content": "{{ product_name }}"}, - {"id": 2, "content": "{{ product_name }}"} - ] -} - -# Make the request -response = requests.post(api_url, headers=headers, json=payload) - -# Handle the response -if response.status_code == 200: - result = response.json() - print("Success:", result) -else: - print(f"Error {response.status_code}: {response.text}") -``` - -::: -:::: - -## Demo Parameters - -List tables enable you to control the individual `:widths:` of your columns. 
-
-```{list-table} Sample Configuration Options
-:header-rows: 1
-:widths: 20 30 25 25
-
-* - Parameter
-  - Description
-  - Default Value
-  - Valid Options
-* - `batch_size`
-  - Number of items to process in each batch
-  - `1000`
-  - Any positive integer
-* - `timeout`
-  - Maximum time to wait for operation completion
-  - `30s`
-  - `1s` to `300s`
-* - `enable_logging`
-  - Whether to enable detailed logging output
-  - `true`
-  - `true`, `false`
-* - `output_format`
-  - Format for generated output files
-  - `json`
-  - `json`, `csv`, `parquet`
-```
diff --git a/docs/feature-set-a/category-a/topic-a/subtopic-b.md b/docs/feature-set-a/category-a/topic-a/subtopic-b.md
deleted file mode 100644
index b7d02e0d..00000000
--- a/docs/feature-set-a/category-a/topic-a/subtopic-b.md
+++ /dev/null
@@ -1,686 +0,0 @@
-(section-category-topic-subtopic-b)=
-# Subtopic B
-
-This subtopic demonstrates advanced MyST markdown and Sphinx-design features for technical writers to observe patterns and techniques.
-
-(section-category-topic-subtopic-b-admonitions)=
-## Admonitions & Callouts
-
-Use admonitions sparingly to highlight important information without disrupting flow.
-
-:::{note}
-This is a general informational note. Use for surprising or unexpected behavior.
-:::
-
-:::{tip}
-This is a helpful tip. Use to reveal positive software behavior users might not discover.
-:::
-
-:::{warning}
-This is a warning. Use to identify risk of physical injury or data loss.
-:::
-
-:::{important}
-This is for critical information that users must know.
-:::
-
-:::{seealso}
-Check out the {ref}`section-category-topic-subtopic-a` for related examples.
-:::
-
-(section-category-topic-subtopic-b-dropdowns)=
-## Dropdowns
-
-Use dropdowns for lengthy code blocks or content that might distract from main flow.
-
-:::{dropdown} Configuration Example
-:icon: gear
-
-This dropdown contains configuration details that don't interrupt the main narrative.
- -```yaml -# Complete configuration file -api: - version: "v2" - timeout: 30s - retry_attempts: 3 - -database: - host: "localhost" - port: 5432 - name: "production_db" - -logging: - level: "INFO" - format: "json" - output: "stdout" -``` -::: - -:::{dropdown} Python Implementation Details -:icon: code-square -:color: primary - -Here's the complete implementation with error handling: - -```python -import logging -import requests -from typing import Dict, Optional, Any - -class APIClient: - def __init__(self, base_url: str, api_key: str): - self.base_url = base_url.rstrip('/') - self.api_key = api_key - self.session = requests.Session() - self.session.headers.update({ - 'Authorization': f'Bearer {api_key}', - 'Content-Type': 'application/json' - }) - - def make_request(self, endpoint: str, method: str = 'GET', - data: Optional[Dict] = None) -> Dict[str, Any]: - """Make an API request with proper error handling.""" - url = f"{self.base_url}/{endpoint.lstrip('/')}" - - try: - response = self.session.request(method, url, json=data) - response.raise_for_status() - return response.json() - except requests.exceptions.RequestException as e: - logging.error(f"API request failed: {e}") - raise -``` -::: - -(section-category-topic-subtopic-b-advanced-grids)= -## Advanced Grid Layouts - -Showcase different grid configurations and responsive behavior. - -### Three-Column Feature Grid - -::::{grid} 1 1 3 3 -:gutter: 2 - -:::{grid-item-card} {octicon}`shield;1.5em;sd-mr-1` Security First -:class-header: sd-bg-primary sd-text-white -Security features built into every component with zero-trust architecture. -+++ -{bdg-success}`Enterprise Ready` {bdg-info}`SOC 2 Compliant` -::: - -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` High Performance -:class-header: sd-bg-success sd-text-white -Optimized for speed with sub-millisecond response times at scale. 
-+++ -{bdg-primary}`99.9% Uptime` {bdg-secondary}`Auto-scaling` -::: - -:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Developer Friendly -:class-header: sd-bg-info sd-text-white -Comprehensive APIs, SDKs, and documentation for rapid integration. -+++ -{bdg-warning}`OpenAPI 3.0` {bdg-info}`SDK Available` -::: -:::: - -### Two-Column Comparison Layout - -::::{grid} 1 1 2 2 -:gutter: 3 - -:::{grid-item} -:class: sd-border-1 sd-shadow-sm sd-p-3 - -**Traditional Approach** {octicon}`x-circle;1em;sd-text-danger` - -- Manual configuration required -- Limited scalability options -- Complex deployment process -- Higher maintenance overhead -- Vendor lock-in risks - -::: - -:::{grid-item} -:class: sd-border-1 sd-shadow-lg sd-p-3 sd-bg-light - -**Our Solution** {octicon}`check-circle;1em;sd-text-success` - -- Zero-configuration deployment -- Automatic horizontal scaling -- One-click deployment anywhere -- Self-healing infrastructure -- Cloud-agnostic architecture - -::: -:::: - -(section-category-topic-subtopic-b-complex-tabs)= -## Complex Tab Sets - -Demonstrate various tab configurations with synchronization and different content types. 
- -### Multi-Language Code Examples - -::::{tab-set} - -:::{tab-item} Python -:sync: example-lang - -```python -# Python implementation -import asyncio -import aiohttp - -async def fetch_data(session, url): - async with session.get(url) as response: - return await response.json() - -async def main(): - async with aiohttp.ClientSession() as session: - data = await fetch_data(session, "https://api.example.com/data") - print(f"Received: {data}") - -# Run the async function -asyncio.run(main()) -``` - -**Key Features:** -- Asynchronous processing -- Built-in error handling -- Session management - -::: - -:::{tab-item} Node.js -:sync: example-lang - -```javascript -// Node.js implementation -const fetch = require('node-fetch'); - -async function fetchData(url) { - try { - const response = await fetch(url); - if (!response.ok) { - throw new Error(`HTTP error! status: ${response.status}`); - } - const data = await response.json(); - return data; - } catch (error) { - console.error('Fetch error:', error); - throw error; - } -} - -// Usage -fetchData('https://api.example.com/data') - .then(data => console.log('Received:', data)) - .catch(error => console.error('Error:', error)); -``` - -**Key Features:** -- Promise-based architecture -- Comprehensive error handling -- Modern async/await syntax - -::: - -:::{tab-item} Go -:sync: example-lang - -```go -// Go implementation -package main - -import ( - "encoding/json" - "fmt" - "net/http" - "time" -) - -type APIResponse struct { - Data interface{} `json:"data"` - Status string `json:"status"` - Message string `json:"message"` -} - -func fetchData(url string) (*APIResponse, error) { - client := &http.Client{ - Timeout: 30 * time.Second, - } - - resp, err := client.Get(url) - if err != nil { - return nil, fmt.Errorf("request failed: %w", err) - } - defer resp.Body.Close() - - var result APIResponse - if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { - return nil, fmt.Errorf("decode failed: %w", err) - } - - 
return &result, nil -} - -func main() { - data, err := fetchData("https://api.example.com/data") - if err != nil { - fmt.Printf("Error: %v\n", err) - return - } - fmt.Printf("Received: %+v\n", data) -} -``` - -**Key Features:** -- Strong type safety -- Explicit error handling -- High performance execution - -::: -:::: - -### Platform-Specific Instructions - -::::{tab-set} - -:::{tab-item} Docker -:sync: platform-deploy - -**Step 1: Create Dockerfile** - -```dockerfile -FROM node:18-alpine -WORKDIR /app -COPY package*.json ./ -RUN npm ci --only=production -COPY . . -EXPOSE 3000 -CMD ["npm", "start"] -``` - -**Step 2: Build and Run** - -```bash -docker build -t my-app . -docker run -p 3000:3000 my-app -``` - -::: - -:::{tab-item} Kubernetes -:sync: platform-deploy - -**Step 1: Create Deployment** - -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: my-app -spec: - replicas: 3 - selector: - matchLabels: - app: my-app - template: - metadata: - labels: - app: my-app - spec: - containers: - - name: my-app - image: my-app:latest - ports: - - containerPort: 3000 -``` - -**Step 2: Create Service** - -```yaml -apiVersion: v1 -kind: Service -metadata: - name: my-app-service -spec: - selector: - app: my-app - ports: - - port: 80 - targetPort: 3000 - type: LoadBalancer -``` - -::: - -:::{tab-item} Serverless -:sync: platform-deploy - -**Step 1: Configure Function** - -```yaml -# serverless.yml -service: my-app - -provider: - name: aws - runtime: nodejs18.x - stage: ${opt:stage, 'dev'} - region: us-east-1 - -functions: - api: - handler: src/handler.main - events: - - httpApi: - path: '/{proxy+}' - method: ANY -``` - -**Step 2: Deploy** - -```bash -npm install -g serverless -serverless deploy --stage production -``` - -::: -:::: - -(section-category-topic-subtopic-b-advanced-tables)= -## Advanced Tables & Lists - -### Comparison Matrix - -```{list-table} Feature Comparison Matrix -:header-rows: 1 -:stub-columns: 1 -:widths: 25 20 20 20 15 - -* - Feature - - 
Starter Plan - - Professional - - Enterprise - - Custom -* - **API Calls/Month** - - 10,000 - - 100,000 - - 1,000,000 - - Unlimited -* - **Storage (GB)** - - 1 - - 10 - - 100 - - Unlimited -* - **Team Members** - - 1 - - 5 - - 25 - - Unlimited -* - **SLA Guarantee** - - 99% - - 99.5% - - 99.9% - - 99.99% -* - **Support Level** - - Community - - Email - - Priority - - Dedicated -* - **Custom Integrations** - - {octicon}`x;1em;sd-text-danger` - - {octicon}`check;1em;sd-text-success` - - {octicon}`check;1em;sd-text-success` - - {octicon}`check;1em;sd-text-success` -* - **SSO Integration** - - {octicon}`x;1em;sd-text-danger` - - {octicon}`x;1em;sd-text-danger` - - {octicon}`check;1em;sd-text-success` - - {octicon}`check;1em;sd-text-success` -``` - -### Configuration Parameters - -```{list-table} Configuration Parameters -:header-rows: 1 -:widths: 20 30 15 15 20 - -* - Parameter - - Description - - Type - - Default - - Example -* - `max_connections` - - Maximum concurrent database connections - - Integer - - `100` - - `200` -* - `timeout_seconds` - - Request timeout in seconds - - Integer - - `30` - - `60` -* - `enable_caching` - - Enable response caching - - Boolean - - `true` - - `false` -* - `cache_ttl_minutes` - - Cache time-to-live in minutes - - Integer - - `60` - - `120` -* - `log_level` - - Logging verbosity level - - String - - `INFO` - - `DEBUG` -``` - -(section-category-topic-subtopic-b-mixed-content)= -## Mixed Content Blocks - -Combine different MyST features for rich, informative sections. 
- -### Implementation Guide - -::::{dropdown} Prerequisites Checklist -:icon: list-unordered -:color: info - -Before implementing this solution, ensure you have: - -- [ ] Administrative access to your system -- [ ] Valid API credentials configured -- [ ] Network connectivity to required endpoints -- [ ] Minimum 4GB RAM available -- [ ] Python 3.8+ or Node.js 16+ installed - -:::{tip} -Run the system requirements check script first: `python check_requirements.py` -::: -:::: - -::::{tab-set} - -:::{tab-item} Quick Start -:sync: impl-type - -**5-Minute Setup** - -1. Install the package: - ```bash - pip install our-package - ``` - -2. Initialize configuration: - ```bash - our-package init --interactive - ``` - -3. Start the service: - ```bash - our-package start --daemon - ``` - -```{note} -The quick start uses default settings. For production use, follow the complete setup. -``` - -::: - -:::{tab-item} Complete Setup -:sync: impl-type - -**Production-Ready Configuration** - -1. **Environment Preparation** - - ```bash - # Create isolated environment - python -m venv production_env - source production_env/bin/activate - ``` - -2. **Secure Installation** - - ```bash - # Install with security extras - pip install our-package[security,monitoring] - ``` - -3. **Configuration File** - - ````{dropdown} Complete configuration template - :icon: file-code - - ```yaml - # config/production.yml - server: - host: "0.0.0.0" - port: 8080 - workers: 4 - - database: - url: "${DATABASE_URL}" - pool_size: 20 - max_overflow: 30 - - security: - secret_key: "${SECRET_KEY}" - token_expiry: "24h" - rate_limit: "1000/hour" - - monitoring: - metrics_enabled: true - health_check_path: "/health" - log_level: "INFO" - ```` - -4. **Start with Monitoring** - - ```bash - our-package start --config config/production.yml --monitor - ``` - -::: -:::: - -:::{warning} -Always use environment variables for secrets in production. Never commit credentials to version control. 
-::: - -(section-category-topic-subtopic-b-responsive-design)= -## Responsive Design Examples - -Content that adapts to different screen sizes and contexts. - -### Mobile-First Card Layout - -::::{grid} 1 2 3 4 -:gutter: 1 2 2 3 -:class-container: sd-p-3 - -:::{grid-item-card} {octicon}`database;1.5em` Storage -:class-card: sd-text-center -:shadow: md - -**500GB** -Included storage -::: - -:::{grid-item-card} {octicon}`cloud;1.5em` Bandwidth -:class-card: sd-text-center -:shadow: md - -**1TB/month** -Data transfer -::: - -:::{grid-item-card} {octicon}`cpu;1.5em` Processing -:class-card: sd-text-center -:shadow: md - -**2.4 GHz** -8-core CPU -::: - -:::{grid-item-card} {octicon}`shield;1.5em` Security -:class-card: sd-text-center -:shadow: md - -**256-bit** -Encryption -::: -:::: - -### Flexible Content Blocks - -::::{grid} 1 1 2 2 -:gutter: 2 - -:::{grid-item} -:class: sd-border-2 sd-border-primary sd-p-3 sd-rounded-2 - -#### Getting Started {octicon}`play;1em;sd-text-primary` - -Perfect for developers new to the platform. - -- Interactive tutorials -- Sample projects -- Video walkthroughs -- Community support - -**Time Investment:** 2-3 hours - -::: - -:::{grid-item} -:class: sd-border-2 sd-border-success sd-p-3 sd-rounded-2 - -#### Advanced Integration {octicon}`tools;1em;sd-text-success` - -For teams building production systems. - -- Architecture patterns -- Performance optimization -- Security best practices -- Enterprise features - -**Time Investment:** 1-2 weeks - -::: -:::: - ---- - -:::{seealso} -This comprehensive example demonstrates MyST markdown capabilities. 
For more patterns, explore:
-- {ref}`section-category-topic-subtopic-a` for tabs and tables
-- {ref}`section-category-topic` for grid layouts and comparisons
-:::
\ No newline at end of file
diff --git a/docs/feature-set-a/index.md b/docs/feature-set-a/index.md
deleted file mode 100644
index 9f63de22..00000000
--- a/docs/feature-set-a/index.md
+++ /dev/null
@@ -1,34 +0,0 @@
----
-description: "Master Feature Set A with comprehensive workflows, task guides, tutorials, and reference materials for data processing and analysis."
-tags: ["features", "workflows", "tutorials", "data-processing"]
-categories: ["features"]
----
-
-(feature-set-a)=
-# About Feature Set A
-
-Introduction section.
-
-:::{note}
-This directory will build when you run any make command.
-:::
-
-(feature-set-a-workflow)=
-## Workflow
-
-high level procedural list of a typical workflow using this feature set.
-
-(feature-set-a-task-guides)=
-## Task Guides
-
-atomic task guide links -- how to achieve 1 thing.
-
-(feature-set-a-tutorials)=
-## Tutorials
-
-multi-step guides that use the knowledge of task guides and reference articles to achieve a user goal.
-
-(feature-set-a-references)=
-## References
-
-referential information such as schemas, environment variable options, etc.
\ No newline at end of file
diff --git a/docs/feature-set-a/tutorials/beginner.md b/docs/feature-set-a/tutorials/beginner.md
deleted file mode 100644
index 997f61e1..00000000
--- a/docs/feature-set-a/tutorials/beginner.md
+++ /dev/null
@@ -1,3 +0,0 @@
-(text-tutorials-beginner)=
-# Beginner Tutorial
-
diff --git a/docs/feature-set-a/tutorials/index.md b/docs/feature-set-a/tutorials/index.md
deleted file mode 100644
index af075fe7..00000000
--- a/docs/feature-set-a/tutorials/index.md
+++ /dev/null
@@ -1,49 +0,0 @@
-(feature-set-a-tutorials-index)=
-# Text Curation Tutorials
-
-This section contains practical tutorials that demonstrate how to use NVIDIA NeMo Curator for various text curation tasks.
Each tutorial provides step-by-step guidance for specific use cases. - -(feature-set-a-tutorials-beginner)= -## Beginner Tutorials - -General tutorials focusing on product concepts. - -::::{grid} 1 1 1 1 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Beginner Tutorial -:link: text-tutorials-beginner -:link-type: ref -Get started with basic text data processing using NeMo Curator. Learn how to load, clean, and prepare your text data for curation. -+++ -{bdg-primary}`beginner` -{bdg-secondary}`text-processing` -{bdg-secondary}`data-preparation` -::: - -:::{grid-item-card} {octicon}`mortar-board;1.5em;sd-mr-1` Tutorial Series -:link: feature-set-a-tuts-series-a -:link-type: ref -Learn how to generate synthetic data using OpenAI API compatible services and your own deployed LLM. -+++ -{bdg-secondary}`synthetic-data` -{bdg-secondary}`openai-api` -{bdg-secondary}`reward-models` -::: -:::: - -(feature-set-a-tutorials-advanced)= -## Advanced Tutorials - -Use-case driven tutorials focusing on highlighting typical user goals. - -Potentially link out to notebook tutorials. - -```{toctree} -:maxdepth: 2 -:titlesonly: -:hidden: - -Beginner Tutorial -Series A -``` diff --git a/docs/feature-set-a/tutorials/series-a/index.md b/docs/feature-set-a/tutorials/series-a/index.md deleted file mode 100644 index b7e02489..00000000 --- a/docs/feature-set-a/tutorials/series-a/index.md +++ /dev/null @@ -1,2 +0,0 @@ -(feature-set-a-tuts-series-a)= -# Tutorial Series A \ No newline at end of file diff --git a/docs/feature-set-b/category-a/index.md b/docs/feature-set-b/category-a/index.md deleted file mode 100644 index debb90b2..00000000 --- a/docs/feature-set-b/category-a/index.md +++ /dev/null @@ -1,51 +0,0 @@ -(feature-set-b-category-a)= -# Category A - - Introductory text for this category. - -(feature-set-b-category-a-how-it-works)= -## How it Works - -High-level overview pertaining to this subsection's contents. 
-
----
-
-(feature-set-b-category-a-topic-a)=
-## Topic A
-
-Short description about the articles found in Topic A.
-
-::::{grid} 1 1 1 2
-:gutter: 1 1 1 2
-
-:::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Subtopic A
-:link: feature-set-b-category-topic-subtopic-a
-:link-type: ref
-:link-alt: screen reader description for this link.
-description of this article.
-+++
-{bdg-secondary}`tag`
-{bdg-secondary}`tag`
-{bdg-secondary}`tag`
-:::
-
-:::{grid-item-card} {octicon}`filter;1.5em;sd-mr-1` Subtopic B
-:link: feature-set-b-category-topic-subtopic-b
-:link-type: ref
-:link-alt: screen reader description for this link.
-description of this article.
-+++
-{bdg-secondary}`tag`
-{bdg-secondary}`tag`
-{bdg-secondary}`tag`
-:::
-
-::::
-
-```{toctree}
-:maxdepth: 4
-:titlesonly:
-:hidden:
-
-Topic A
-```
diff --git a/docs/feature-set-b/category-a/topic-a/index.md b/docs/feature-set-b/category-a/topic-a/index.md
deleted file mode 100644
index 4927387f..00000000
--- a/docs/feature-set-b/category-a/topic-a/index.md
+++ /dev/null
@@ -1,33 +0,0 @@
-(feature-set-b-category-topic)=
-# Topic A
-
-Short active-voice introduction for this subsection.
-
-(feature-set-b-category-topic-how-it-works)=
-## How it Works
-
-High-level overview pertaining to this subsection's contents.
-
-::::{tab-set}
-
-:::{tab-item} Tab A
-
-Content here
-
-:::
-
-:::{tab-item} Tab B
-
-Content here
-
-:::
-::::
-
-```{toctree}
-:maxdepth: 4
-:titlesonly:
-:hidden:
-
-Subtopic A
-Subtopic B
-```
diff --git a/docs/feature-set-b/category-a/topic-a/subtopic-a.md b/docs/feature-set-b/category-a/topic-a/subtopic-a.md
deleted file mode 100644
index a8778983..00000000
--- a/docs/feature-set-b/category-a/topic-a/subtopic-a.md
+++ /dev/null
@@ -1,2 +0,0 @@
-(feature-set-b-category-topic-subtopic-a)=
-# Subtopic A
\ No newline at end of file
diff --git a/docs/feature-set-b/category-a/topic-a/subtopic-b.md b/docs/feature-set-b/category-a/topic-a/subtopic-b.md
deleted file mode 100644
index 0cea971c..00000000
--- a/docs/feature-set-b/category-a/topic-a/subtopic-b.md
+++ /dev/null
@@ -1,2 +0,0 @@
-(feature-set-b-category-topic-subtopic-b)=
-# Subtopic B
\ No newline at end of file
diff --git a/docs/feature-set-b/index.md b/docs/feature-set-b/index.md
deleted file mode 100644
index e542752f..00000000
--- a/docs/feature-set-b/index.md
+++ /dev/null
@@ -1,40 +0,0 @@
----
-only: not ga
-description: "Explore Feature Set B's advanced integration capabilities and specialized processing tools available in Early Access."
-tags: ["features", "integration", "beta", "advanced"]
-categories: ["features"]
----
-
-(feature-set-b)=
-# About Feature Set B
-
-Introduction section.
-
-:::{note}
-This directory will not build when you run `make docs-live-ga` because its front matter is set to:
-```
----
-only: not ga
----
-```
-:::
-
-(feature-set-b-workflow)=
-## Workflow
-
-high level procedural list of a typical workflow using this feature set.
-
-(feature-set-b-task-guides)=
-## Task Guides
-
-atomic task guide links -- how to achieve 1 thing.
-
-(feature-set-b-tutorials)=
-## Tutorials
-
-multi-step guides that use the knowledge of task guides and reference articles to achieve a user goal.
- -(feature-set-b-references)= -## References - -referential information such as schemas, environment variable options, etc. \ No newline at end of file diff --git a/docs/feature-set-b/tutorials/beginner.md b/docs/feature-set-b/tutorials/beginner.md deleted file mode 100644 index fef21fb2..00000000 --- a/docs/feature-set-b/tutorials/beginner.md +++ /dev/null @@ -1,40 +0,0 @@ -# Text Curation Beginner Tutorial - -This tutorial demonstrates how to use NeMo Curator to create and curate a dataset of [TBD...]. You'll learn how to: - -- Download and process [TBD data] -- Filter content to focus on [TBD] -- Clean and prepare the text data -- Apply quality filters -- Export the final dataset - -(feature-set-b-tutorials-beginner-prereqs)= -## Before You Start - -- Python 3.8+ -- NeMo Curator installed -- At least 8GB RAM recommended -- ~10GB free disk space for Wikipedia data subset - ---- - -(feature-set-b-tutorials-beginner-setup)= -## Set Up Environment - -First, create a new Python environment and install NeMo Curator: - -```bash -python -m venv nemo_curator_env -source nemo_curator_env/bin/activate # On Windows use: nemo_curator_env\Scripts\activate -pip install nemo-curator -``` - -(feature-set-b-tutorials-beginner-step1)= -## 1. TBD - -(feature-set-b-tutorials-beginner-step2)= -## 2. TBD - -(feature-set-b-tutorials-beginner-step3)= -## 3. TBD - diff --git a/docs/feature-set-b/tutorials/index.md b/docs/feature-set-b/tutorials/index.md deleted file mode 100644 index ddbfbed0..00000000 --- a/docs/feature-set-b/tutorials/index.md +++ /dev/null @@ -1,49 +0,0 @@ -(feature-set-b-tutorials-index)= -# Text Curation Tutorials - -This section contains practical tutorials that demonstrate how to use NVIDIA NeMo Curator for various text curation tasks. Each tutorial provides step-by-step guidance for specific use cases. - -(feature-set-b-tutorials-beginner)= -## Beginner Tutorials - -General tutorials focusing on product concepts. 
- -::::{grid} 1 1 1 1 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Beginner Tutorial -:link: feature-set-b-tutorials-beginner -:link-type: ref -Get started with basic text data processing using NeMo Curator. Learn how to load, clean, and prepare your text data for curation. -+++ -{bdg-primary}`beginner` -{bdg-secondary}`text-processing` -{bdg-secondary}`data-preparation` -::: - -:::{grid-item-card} {octicon}`mortar-board;1.5em;sd-mr-1` Tutorial Series -:link: feature-set-b-tuts-series-a -:link-type: ref -Learn how to generate synthetic data using OpenAI API compatible services and your own deployed LLM. -+++ -{bdg-secondary}`synthetic-data` -{bdg-secondary}`openai-api` -{bdg-secondary}`reward-models` -::: -:::: - -(feature-set-b-tutorials-advanced)= -## Advanced Tutorials - -Use-case driven tutorials focusing on highlighting typical user goals. - -Potentially link out to notebook tutorials. - -```{toctree} -:maxdepth: 2 -:titlesonly: -:hidden: - -Beginner Tutorial -Series A -``` diff --git a/docs/feature-set-b/tutorials/series-a/index.md b/docs/feature-set-b/tutorials/series-a/index.md deleted file mode 100644 index 0d0950c9..00000000 --- a/docs/feature-set-b/tutorials/series-a/index.md +++ /dev/null @@ -1,2 +0,0 @@ -(feature-set-b-tuts-series-a)= -# Tutorial Series A \ No newline at end of file diff --git a/docs/get-started/feature-set-a.md b/docs/get-started/feature-set-a.md deleted file mode 100644 index c5e768b6..00000000 --- a/docs/get-started/feature-set-a.md +++ /dev/null @@ -1,3 +0,0 @@ -(gs-feature-set-a)= -# Feature Set A Quickstart - diff --git a/docs/get-started/feature-set-b.md b/docs/get-started/feature-set-b.md deleted file mode 100644 index 71b0f08c..00000000 --- a/docs/get-started/feature-set-b.md +++ /dev/null @@ -1,8 +0,0 @@ ---- -only: not ga ---- - -(gs-feature-set-b)= -# Feature Set B Quickstart - -This file will not be included in your `make docs-live-ga` or equivalent build command. 
\ No newline at end of file diff --git a/docs/get-started/index.md b/docs/get-started/index.md index 64affcff..06903517 100644 --- a/docs/get-started/index.md +++ b/docs/get-started/index.md @@ -1,44 +1,56 @@ --- -description: "Get started quickly with our platform by following these essential setup steps and choosing the right feature set for your needs." +description: "Get started quickly with NeMo Run by following these essential setup steps and tutorials." tags: ["quickstart", "setup", "beginner", "onboarding"] categories: ["getting-started"] --- -(gs-overview)= -# Get Started with Product +(get-started)= +# Get Started with NeMo Run -Intro section +Welcome to NeMo Run! This guide will help you get up and running quickly with ML experiment management. ## Before You Start -- Link -- Stuff A -- Stuff B +- Ensure you have Python 3.8+ installed +- Have pip configured for package installation +- Access to computing resources (local, cloud, or cluster) --- ## Quickstart Options -Intro sentence. +Choose your path to get started with NeMo Run: -::::{grid} 1 1 1 2 +::::{grid} 1 1 1 3 :gutter: 1 1 1 2 -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Feature Set A Quickstart -:link: gs-feature-set-a -:link-type: ref -:link-alt: screenreader alt for link -Get started with ... +:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Installation +:link: install +:link-type: doc +:link-alt: Installation guide +Install NeMo Run and optional dependencies for your environment +++ -{bdg-secondary}`tag` +{bdg-primary}`Start Here` ::: -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Feature Set B Quickstart -:link: gs-feature-set-b -:only: not ga -:link-type: ref -:link-alt: screenreader alt for link -Get started with ... 
no tags example +:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Quickstart Guide +:link: quickstart +:link-type: doc +:link-alt: Quickstart Guide +Complete guide to install and run your first ML experiment in minutes + ++++ +{bdg-secondary}`Next Steps` +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Tutorials +:link: tutorials +:link-type: doc +:link-alt: Tutorial collection +Learn NeMo Run with hands-on tutorials and examples + ++++ +{bdg-secondary}`Advanced Learning` ::: :::: diff --git a/docs/get-started/install.md b/docs/get-started/install.md new file mode 100644 index 00000000..376abff1 --- /dev/null +++ b/docs/get-started/install.md @@ -0,0 +1,393 @@ +--- +description: "Comprehensive installation guide for NeMo Run with optional dependencies for different computing environments, cloud platforms, and execution backends." +tags: ["installation", "setup", "dependencies", "skypilot", "lepton", "kubernetes", "cloud"] +categories: ["get-started"] +--- + +# Install NeMo Run + +This guide covers the installation of NeMo Run and its optional dependencies for different computing environments and execution backends. 
+ +## Prerequisites + +Before installing NeMo Run, ensure you have the following prerequisites: + +### Version Compatibility + +NeMo Run is compatible with: +- **Python**: 3.8, 3.9, 3.10, 3.11, 3.12 +- **PyTorch**: 2.0+ (for ML workloads) +- **Fiddle**: 0.4+ (for configuration management) +- **TorchX**: 0.5+ (for distributed execution) + +### System Requirements + +- **Python**: 3.8 or higher +- **pip**: Latest version recommended +- **Git**: For cloning repositories and installing from source +- **Operating System**: Linux, macOS, or Windows (with WSL2 recommended for Windows) + +### Python Environment + +We recommend using a virtual environment to isolate dependencies: + +```bash +# Create a virtual environment +python -m venv nemo-run-env + +# Activate the virtual environment +# On Linux/macOS: +source nemo-run-env/bin/activate +# On Windows: +nemo-run-env\Scripts\activate + +# Upgrade pip to latest version +pip install --upgrade pip +``` + +## Core Installation + +### Basic Installation + +Install NeMo Run from the official GitHub repository: + +```bash +pip install git+https://github.com/NVIDIA-NeMo/Run.git +``` + +### Verification + +Verify the installation by checking the version and importing the package: + +```bash +# Check installed version +python -c "import nemo_run; print(nemo_run.__version__ if hasattr(nemo_run, '__version__') else 'Version not available')" + +# Test basic import +python -c "import nemo_run as run; print('✅ NeMo Run installed successfully')" +``` + +## Optional Dependencies + +NeMo Run supports various execution backends and cloud platforms through optional dependencies. + +### SkyPilot Integration + +SkyPilot enables cloud-native execution across multiple cloud providers with automatic resource provisioning and cost optimization. 
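Once SkyPilot is installed, each execution target is described in a small task file. The sketch below is a hypothetical minimal task definition, not an official NeMo Run recipe: the accelerator type, cluster name, and `train.py` script are placeholders you would replace with your own, and the available `resources` fields depend on which clouds SkyPilot can see.

```yaml
# sky-task.yaml -- hypothetical minimal SkyPilot task (fields follow SkyPilot's task schema)
resources:
  accelerators: A100:1   # request one GPU; adjust to what your cloud offers
  cloud: kubernetes      # or aws, gcp, azure, ...

setup: |
  # runs once when the cluster is provisioned
  pip install git+https://github.com/NVIDIA-NeMo/Run.git

run: |
  # the actual workload
  python train.py
```

You would launch this with `sky launch -c my-cluster sky-task.yaml` and tear the cluster down afterward with `sky down my-cluster`.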
+
+#### Installation Options
+
+::::{tab-set}
+
+:::{tab-item} Kubernetes Support
+:sync: sync-kubernetes
+
+Install SkyPilot with Kubernetes support for local and cloud Kubernetes clusters:
+
+```bash
+pip install "nemo_run[skypilot] @ git+https://github.com/NVIDIA-NeMo/Run.git"
+```
+
+This includes:
+
+- SkyPilot core functionality
+- Kubernetes cluster management
+- Local Kubernetes support (Docker Desktop, minikube, kind)
+- Cloud Kubernetes support (GKE, EKS, AKS)
+:::
+
+:::{tab-item} All Cloud Support
+:sync: sync-all-clouds
+
+Install SkyPilot with support for all major cloud providers:
+
+```bash
+pip install "nemo_run[skypilot-all] @ git+https://github.com/NVIDIA-NeMo/Run.git"
+```
+
+This includes:
+
+- All features from Kubernetes support
+- AWS EC2 and EKS support
+- Google Cloud Platform (GCP) support
+- Microsoft Azure support
+- Oracle Cloud Infrastructure (OCI) support
+- IBM Cloud support
+- Lambda Cloud support
+:::
+
+::::
+
+#### Manual SkyPilot Installation
+
+For custom SkyPilot configurations or specific cloud provider support, install manually:
+
+```bash
+# Install SkyPilot core
+pip install skypilot
+
+# Install specific cloud provider support (quote the extras so your
+# shell does not treat the brackets as a glob pattern)
+pip install "skypilot[aws]"     # AWS support
+pip install "skypilot[gcp]"     # Google Cloud support
+pip install "skypilot[azure]"   # Azure support
+pip install "skypilot[lambda]"  # Lambda Cloud support
+```
+
+For detailed SkyPilot installation instructions, refer to the [official SkyPilot documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
+
+### DGX Cloud Lepton Integration
+
+DGX Cloud Lepton provides managed AI infrastructure with pre-configured environments and GPU resources.
+
+#### Lepton CLI Installation
+
+Install the Lepton CLI for DGX Cloud integration:
+
+```bash
+pip install leptonai
+```
+
+#### Authentication Setup
+
+To authenticate with DGX Cloud Lepton:
+
+1. **Access the Lepton UI**: Navigate to your DGX Cloud Lepton dashboard
+2. 
**Generate Access Token**: Go to **Settings > Tokens** page +3. **Copy Login Command**: Copy the `lep login` command displayed on the page +4. **Authenticate**: Run the copied command in your terminal + +Example authentication flow: + +```bash +# The command will look similar to this: +lep login --token + +# Verify authentication +lep whoami +``` + +#### Environment Configuration + +Configure your Lepton environment for optimal performance: + +```bash +# Set default project (optional) +lep config set project + +# Configure default region (optional) +lep config set region + +# List available resources +lep resource list +``` + +### Additional Execution Backends + +#### Kubernetes Support + +For Kubernetes-based execution (without SkyPilot): + +```bash +# Install Kubernetes dependencies +pip install kubernetes +pip install kubeconfig + +# For KubeRay support +pip install kuberay +``` + +#### Slurm Support + +For HPC cluster execution via Slurm: + +```bash +# Install Slurm dependencies +pip install pyslurm + +# For SSH tunnel support +pip install paramiko +``` + +#### Ray Support + +For Ray-based distributed computing: + +```bash +# Install Ray with Kubernetes support +pip install "ray[kubernetes]" +``` + +## Development Installation + +### Install from Source + +For development or to use the latest features, install from source: + +```bash +# Clone the repository +git clone https://github.com/NVIDIA-NeMo/Run.git +cd Run + +# Install in development mode +pip install -e . 
+ +# Install development dependencies +pip install -e ".[dev]" +``` + +### Build Documentation + +To build the documentation locally: + +```bash +# Install documentation dependencies +pip install -e ".[docs]" + +# Build HTML documentation +cd docs +make html + +# Serve locally +python -m http.server 8000 -d _build/html +``` + +## Environment Configuration + +### Set Up Environment Variables + +Configure NeMo Run environment variables: + +```bash +# Set NeMo Run home directory (optional) +export NEMORUN_HOME=~/.nemo_run + +# Set log level (optional) +export NEMORUN_LOG_LEVEL=INFO + +# Enable verbose logging (optional) +export NEMORUN_VERBOSE_LOGGING=false +``` + +### Verify Installation + +Test your installation with a simple example: + +```python +import nemo_run as run + +# Test basic configuration +config = run.Config(lambda x: x, x=42) +result = config.build() +print(f"Test result: {result}") + +# Test executor creation +executor = run.LocalExecutor() +print(f"Executor created: {executor}") +``` + +## Troubleshooting + +### Common Installation Issues + +#### Permission Errors + +If you encounter permission errors during installation: + +```bash +# Use user installation +pip install --user git+https://github.com/NVIDIA-NeMo/Run.git + +# Or use a virtual environment +python -m venv nemo-run-env +source nemo-run-env/bin/activate +pip install git+https://github.com/NVIDIA-NeMo/Run.git +``` + +#### Dependency Conflicts + +If you encounter dependency conflicts: + +```bash +# Install with --no-deps and manually resolve +pip install git+https://github.com/NVIDIA-NeMo/Run.git --no-deps + +# Then install dependencies manually +pip install inquirerpy catalogue fabric fiddle torchx typer rich jinja2 cryptography networkx omegaconf leptonai packaging toml +``` + +#### Git Installation Issues + +If you have issues with Git installation: + +```bash +# Ensure Git is installed +git --version + +# Use HTTPS instead of SSH +pip install git+https://github.com/NVIDIA-NeMo/Run.git + 
+# Or download and install manually
+git clone https://github.com/NVIDIA-NeMo/Run.git
+cd Run
+pip install .
+```
+
+### Verification Commands
+
+Run these commands to verify your installation:
+
+```bash
+# Check Python version
+python --version
+
+# Check pip version
+pip --version
+
+# Check NeMo Run installation
+python -c "import nemo_run; print(f'NeMo Run version: {nemo_run.__version__ if hasattr(nemo_run, \"__version__\") else \"Version not available\"}')"
+
+# Check CLI availability
+python -c "from nemo_run.__main__ import app; print('CLI available')"
+
+# Check executor imports
+python -c "from nemo_run.core.execution import LocalExecutor, SlurmExecutor; print('Executors available')"
+```
+
+## Testing Your Installation
+
+After installation, test your setup with these commands:
+
+```bash
+# Test basic import and functionality
+python -c "
+import nemo_run as run
+print('✅ NeMo Run imported successfully')
+
+# Test configuration creation
+config = run.Config(lambda x: x, x=42)
+print('✅ Configuration created successfully')
+
+# Test executor creation
+executor = run.LocalExecutor()
+print('✅ Executor created successfully')
+
+print('🎉 All tests passed!')
+"
+```
+
+## Next Steps
+
+After successful installation:
+
+1. **Read the Configuration Guide**: Learn about `run.Config` and `run.Partial`
+2. **Try the CLI Tutorial**: Create your first CLI entrypoint
+3. **Explore Execution Backends**: Test different execution environments
+4. **Check the Examples**: Review example configurations and workflows
+
+For more detailed information, refer to the [Configuration Guide](../guides/configuration), [CLI Reference](../reference/cli), and [Execution Guide](../guides/execution).
+
+---
+
+:::{note}
+**Important**: Ensure you have `pip` installed and configured properly before proceeding with the installation. For production deployments, consider using containerized environments for consistent execution across different platforms.
+:::
diff --git a/docs/get-started/quickstart.md b/docs/get-started/quickstart.md
new file mode 100644
index 00000000..9fc984bd
--- /dev/null
+++ b/docs/get-started/quickstart.md
@@ -0,0 +1,390 @@
+---
+description: "Complete quickstart guide for AI developers - Install NeMo Run and run your first ML experiment in minutes."
+tags: ["quickstart", "installation", "first-experiment", "ai-developer", "ml-workflow"]
+categories: ["get-started"]
+---
+
+(quickstart)=
+
+# Quickstart
+
+Get up and running with NeMo Run in under 10 minutes. This guide will walk you through installation, basic configuration, and your first ML experiment.
+
+## Prerequisites
+
+- **Python 3.8+** with pip
+- **Git** for cloning repositories
+- **Basic ML knowledge** (PyTorch, training loops, etc.)
+
+## Installation
+
+### 1. Create Virtual Environment
+
+```bash
+# Create and activate virtual environment
+python -m venv nemo-run-env
+source nemo-run-env/bin/activate  # Linux/macOS
+# or
+nemo-run-env\Scripts\activate     # Windows
+
+# Upgrade pip
+pip install --upgrade pip
+```
+
+### 2. Install NeMo Run
+
+```bash
+# Core installation
+pip install git+https://github.com/NVIDIA-NeMo/Run.git
+
+# Verify installation
+python -c "import nemo_run" && echo '✅ NeMo Run installed successfully' || echo '❌ NeMo Run installation failed'
+```
+
+### 3. Optional: Install Cloud Dependencies
+
+For cloud execution (AWS, GCP, Azure):
+
+```bash
+# SkyPilot for multi-cloud support
+pip install "nemo_run[skypilot] @ git+https://github.com/NVIDIA-NeMo/Run.git"
+
+# Or install manually
+pip install skypilot
+```
+
+## Your First Experiment
+
+Let's create a complete ML experiment that demonstrates NeMo Run's core features.
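Before diving in, it may help to see the pattern that `run.Partial` follows: the standard library's `functools.partial` captures the same configure-now, call-later idea, and you can try it without installing anything. The `train_model` stub here is hypothetical and stands in for the training function built in the steps below:

```python
from functools import partial

def train_model(model_size: int = 128, learning_rate: float = 0.001) -> dict:
    """Stand-in for the real training function; just echoes its settings."""
    return {"model_size": model_size, "learning_rate": learning_rate}

# Fix some arguments now, analogous to run.Partial(train_model, model_size=64)...
train_fn = partial(train_model, model_size=64)

# ...then supply or override the rest when the task actually runs.
result = train_fn(learning_rate=0.01)
print(result)  # {'model_size': 64, 'learning_rate': 0.01}
```

`run.Partial` layers type-safe, serializable configuration on top of this idea, which is what lets NeMo Run ship the same task to local, Docker, or cluster executors unchanged.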
+ +### Step 1: Create Your Training Function + +Create a file `train_model.py`: + +```python +import torch +import torch.nn as nn +import torch.optim as optim +from torch.utils.data import DataLoader, TensorDataset +import numpy as np + +def train_model( + model_size: int = 128, + learning_rate: float = 0.001, + batch_size: int = 32, + epochs: int = 10, + data_size: int = 1000 +): + """ + Simple ML training function with configurable parameters. + + Args: + model_size: Hidden layer size + learning_rate: Learning rate for optimizer + batch_size: Training batch size + epochs: Number of training epochs + data_size: Size of synthetic dataset + """ + # Generate synthetic data + X = torch.randn(data_size, 10) + y = torch.sum(X, dim=1, keepdim=True) + torch.randn(data_size, 1) * 0.1 + + # Create model + model = nn.Sequential( + nn.Linear(10, model_size), + nn.ReLU(), + nn.Linear(model_size, 1) + ) + + # Setup training + optimizer = optim.Adam(model.parameters(), lr=learning_rate) + criterion = nn.MSELoss() + dataset = TensorDataset(X, y) + dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True) + + # Training loop + model.train() + losses = [] + + for epoch in range(epochs): + epoch_loss = 0.0 + for batch_X, batch_y in dataloader: + optimizer.zero_grad() + outputs = model(batch_X) + loss = criterion(outputs, batch_y) + loss.backward() + optimizer.step() + epoch_loss += loss.item() + + avg_loss = epoch_loss / len(dataloader) + losses.append(avg_loss) + + if epoch % 2 == 0: + print(f"Epoch {epoch}: Loss = {avg_loss:.4f}") + + # Evaluate final model + model.eval() + with torch.no_grad(): + test_loss = criterion(model(X), y).item() + + return { + "final_loss": test_loss, + "loss_history": losses, + "model_params": sum(p.numel() for p in model.parameters()), + "config": { + "model_size": model_size, + "learning_rate": learning_rate, + "batch_size": batch_size, + "epochs": epochs + } + } +``` + +### Step 2: Configure Your Experiment + +Create a file 
`experiment_config.py`:
+
+```python
+import nemo_run as run
+from train_model import train_model
+
+# Create a partial function with default parameters
+train_fn = run.Partial(
+    train_model,
+    model_size=128,
+    learning_rate=0.001,
+    batch_size=32,
+    epochs=10
+)
+
+# Create different configurations for experimentation;
+# each run.Partial fixes one set of hyperparameters
+configs = [
+    run.Partial(train_model, model_size=64, learning_rate=0.01),
+    train_fn,  # the defaults configured above
+    run.Partial(train_model, model_size=256, learning_rate=0.0001),
+]
+
+print("✅ Experiment configurations created")
+print(f"Number of configurations: {len(configs)}")
+```
+
+### Step 3: Run Your First Experiment
+
+Create a file `run_experiment.py`:
+
+```python
+import nemo_run as run
+from experiment_config import configs
+
+# Create an experiment to manage multiple runs
+with run.Experiment("quickstart-experiment") as experiment:
+    # Add all configurations to the experiment
+    for i, config in enumerate(configs):
+        print(f"\n🚀 Adding configuration {i+1}/{len(configs)}")
+
+        # Add the task to the experiment
+        experiment.add(
+            config,
+            executor=run.LocalExecutor(),
+            name=f"config_{i+1}"
+        )
+
+    # Run all tasks in the experiment
+    print("\n🚀 Launching experiment...")
+    experiment.run()
+
+    # Get results from the experiment
+    for i, job in enumerate(experiment.jobs):
+        try:
+            if job.state == run.AppState.SUCCEEDED:
+                # For local execution, we can get the result directly
+                # In a real scenario, you'd access logs and artifacts
+                print(f"✅ Configuration {i+1} completed successfully")
+                print(f"   Job ID: {job.id}")
+                print(f"   Status: {job.state}")
+            else:
+                print(f"❌ Configuration {i+1} failed with status: {job.state}")
+        except Exception as e:
+            print(f"❌ Configuration {i+1} error: {e}")
+
+print("\n🎉 Your first NeMo Run experiment is complete!")
+```
+
+### Alternative: Simple Single Task Execution
+
+If you prefer a simpler approach without experiment management:
+
+```python
+import nemo_run as run
+from 
experiment_config import configs + +# Run each configuration individually +results = [] +for i, config in enumerate(configs): + print(f"\n🚀 Running configuration {i+1}/{len(configs)}") + + try: + # Run the task directly + result = run.run(config, executor=run.LocalExecutor()) + results.append(result) + + print(f"✅ Configuration {i+1} completed") + print(f" Result: {result}") + except Exception as e: + print(f"❌ Configuration {i+1} failed: {e}") + results.append(None) + +print("\n🎉 All configurations completed!") +``` + +### Step 4: Execute the Experiment + +```bash +# Run the experiment +python run_experiment.py +``` + +You should see output similar to: + +``` +✅ Experiment configurations created +Number of configurations: 3 + +🚀 Adding configuration 1/3 +🚀 Adding configuration 2/3 +🚀 Adding configuration 3/3 + +🚀 Launching experiment... +Epoch 0: Loss = 0.1234 +Epoch 2: Loss = 0.0987 +... +✅ Configuration 1 completed successfully + Job ID: config_1 + Status: SUCCEEDED +✅ Configuration 2 completed successfully + Job ID: config_2 + Status: SUCCEEDED +✅ Configuration 3 completed successfully + Job ID: config_3 + Status: SUCCEEDED + +🎉 Your first NeMo Run experiment is complete! +``` + +## Next Steps + +### Explore Advanced Features + +1. **Remote Execution**: Try running on different backends + + ```python + # Docker execution + executor = run.DockerExecutor(image="pytorch/pytorch:latest") + + # Slurm execution (if available) + executor = run.SlurmExecutor(partition="gpu", gpus_per_node=1) + ``` + +2. **Experiment Tracking**: Add metrics and logging + + ```python + # In your training function, you can return metrics + return { + "loss": loss.item(), + "accuracy": accuracy, + "learning_rate": learning_rate, + "epoch": epoch + } + ``` + +3. 
**Hyperparameter Tuning**: Create parameter sweeps + + ```python + # Grid search + with run.Experiment("hyperparameter-sweep") as exp: + for lr in [0.001, 0.01, 0.1]: + for batch_size in [16, 32, 64]: + config = run.Config(train_fn, learning_rate=lr, batch_size=batch_size) + exp.add(config, executor=run.LocalExecutor()) + exp.run() + ``` + +### Learn More + +- **Configuration Guide**: Master `run.Config` and `run.Partial` +- **Execution Guide**: Explore different executors and backends +- **Management Guide**: Advanced experiment tracking and management +- **Tutorials**: Hands-on examples and advanced workflows + +## Troubleshooting + +### Common Issues + +**Import Error**: `ModuleNotFoundError: No module named 'nemo_run'` + +```bash +# Ensure virtual environment is activated +source nemo-run-env/bin/activate +pip install git+https://github.com/NVIDIA-NeMo/Run.git +``` + +**Configuration Errors**: If you encounter serialization errors + +```python +# Wrap non-serializable objects in run.Config +import pathlib +config = run.Config(MyClass, data_path=run.Config(pathlib.Path, "/tmp/data")) +``` + +**CUDA Issues**: If you encounter CUDA-related errors + +```python +# Force CPU execution +import torch +torch.cuda.is_available = lambda: False +``` + +**Memory Issues**: For large models or datasets + +```python +# Use smaller batch sizes or model sizes +config = run.Config(train_fn, batch_size=16, model_size=64) +``` + +**Experiment Errors**: If experiments fail to run + +```python +# Add proper error handling +try: + with run.Experiment("test") as exp: + exp.add(config, executor=run.LocalExecutor()) + exp.run() +except Exception as e: + print(f"Experiment failed: {e}") + # Add cleanup code here +``` + +## What You've Learned + +✅ **Configuration Management**: Using `run.Config` and `run.Partial` for flexible parameter management + +✅ **Experiment Tracking**: Creating and managing experiments with `run.Experiment` + +✅ **Local Execution**: Running ML workloads with 
`run.LocalExecutor` + +✅ **Result Collection**: Accessing and analyzing experiment results + +✅ **Basic Workflow**: Complete ML experiment lifecycle with NeMo Run + +You're now ready to scale your ML experiments across different environments and build more complex workflows! + +## Need Help? + +- **Documentation**: Explore the [Configuration](../guides/configuration), [Execution](../guides/execution), and [Management](../guides/management) guides +- **Examples**: Check out the [tutorials](tutorials.md) for more advanced examples +- **Reference**: Consult the [CLI reference](../reference/cli) and [glossary](../reference/glossary) for detailed information diff --git a/docs/get-started/tutorials.md b/docs/get-started/tutorials.md new file mode 100644 index 00000000..62228df0 --- /dev/null +++ b/docs/get-started/tutorials.md @@ -0,0 +1,366 @@ +--- +description: "Comprehensive tutorials and learning resources for NeMo Run - Learn by example with hands-on guides, notebooks, and practical exercises." +tags: ["tutorials", "examples", "hello-world", "notebooks", "learning", "hands-on"] +categories: ["get-started"] +--- + +(tutorials)= + +# Tutorials and Learning Resources + +Welcome to the NeMo Run tutorial collection! This comprehensive guide provides hands-on learning experiences to help you master NeMo Run's capabilities for machine learning experiment management and distributed computing. + +## Learning Path + +Our tutorials are designed to guide you from basic concepts to advanced workflows: + +1. **Getting Started**: Basic configuration and execution +2. **Experiment Management**: Creating and tracking experiments +3. **Advanced Workflows**: Script-based execution and automation +4. **Distributed Computing**: Ray clusters and cloud execution +5. 
**Production Deployment**: Best practices and optimization + +## Tutorial Series + +### Hello World Series + +The `hello_world` tutorial series provides a comprehensive introduction to NeMo Run, demonstrating its core capabilities through practical examples. + +#### What You'll Learn + +- **Configuration Management**: Using `Partial` and `Config` classes for flexible parameter management +- **Execution Backends**: Running functions locally and on remote clusters +- **Visualization**: Creating configuration diagrams with `graphviz` +- **Experiment Tracking**: Managing experiments with `run.Experiment` +- **Automation**: Script-based execution and workflow automation + +#### Tutorial Structure + +::::{grid} 1 1 1 3 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Part 1: Hello World +:link: +:link-type: url +:link-alt: Hello World tutorial part 1 + +**Basic Configuration and Execution** + +Learn the fundamentals of NeMo Run configuration and execution: + +- Creating and configuring Python functions +- Using `Partial` for parameter management +- Basic execution on local and remote backends +- Understanding the execution model +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Part 2: Hello Experiments +:link: +:link-type: url +:link-alt: Hello World tutorial part 2 + +**Experiment Management and Tracking** + +Master experiment lifecycle management: + +- Creating and managing experiments +- Parameter tracking and versioning +- Result collection and analysis +- Experiment comparison and visualization +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Part 3: Hello Scripts +:link: +:link-type: url +:link-alt: Hello World tutorial part 3 + +**Script-Based Execution and Automation** + +Build automated workflows: + +- Script-based experiment execution +- Batch processing and automation +- Integration with CI/CD pipelines +- Production deployment patterns +::: + +:::: + +## Advanced Tutorials + +### Ray Distributed Computing + +Learn to 
leverage Ray for distributed computing across Kubernetes and Slurm environments.
+
+#### Ray Cluster Management
+
+```python
+# Example: Deploying a Ray cluster on Kubernetes
+from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup
+from nemo_run.run.ray.cluster import RayCluster
+
+# Configure KubeRay executor
+executor = KubeRayExecutor(
+    namespace="ml-team",
+    worker_groups=[KubeRayWorkerGroup(
+        group_name="worker",
+        replicas=2,
+        gpus_per_worker=8
+    )]
+)
+
+# Deploy cluster
+cluster = RayCluster(name="training-cluster", executor=executor)
+cluster.start()
+```
+
+#### Ray Job Submission
+
+```python
+# Example: Submitting jobs to Ray cluster
+from nemo_run.run.ray.job import RayJob
+
+# Submit training job
+job = RayJob(name="training-job", executor=executor)
+job.start(
+    command="python train.py --config config.yaml",
+    workdir="/workspace/project/"
+)
+
+# Monitor execution
+job.logs(follow=True)
+```
+
+### Cloud Execution with SkyPilot
+
+Master cloud-native execution with automatic resource provisioning and cost optimization.
+
+#### Multi-Cloud Deployment
+
+```python
+# Example: SkyPilot multi-cloud execution
+import nemo_run as run
+from nemo_run.core.execution.skypilot import SkyPilotExecutor
+
+# Configure SkyPilot executor
+executor = SkyPilotExecutor(
+    cloud="aws",  # or "gcp", "azure", "lambda"
+    instance_type="g4dn.xlarge",
+    region="us-west-2"
+)
+
+# Define the training function
+def train_model(config):
+    # Training logic here
+    return model
+
+# Configure the task and execute it on the cloud
+run.run(run.Partial(train_model, config=config), executor=executor)
+```
+
+### Experiment Management
+
+Learn advanced experiment tracking and management techniques.
+
+#### Experiment Lifecycle
+
+```python
+# Example: Comprehensive experiment management
+import nemo_run as run
+
+# Define parameter space
+param_grid = [
+    {"learning_rate": 0.001, "batch_size": 32},
+    {"learning_rate": 0.01, "batch_size": 64},
+    {"learning_rate": 0.1, "batch_size": 128}
+]
+
+# Create the experiment and add one task per parameter set
+with run.Experiment("hyperparameter-sweep") as experiment:
+    for params in param_grid:
+        experiment.add(
+            run.Partial(train_model, **params),
+            executor=run.LocalExecutor()
+        )
+
+    # Launch all tasks; inspect logs and status per job afterwards
+    experiment.run()
+```
+
+## Interactive Examples
+
+### Jupyter Notebooks
+
+Explore interactive examples in Jupyter notebooks:
+
+- **Configuration Examples**: Learn configuration patterns and best practices
+- **Execution Backends**: Compare different execution environments
+- **Experiment Tracking**: Visualize experiment results and metrics
+- **Distributed Computing**: Hands-on Ray cluster management
+
+### Code Examples
+
+Run complete code examples to understand NeMo Run workflows:
+
+```python
+# Complete example: ML training pipeline
+import nemo_run as run
+from nemo_run.core.execution.docker import DockerExecutor
+
+# Define training function
+def train_model(
+    model_name: str = "gpt2",
+    learning_rate: float = 0.001,
+    batch_size: int = 32,
+    epochs: int = 10
+):
+    """Train a machine learning model."""
+    # Training implementation
+    return {"accuracy": 0.95, "loss": 0.05}
+
+# Configure the task
+train_fn = run.Partial(
+    train_model,
+    model_name="gpt2-large",
+    learning_rate=0.0001,
+    batch_size=64
+)
+
+# Configure executor
+executor = DockerExecutor(
+    container_image="nvidia/pytorch:24.05-py3",
+    num_gpus=1
+)
+
+# Execute training
+run.run(train_fn, executor=executor)
+```
+
+## Learning Resources
+
+### Documentation
+
+- **Configuration Guide**: Deep dive into configuration management
+- **Execution Guide**: Understanding execution backends and environments
+- **CLI Guide**: 
Command-line interface usage and automation
+- **Ray Guide**: Distributed computing with Ray clusters
+
+### Community Resources
+
+- **GitHub Repository**: Source code and issue tracking
+- **Discussions**: Community forums and Q&A
+- **Examples**: Additional code examples and use cases
+- **Contributing**: Guidelines for contributing to NeMo Run
+
+### Best Practices
+
+#### Configuration Management
+
+```python
+# Best practice: Structured configuration
+from dataclasses import dataclass
+
+@dataclass
+class ModelConfig:
+    name: str
+    hidden_size: int = 512
+    num_layers: int = 12
+    dropout: float = 0.1
+
+@dataclass
+class TrainingConfig:
+    learning_rate: float = 0.001
+    batch_size: int = 32
+    epochs: int = 100
+    optimizer: str = "adam"
+
+# Use structured configs
+model_config = ModelConfig(name="gpt2", hidden_size=768)
+training_config = TrainingConfig(learning_rate=0.0001, batch_size=64)
+```
+
+#### Error Handling
+
+```python
+# Best practice: Robust error handling
+import logging
+
+import nemo_run as run
+from nemo_run.core.execution.docker import DockerExecutor
+
+logger = logging.getLogger(__name__)
+
+def robust_training(config):
+    try:
+        # Training logic
+        return train_model(config)
+    except Exception as e:
+        # Log error and return fallback
+        logger.error(f"Training failed: {e}")
+        return {"error": str(e), "status": "failed"}
+
+# Execute with error handling
+executor = DockerExecutor(container_image="nvidia/pytorch:24.05-py3")
+run.run(run.Partial(robust_training, config=config), executor=executor)
+```
+
+#### Resource Management
+
+```python
+# Best practice: Resource-aware execution
+from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup
+
+# Configure resource limits
+executor = KubeRayExecutor(
+    worker_groups=[KubeRayWorkerGroup(
+        group_name="worker",
+        replicas=2,
+        gpus_per_worker=4,
+        cpu_per_worker="8",
+        memory_per_worker="32Gi"
+    )],
+    # Set resource limits
+    resource_limits={
+        "cpu": "16",
+        "memory": "64Gi",
+        "nvidia.com/gpu": "8"
+    }
+)
+```
+ 
+## Getting Help + +### Troubleshooting + +Common issues and solutions: + +1. **Installation Problems**: Check prerequisites and dependencies +2. **Configuration Errors**: Validate configuration syntax and structure +3. **Execution Failures**: Review logs and resource requirements +4. **Performance Issues**: Optimize resource allocation and execution settings + +### Support Channels + +- **Documentation**: Comprehensive guides and API reference +- **GitHub Issues**: Bug reports and feature requests +- **Community Forums**: Discussion and Q&A +- **Examples Repository**: Working code examples + +## Next Steps + +After completing the tutorials: + +1. **Explore Advanced Features**: Dive into distributed computing and cloud execution +2. **Build Your Workflows**: Create custom experiment pipelines +3. **Optimize Performance**: Learn best practices for production deployment +4. **Contribute**: Share your experiences and contribute to the community + +For additional learning resources and community support, visit the [NeMo Run GitHub repository](https://github.com/NVIDIA-NeMo/Run) and [documentation](https://docs.nemo.run). + +--- + +:::{note} +**Note**: The tutorial files referenced in this guide are available in the [NeMo Run examples repository](https://github.com/NVIDIA-NeMo/Run/tree/main/examples). Clone the repository to access the complete tutorial notebooks and scripts. +::: diff --git a/docs/guides/configuration.md b/docs/guides/configuration.md new file mode 100644 index 00000000..52b8a311 --- /dev/null +++ b/docs/guides/configuration.md @@ -0,0 +1,744 @@ +# Configure NeMo Run + +NeMo Run provides a flexible configuration system that allows you to define machine learning experiments in a type-safe, reproducible manner. This guide covers the two main configuration approaches supported by NeMo Run. + +For detailed definitions of configuration terms and concepts, see the [Glossary](../reference/glossary). 
+ +## Configuration Overview + +NeMo Run supports two primary configuration systems: + +- **Python-based configuration**: Type-safe, structured configuration using Fiddle +- **Raw scripts and commands**: Direct script execution for custom workflows + +The Python-based system is the recommended approach for most use cases, offering better type safety, validation, and reproducibility. Raw scripts provide flexibility for legacy workflows or custom execution requirements. + +## Python-Based Configuration + +NeMo Run's Python configuration system is built on top of Fiddle, providing a powerful and flexible way to define experiments. The system uses two main primitives: `run.Config` and `run.Partial`. + +### Core Configuration Primitives + +#### `run.Config` + +`run.Config` creates a complete configuration for a class or function: + +```python +import nemo_run as run +from nemo.collections.llm import LlamaModel, Llama3Config8B + +# Create a configuration for a model +model_config = run.Config( + LlamaModel, + config=run.Config( + Llama3Config8B, + seq_length=16384, + hidden_size=4096, + num_attention_heads=32 + ) +) + +# Build the configuration to instantiate the object +import fiddle as fdl +try: + model = fdl.build(model_config) + print("✅ Model configuration built successfully") +except Exception as e: + print(f"❌ Failed to build model configuration: {e}") + raise +``` + +#### `run.Partial` + +`run.Partial` creates a partially applied function with some arguments fixed: + +```python +# Create a partial function for training +train_fn = run.Partial( + train_model, + optimizer="adam", + learning_rate=0.001, + batch_size=32 +) + +# Later, you can call it with additional arguments +import fiddle as fdl +try: + result = fdl.build(train_fn)(data_path="/path/to/data") + print(f"✅ Training completed with result: {result}") +except Exception as e: + print(f"❌ Training failed: {e}") + raise +``` + +### Configuration Patterns + +#### Basic Model Configuration + +```python +def 
create_model_config( + model_size: str = "8b", + seq_length: int = 16384, + hidden_size: int = 4096 +) -> run.Config: + """Create a standardized model configuration.""" + + if model_size == "8b": + return run.Config( + LlamaModel, + config=run.Config( + Llama3Config8B, + seq_length=seq_length, + hidden_size=hidden_size + ) + ) + elif model_size == "70b": + return run.Config( + LlamaModel, + config=run.Config( + Llama3Config70B, + seq_length=seq_length, + hidden_size=8192 + ) + ) + else: + raise ValueError(f"Unsupported model size: {model_size}") + +# Usage +model_config = create_model_config(model_size="8b", seq_length=8192) +``` + +#### Training Configuration + +```python +def create_training_config( + model_config: run.Config, + num_nodes: int = 1, + gpus_per_node: int = 8, + batch_size: int = 512 +) -> run.Config: + """Create a complete training configuration.""" + + return run.Config( + TrainingJob, + model=model_config, + trainer=run.Config( + Trainer, + num_nodes=num_nodes, + gpus_per_node=gpus_per_node, + precision="bf16-mixed", + max_epochs=100 + ), + data=run.Config( + DataModule, + batch_size=batch_size, + num_workers=4 + ), + optimizer=run.Config( + AdamW, + lr=3e-4, + weight_decay=0.01 + ) + ) + +# Usage +training_config = create_training_config( + model_config=model_config, + num_nodes=4, + gpus_per_node=8 +) +``` + +### Advanced Configuration Features + +#### Configuration Composition + +Combine multiple configurations into complex workflows: + +```python +# Data preprocessing configuration +preprocess_config = run.Config( + PreprocessData, + input_path="/data/raw", + output_path="/data/processed", + tokenizer="llama-tokenizer" +) + +# Training configuration +training_config = run.Config( + TrainModel, + model=model_config, + data_path="/data/processed" +) + +# Evaluation configuration +eval_config = run.Config( + EvaluateModel, + model_path="/checkpoints/best", + test_data="/data/test" +) + +# Complete pipeline +pipeline_config = run.Config( + 
TrainingPipeline, + preprocess=preprocess_config, + training=training_config, + evaluation=eval_config +) +``` + +#### Configuration Validation + +Add validation to your configurations: + +```python +def validate_training_config(config: run.Config) -> bool: + """Validate training configuration parameters.""" + + trainer = config.trainer + data = config.data + + # Check resource requirements + if trainer.num_nodes * trainer.gpus_per_node < 1: + raise ValueError("At least one GPU is required") + + # Check batch size compatibility + if data.batch_size % trainer.gpus_per_node != 0: + raise ValueError("Batch size must be divisible by number of GPUs") + + # Check memory requirements + estimated_memory = data.batch_size * config.model.config.seq_length * 4 # bytes + if estimated_memory > 32 * 1024**3: # 32GB + print("Warning: High memory usage detected") + + return True + +# Usage +if validate_training_config(training_config): + experiment = run.submit(training_config, executor) +``` + +#### Using `run.autoconvert` + +The `@run.autoconvert` decorator automatically converts regular Python functions to NeMo Run configurations: + +```python +import nemo_run as run +from nemo.collections.llm import LlamaModel, Llama3Config8B + +@run.autoconvert +def create_llama_model(seq_length: int = 16384) -> LlamaModel: + """Create a Llama model with specified sequence length.""" + return LlamaModel( + config=Llama3Config8B( + seq_length=seq_length, + hidden_size=4096, + num_attention_heads=32 + ) + ) + +# This automatically becomes a run.Config +model_config = create_llama_model(seq_length=8192) +``` + +**Limitations of `@run.autoconvert`:** + +- No support for control flow (if/else, loops, comprehensions) +- No support for complex expressions +- Limited to simple function definitions + +**Workaround for complex logic:** + +```python +def create_adaptive_model_config( + model_size: str, + seq_length: int, + use_flash_attention: bool = True +) -> run.Config: + """Create model 
configuration with complex logic.""" + + # Complex logic that can't be in @run.autoconvert + if model_size == "8b": + base_config = Llama3Config8B + hidden_size = 4096 + elif model_size == "70b": + base_config = Llama3Config70B + hidden_size = 8192 + else: + raise ValueError(f"Unsupported model size: {model_size}") + + # Dynamic parameter calculation + attention_heads = hidden_size // 128 + if use_flash_attention: + attention_implementation = "flash_attention_2" + else: + attention_implementation = "eager" + + return run.Config( + LlamaModel, + config=run.Config( + base_config, + seq_length=seq_length, + hidden_size=hidden_size, + num_attention_heads=attention_heads, + attention_implementation=attention_implementation + ) + ) +``` + +### Configuration Utilities + +#### Broadcasting Values + +Apply values across nested configurations: + +```python +# Create base configuration +config = run.Config( + TrainingJob, + model=run.Config(LlamaModel, config=run.Config(Llama3Config8B)), + data=run.Config(DataModule, batch_size=32), + optimizer=run.Config(AdamW, lr=0.001) +) + +# Broadcast learning rate to all optimizers +config.broadcast(lr=0.0001) + +# Broadcast batch size to all data modules +config.broadcast(batch_size=64) +``` + +#### Walking Configurations + +Apply transformations to nested configurations: + +```python +# Double all learning rates +config.walk(lr=lambda cfg: cfg.lr * 2) + +# Set all sequence lengths to a specific value +config.walk(seq_length=lambda cfg: 8192) + +# Apply custom transformation +def scale_batch_size(cfg): + if hasattr(cfg, 'batch_size'): + cfg.batch_size = min(cfg.batch_size * 2, 1024) + return cfg + +config.walk(scale_batch_size) +``` + +### YAML Equivalence + +NeMo Run configurations can be understood in terms of YAML/Hydra syntax, making it easier to transition from YAML-based systems. 
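As background for the mappings below: in Hydra-style YAML, the `_target_` key names a callable (as a dotted import path), and every remaining key becomes a keyword argument, applied recursively. The following is a minimal pure-Python sketch of that convention — illustrative only, not NeMo Run or Hydra code — using the standard-library `fractions.Fraction` as a stand-in target:

```python
from importlib import import_module

def instantiate(node):
    """Build an object from a Hydra-style mapping: `_target_` is a dotted
    path to a callable; every other key becomes a keyword argument."""
    if isinstance(node, dict) and "_target_" in node:
        module_path, _, attr = node["_target_"].rpartition(".")
        target = getattr(import_module(module_path), attr)
        kwargs = {k: instantiate(v) for k, v in node.items() if k != "_target_"}
        return target(**kwargs)
    return node  # plain values pass through unchanged

# What yaml.safe_load would return for a small `_target_` document:
node = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
print(instantiate(node))  # 3/4
```

Nested `_target_` mappings instantiate inner objects first, which is exactly how nested `run.Config` objects compose in the examples below.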
+ +#### Basic Configuration Mapping + +**Python configuration:** + +```python +config = run.Config( + LlamaModel, + config=run.Config( + Llama3Config8B, + seq_length=16384, + hidden_size=4096 + ) +) +``` + +**Equivalent YAML:** + +```yaml +_target_: nemo.collections.llm.gpt.model.llama.LlamaModel +config: + _target_: nemo.collections.llm.gpt.model.llama.Llama3Config8B + seq_length: 16384 + hidden_size: 4096 +``` + +#### Partial Function Mapping + +**Python partial:** + +```python +partial = run.Partial( + train_model, + optimizer="adam", + learning_rate=0.001 +) +``` + +**Equivalent YAML:** + +```yaml +_target_: train_model +_partial_: true +optimizer: adam +learning_rate: 0.001 +``` + +#### Configuration Operations + +**Python operations:** + +```python +# Modify configuration +config.config.seq_length *= 2 +config.config.hidden_size = 8192 + +# Broadcast values +config.broadcast(learning_rate=0.0001) +``` + +**Equivalent YAML transformations:** + +```yaml +# After modification +_target_: nemo.collections.llm.gpt.model.llama.LlamaModel +config: + _target_: nemo.collections.llm.gpt.model.llama.Llama3Config8B + seq_length: 32768 # Doubled + hidden_size: 8192 # Changed +``` + +## Raw Script Configuration + +For legacy workflows or custom execution requirements, NeMo Run supports direct script execution. 
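Before looking at the API, note that a raw script is ultimately just text handed to a shell. The plain-Python sketch below (conceptual only — not NeMo Run code) shows what local execution of an inline script amounts to:

```python
import subprocess

inline_script = """
set -e
echo "Starting training..."
echo "Training finished."
"""

# Hand the script text to a shell and surface its output and exit code,
# which is essentially what a local executor does with an inline script.
result = subprocess.run(["sh", "-c", inline_script], capture_output=True, text=True)
print(result.returncode)  # 0
print(result.stdout, end="")
```

Remote executors do the same thing inside a container, after the packager has staged your files into the working directory.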
+ +### File-Based Scripts + +Execute scripts from files: + +```python +# Execute a shell script +script = run.Script("./scripts/train_model.sh") + +# Execute with environment variables +script = run.Script( + "./scripts/train_model.sh", + env_vars={ + "CUDA_VISIBLE_DEVICES": "0,1,2,3", + "PYTHONPATH": "/path/to/code", + "DATA_PATH": "/path/to/data" + } +) + +# Execute with arguments +script = run.Script( + "./scripts/train_model.sh", + args=["--model-size", "8b", "--batch-size", "512"] +) +``` + +### Inline Scripts + +Execute scripts defined inline: + +```python +# Simple inline script +inline_script = run.Script( + inline=""" +#!/bin/bash +set -e + +echo "Starting training..." +export CUDA_VISIBLE_DEVICES=0,1,2,3 +export PYTHONPATH=/path/to/code + +python train.py \ + --model-size 8b \ + --batch-size 512 \ + --learning-rate 0.001 \ + --max-epochs 100 +""" +) + +# Complex inline script with multiple commands +complex_script = run.Script( + inline=""" +#!/bin/bash +set -e + +# Setup environment +source /opt/conda/etc/profile.d/conda.sh +conda activate nemo + +# Download data if not exists +if [ ! -d "/data/dataset" ]; then + echo "Downloading dataset..." + python download_data.py --output /data/dataset +fi + +# Preprocess data +echo "Preprocessing data..." +python preprocess.py \ + --input /data/dataset \ + --output /data/processed \ + --tokenizer llama-tokenizer + +# Train model +echo "Starting training..." +python train.py \ + --model-size 8b \ + --data-path /data/processed \ + --batch-size 512 \ + --learning-rate 0.001 \ + --max-epochs 100 \ + --checkpoint-dir /checkpoints + +# Evaluate model +echo "Evaluating model..." 
+python evaluate.py \ + --model-path /checkpoints/best \ + --test-data /data/test +""" +) +``` + +### Script Configuration Patterns + +#### Parameterized Scripts + +Create reusable script templates: + +```python +def create_training_script( + model_size: str, + batch_size: int, + learning_rate: float, + max_epochs: int +) -> run.Script: + """Create a parameterized training script.""" + + script_content = f""" +#!/bin/bash +set -e + +# Training parameters +MODEL_SIZE={model_size} +BATCH_SIZE={batch_size} +LEARNING_RATE={learning_rate} +MAX_EPOCHS={max_epochs} + +echo "Training configuration:" +echo " Model size: $MODEL_SIZE" +echo " Batch size: $BATCH_SIZE" +echo " Learning rate: $LEARNING_RATE" +echo " Max epochs: $MAX_EPOCHS" + +# Execute training +python train.py \\ + --model-size $MODEL_SIZE \\ + --batch-size $BATCH_SIZE \\ + --learning-rate $LEARNING_RATE \\ + --max-epochs $MAX_EPOCHS \\ + --checkpoint-dir /checkpoints +""" + + return run.Script(inline=script_content) + +# Usage +script = create_training_script( + model_size="8b", + batch_size=512, + learning_rate=0.001, + max_epochs=100 +) +``` + +#### Multi-Stage Scripts + +Create complex workflows with multiple stages: + +```python +def create_pipeline_script() -> run.Script: + """Create a complete ML pipeline script.""" + + return run.Script( + inline=""" +#!/bin/bash +set -e + +# Stage 1: Data preparation +echo "=== Stage 1: Data Preparation ===" +python prepare_data.py \ + --input /data/raw \ + --output /data/processed \ + --tokenizer llama-tokenizer + +# Stage 2: Model training +echo "=== Stage 2: Model Training ===" +python train.py \ + --model-size 8b \ + --data-path /data/processed \ + --batch-size 512 \ + --learning-rate 0.001 \ + --max-epochs 100 \ + --checkpoint-dir /checkpoints + +# Stage 3: Model evaluation +echo "=== Stage 3: Model Evaluation ===" +python evaluate.py \ + --model-path /checkpoints/best \ + --test-data /data/test \ + --output /results/evaluation.json + +# Stage 4: Model deployment 
preparation +echo "=== Stage 4: Deployment Preparation ===" +python export_model.py \ + --model-path /checkpoints/best \ + --output /deployment/model.pt \ + --format torchscript + +echo "Pipeline completed successfully!" +""" + ) +``` + +## Configuration Best Practices + +### Type Safety and Validation + +```python +from typing import Optional, Union +from dataclasses import dataclass + +@dataclass +class TrainingConfig: + """Type-safe training configuration.""" + model_size: str + batch_size: int + learning_rate: float + max_epochs: int + use_mixed_precision: bool = True + + def __post_init__(self): + """Validate configuration after initialization.""" + if self.model_size not in ["8b", "70b"]: + raise ValueError(f"Unsupported model size: {self.model_size}") + + if self.batch_size <= 0: + raise ValueError("Batch size must be positive") + + if self.learning_rate <= 0: + raise ValueError("Learning rate must be positive") + + if self.max_epochs <= 0: + raise ValueError("Max epochs must be positive") + +def create_validated_config(config: TrainingConfig) -> run.Config: + """Create NeMo Run configuration from validated config.""" + return run.Config( + TrainingJob, + model=create_model_config(config.model_size), + trainer=run.Config( + Trainer, + batch_size=config.batch_size, + learning_rate=config.learning_rate, + max_epochs=config.max_epochs, + precision="bf16-mixed" if config.use_mixed_precision else "32" + ) + ) +``` + +### Environment-Specific Configurations + +```python +import os + +def get_environment_config() -> run.Config: + """Get configuration based on environment.""" + + env = os.getenv("NEMO_ENV", "development") + + if env == "development": + return run.Config( + TrainingJob, + model=create_model_config("8b"), + trainer=run.Config( + Trainer, + num_nodes=1, + gpus_per_node=1, + batch_size=32, + max_epochs=5 + ) + ) + elif env == "staging": + return run.Config( + TrainingJob, + model=create_model_config("8b"), + trainer=run.Config( + Trainer, + num_nodes=2, 
+ gpus_per_node=4, + batch_size=256, + max_epochs=50 + ) + ) + elif env == "production": + return run.Config( + TrainingJob, + model=create_model_config("70b"), + trainer=run.Config( + Trainer, + num_nodes=8, + gpus_per_node=8, + batch_size=512, + max_epochs=100 + ) + ) + else: + raise ValueError(f"Unknown environment: {env}") +``` + +### Configuration Composition and Reuse + +```python +# Base configurations for reuse +BASE_MODEL_CONFIG = run.Config( + LlamaModel, + config=run.Config( + Llama3Config8B, + hidden_size=4096, + num_attention_heads=32 + ) +) + +BASE_TRAINER_CONFIG = run.Config( + Trainer, + precision="bf16-mixed", + max_epochs=100, + gradient_clip_val=1.0 +) + +# Compose configurations +def create_experiment_config( + model_size: str, + seq_length: int, + batch_size: int +) -> run.Config: + """Create experiment configuration by composing base configs.""" + + # Start with base configurations + model_config = BASE_MODEL_CONFIG.copy() + trainer_config = BASE_TRAINER_CONFIG.copy() + + # Customize model configuration + model_config.config.seq_length = seq_length + if model_size == "70b": + model_config.config = run.Config(Llama3Config70B) + model_config.config.hidden_size = 8192 + model_config.config.num_attention_heads = 64 + + # Customize trainer configuration + trainer_config.batch_size = batch_size + + return run.Config( + TrainingJob, + model=model_config, + trainer=trainer_config + ) +``` + +This comprehensive guide covers all aspects of NeMo Run configuration, from basic usage to advanced patterns and best practices. Use these patterns to create robust, maintainable, and type-safe machine learning experiment configurations. diff --git a/docs/guides/execution.md b/docs/guides/execution.md new file mode 100644 index 00000000..32a974a0 --- /dev/null +++ b/docs/guides/execution.md @@ -0,0 +1,833 @@ +# Execute NeMo Run + +This guide covers how to execute NeMo Run experiments across different computing environments. 
NeMo Run separates configuration from execution, allowing you to define your task once and run it on various platforms without code changes. + +## Execution Overview + +NeMo Run provides a unified execution framework that abstracts away the complexity of different computing environments. The execution process involves: + +1. **Configuration**: Define your task using `run.Config` or `run.Partial` +2. **Packaging**: Bundle your code and dependencies for remote execution +3. **Launching**: Execute the task using appropriate launchers (torchrun, fault tolerance, etc.) +4. **Management**: Monitor and retrieve results through the experiment interface + +### Key Components + +- **`run.Executor`**: Configures the execution environment and packaging strategy +- **`run.Experiment`**: Manages multiple tasks and provides experiment lifecycle management +- **`run.run()`**: Simple function for single task execution + +> **Important**: NeMo Run requires Docker for remote execution. All remote executors use containerized environments to ensure reproducibility and dependency isolation. + +> **Note**: Experiment metadata is stored in `NEMORUN_HOME` (default: `~/.nemo_run`). Configure this environment variable to control where experiment data is stored. + +## Core Concepts + +For detailed definitions of terms used in this guide, see the [Glossary](../reference/glossary). + +### Execution Units + +An execution unit consists of a task configuration paired with an executor. 
This separation allows you to: + +- Run the same task on different platforms +- Mix and match tasks and executors +- Scale experiments across multiple environments + +```python +import nemo_run as run + +# Define your task +task_config = run.Config(MyTrainingFunction, learning_rate=0.001, batch_size=32) + +# Choose your executor +executor = run.SlurmExecutor(partition="gpu", nodes=2, gpus_per_node=4) + +# Create execution unit +with run.Experiment("my-experiment") as experiment: + experiment.add(task_config, executor=executor) + experiment.run() +``` + +### Experiment Management + +NeMo Run provides comprehensive experiment management through the `run.Experiment` class: + +```python +# Create experiment with multiple tasks +with run.Experiment("multi-task-experiment") as experiment: + # Add tasks with different configurations + experiment.add( + run.Config(MyModel, model_size="small"), + executor=run.LocalExecutor(), + name="small-model" + ) + + experiment.add( + run.Config(MyModel, model_size="large"), + executor=run.SlurmExecutor(partition="gpu", nodes=4), + name="large-model" + ) + + # Launch all tasks + experiment.run() + + # Monitor progress + for job in experiment.jobs: + print(f"Job {job.id}: {job.state}") + if job.state == run.AppState.SUCCEEDED: + print(f"Job {job.id} completed successfully") + elif job.state == run.AppState.FAILED: + print(f"Job {job.id} failed") +``` + +## Code Packaging + +NeMo Run uses packagers to bundle your code and dependencies for remote execution. Each executor supports different packaging strategies. 
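The essence of every packager is the same: select a set of files and bundle them into an archive that is unpacked as the job's working directory. The sketch below is a simplified illustration of that idea in plain Python — not NeMo Run's implementation — reusing the `find`-based selection command described later for `run.PatternPackager`:

```python
import os
import subprocess
import tarfile
import tempfile

def package(include_pattern: str, relative_path: str, out_path: str) -> list:
    # Select files with the same shell command the PatternPackager docs
    # describe: cd {relative_path} && find {include_pattern} -type f
    listing = subprocess.run(
        f"cd {relative_path} && find {include_pattern} -type f",
        shell=True, capture_output=True, text=True, check=True,
    )
    files = sorted(listing.stdout.split())
    # Bundle the matches into a tar.gz -- the artifact handed to the executor.
    with tarfile.open(out_path, "w:gz") as tar:
        for rel in files:
            tar.add(os.path.join(relative_path, rel), arcname=rel)
    return files

# Demo against a throwaway project layout
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "src"))
open(os.path.join(base, "src", "train.py"), "w").close()
open(os.path.join(base, "README.md"), "w").close()

archived = package("src/**", base, os.path.join(base, "code.tar.gz"))
print(archived)  # only files under src/ are selected
```

The real packagers differ mainly in how they compute the file list (Git tree, glob pattern, or a combination), not in the bundling step.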
+ +### Packager Support Matrix + +| Executor | Supported Packagers | +|----------|-------------------| +| LocalExecutor | `run.Packager` | +| DockerExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| SlurmExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| SkypilotExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| DGXCloudExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | +| LeptonExecutor | `run.Packager`, `run.GitArchivePackager`, `run.PatternPackager`, `run.HybridPackager` | + +### Packager Types + +#### `run.Packager` (Base Packager) + +A pass-through packager that doesn't modify your code: + +```python +executor = run.LocalExecutor(packager=run.Packager()) +``` + +#### `run.GitArchivePackager` + +Packages your Git repository using `git archive`: + +```python +packager = run.GitArchivePackager( + subpath="src" # Optional: package only a subdirectory +) + +executor = run.SlurmExecutor( + packager=packager, + # ... other parameters +) +``` + +**How it works:** + +1. Determines the Git repository root using `git rev-parse --show-toplevel` +2. Creates a tar.gz archive using `git archive --format=tar.gz` +3. Extracts the archive as the working directory for your job + +**Directory structure example:** + +``` +Repository structure: +├── docs/ +├── src/ +│ └── my_library/ +└── tests/ + +With subpath="src", working directory becomes: +└── my_library/ +``` + +> **Important**: `git archive` only includes committed changes. Uncommitted modifications are not packaged. 
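The steps above, and the committed-changes caveat, can be reproduced in a throwaway repository (assumes `git` and `tar` are on your PATH):

```shell
# Build a scratch repository with one committed file
repo=$(mktemp -d)
cd "$repo"
git init -q
mkdir src && echo "print('train')" > src/train.py
git add -A
git -c user.name=docs -c user.email=docs@example.com commit -qm "initial commit"

# Step 1: locate the repository root
git rev-parse --show-toplevel

# Leave one file uncommitted, then archive HEAD (steps 2-3)
echo "wip" > src/uncommitted.py
git archive --format=tar.gz -o "$repo/code.tar.gz" HEAD

# Only the committed file appears in the archive
tar -tzf "$repo/code.tar.gz"
```

If your job seems to be missing recent edits, check `git status` before packaging — anything unstaged or uncommitted is silently left out.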
+ +#### `run.PatternPackager` + +Packages files based on pattern matching, useful for non-Git repositories: + +```python +import os + +packager = run.PatternPackager( + include_pattern="src/**", # Include all files under src/ + relative_path=os.getcwd() # Base directory for pattern matching +) + +executor = run.DockerExecutor(packager=packager) +``` + +Pattern matching command: +```bash +cd {relative_path} && find {include_pattern} -type f +``` + +#### `run.HybridPackager` + +Combines multiple packagers into a single archive: + +```python +import os + +hybrid_packager = run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "configs": run.PatternPackager( + include_pattern="configs/*.yaml", + relative_path=os.getcwd() + ), + "data": run.PatternPackager( + include_pattern="data/processed/**", + relative_path=os.getcwd() + ) + } +) + +executor = run.SlurmExecutor(packager=hybrid_packager) +``` + +**Resulting archive structure:** + +``` +archive/ +├── code/ +│ └── my_library/ +├── configs/ +│ ├── model_config.yaml +│ └── data_config.yaml +└── data/ + └── processed/ + └── dataset.parquet +``` + +## Task Launchers + +Launchers determine how your task is executed within the container. They handle distributed training, fault tolerance, and other execution requirements. + +### Available Launchers + +#### Default Launcher (None) + +Direct execution without special launchers: + +```python +executor = run.SlurmExecutor(launcher=None) # Default behavior +``` + +#### torchrun Launcher + +Launches distributed PyTorch training using `torchrun`: + +```python +from nemo_run import Torchrun + +executor = run.SlurmExecutor( + launcher=Torchrun( + nnodes=2, + nproc_per_node=4, + rdzv_backend="c10d", + rdzv_endpoint="localhost:29400" + ) +) +``` + +**Configuration options:** + +- `nnodes`: Number of nodes +- `nproc_per_node`: Processes per node +- `rdzv_backend`: Rendezvous backend (c10d, static, etc.) 
+- `rdzv_endpoint`: Rendezvous endpoint +- `rdzv_id`: Unique rendezvous ID + +#### Fault Tolerance Launcher + +Uses NVIDIA's fault-tolerant launcher for resilient training: + +```python +from nemo_run.core.execution import FaultTolerance + +executor = run.SlurmExecutor( + launcher=FaultTolerance( + max_restarts=3, + restart_delay=60, + checkpoint_interval=1000 + ) +) +``` + +**Configuration options:** + +- `max_restarts`: Maximum number of restart attempts +- `restart_delay`: Delay between restarts (seconds) +- `checkpoint_interval`: Checkpoint frequency (steps) + +> **Note**: Launchers may not work optimally with `run.Script`. Report issues at the NeMo Run GitHub repository. + +## Executor Types + +### Local Execution + +#### `run.LocalExecutor` + +Executes tasks locally in a separate process: + +```python +executor = run.LocalExecutor( + packager=run.Packager(), + env_vars={"CUDA_VISIBLE_DEVICES": "0"} +) +``` + +**Use cases:** + +- Development and debugging +- Quick testing of configurations +- Local experimentation + +### Containerized Execution + +#### `run.DockerExecutor` + +Executes tasks in Docker containers on your local machine: + +```python +executor = run.DockerExecutor( + container_image="nvidia/cuda:11.8-devel-ubuntu20.04", + num_gpus=4, + runtime="nvidia", + ipc_mode="host", + shm_size="32g", + volumes=[ + "/local/data:/container/data", + "/local/models:/container/models" + ], + env_vars={ + "PYTHONUNBUFFERED": "1", + "CUDA_VISIBLE_DEVICES": "0,1,2,3" + }, + packager=run.GitArchivePackager() +) +``` + +**Key parameters:** + +- `container_image`: Docker image with required dependencies +- `num_gpus`: Number of GPUs to allocate (-1 for all available) +- `runtime`: Docker runtime (nvidia for GPU support) +- `volumes`: Host-to-container volume mappings +- `env_vars`: Environment variables for the container + +### High-Performance Computing + +#### `run.SlurmExecutor` + +Executes tasks on Slurm clusters with container support via Pyxis: + +```python +def 
create_slurm_executor( + nodes: int = 1, + gpus_per_node: int = 8, + container_image: str = "nvidia/cuda:11.8-devel-ubuntu20.04" +): + # SSH tunnel for remote execution + ssh_tunnel = run.SSHTunnel( + host="cluster.login.node", + user="username", + job_dir="/home/username/nemo-run-experiments", + identity="~/.ssh/id_rsa" + ) + + # Local tunnel for execution from login node + local_tunnel = run.LocalTunnel() + + packager = run.GitArchivePackager( + subpath="src" # Package only the src directory + ) + + return run.SlurmExecutor( + # Slurm-specific parameters + account="ml_research", + partition="gpu", + nodes=nodes, + ntasks_per_node=8, + gpus_per_node=gpus_per_node, + cpus_per_node=32, + memory_per_node="128G", + time="24:00:00", + + # Container configuration + container_image=container_image, + container_mounts=[ + "/shared/data:/data", + "/shared/models:/models" + ], + + # Execution configuration + tunnel=ssh_tunnel, # Use local_tunnel if on login node + packager=packager, + launcher=Torchrun(nnodes=nodes, nproc_per_node=gpus_per_node), + + # Environment variables + env_vars={ + "NCCL_DEBUG": "INFO", + "NCCL_IB_DISABLE": "0", + "PYTHONUNBUFFERED": "1" + } + ) + +# Usage +executor = create_slurm_executor(nodes=4, gpus_per_node=8) +``` + +#### Job Dependencies + +Create workflow dependencies between Slurm jobs: + +```python +# Data preparation job +data_job = run.submit( + run.Config(PrepareData, dataset="wikitext-103"), + run.SlurmExecutor(partition="cpu", time="02:00:00") +) + +# Training job that depends on data preparation +training_job = run.submit( + run.Config(TrainModel, dataset="wikitext-103"), + run.SlurmExecutor( + partition="gpu", + nodes=4, + gpus_per_node=8, + dependency_type="afterok", # Start after data job succeeds + dependencies=[data_job.id] + ) +) +``` + +**Dependency types:** + +- `afterok` (default): Start after dependency jobs complete successfully +- `afterany`: Start after dependency jobs terminate (any exit code) +- `afternotok`: Start 
after dependency jobs fail
+- `aftercorr`: For array jobs, start each task after the corresponding task in the dependency job completes successfully
+
+### Cloud Execution
+
+#### `run.SkypilotExecutor`
+
+Executes tasks on cloud platforms using SkyPilot:
+
+```python
+def create_skypilot_executor(
+    nodes: int = 1,
+    gpus_per_node: int = 8,
+    container_image: str = "nvidia/cuda:11.8-devel-ubuntu20.04"
+):
+    return run.SkypilotExecutor(
+        # Resource specification
+        gpus="A100-80GB",  # GPU type
+        gpus_per_node=gpus_per_node,
+        nodes=nodes,
+
+        # Container configuration
+        container_image=container_image,
+
+        # Cloud configuration
+        cloud="aws",  # or "gcp", "azure", "kubernetes"
+        region="us-west-2",
+
+        # Optional cluster reuse
+        cluster_name="nemo-training-cluster",
+
+        # Setup commands
+        setup="""
+        # Install additional dependencies
+        pip install transformers datasets
+
+        # Verify GPU availability
+        nvidia-smi
+
+        # Check working directory
+        ls -la ./
+        """,
+
+        # Environment variables
+        env_vars={
+            "PYTHONUNBUFFERED": "1",
+            "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7"
+        },
+
+        # Packaging
+        packager=run.GitArchivePackager()
+    )
+
+# Usage
+executor = create_skypilot_executor(nodes=2, gpus_per_node=8)
+```
+
+**Prerequisites:**
+
+```bash
+# Install SkyPilot support
+pip install "nemo_run[skypilot]"
+
+# Configure cloud credentials
+sky check
+```
+
+#### `run.DGXCloudExecutor`
+
+Executes tasks on NVIDIA DGX Cloud using Run:ai API:
+
+```python
+def create_dgx_executor(
+    nodes: int = 1,
+    gpus_per_node: int = 8,
+    container_image: str = "nvidia/cuda:11.8-devel-ubuntu20.04"
+):
+    return run.DGXCloudExecutor(
+        # API configuration
+        base_url="https://your-cluster.domain.com/api/v1",
+        app_id="your-runai-app-id",
+        app_secret="your-runai-app-secret",
+        project_name="your-project",
+
+        # Resource configuration
+        nodes=nodes,
+        gpus_per_node=gpus_per_node,
+
+        # Container configuration
+        container_image=container_image,
+
+        # Storage configuration
+        pvcs=[
+            {
+                "name": "nemo-data-pvc",
+                "path": "/workspace/data"
+            }
+        ],
+
+        # 
Environment variables + env_vars={ + "PYTHONUNBUFFERED": "1", + "NEMORUN_HOME": "/workspace/nemo-run" + }, + + # Packaging + packager=run.GitArchivePackager() + ) +``` + +> **Warning**: DGXCloudExecutor currently only supports launching from pods running on the DGX Cloud cluster itself. The launching pod must have access to a Persistent Volume Claim (PVC) for experiment storage. + +#### `run.LeptonExecutor` + +Executes tasks on NVIDIA DGX Cloud Lepton clusters: + +```python +def create_lepton_executor( + nodes: int = 1, + gpus_per_node: int = 8, + container_image: str = "nvidia/cuda:11.8-devel-ubuntu20.04" +): + return run.LeptonExecutor( + # Resource configuration + resource_shape="gpu.8xh100-80gb", # Resource shape per node + node_group="training-nodes", + nodes=nodes, + gpus_per_node=gpus_per_node, + + # Container configuration + container_image=container_image, + + # Storage configuration + nemo_run_dir="/workspace/nemo-run", + mounts=[ + { + "path": "/workspace/data", + "mount_path": "/workspace/data" + } + ], + + # Environment variables + env_vars={ + "PYTHONUNBUFFERED": "1", + "NEMORUN_HOME": "/workspace/nemo-run" + }, + + # Packaging + packager=run.GitArchivePackager() + ) +``` + +## Advanced Execution Features + +### Multi-Environment Execution + +Run the same task across different environments: + +```python +# Define task once +task_config = run.Config( + TrainModel, + model_size="1.3B", + dataset="wikitext-103", + learning_rate=0.0001 +) + +# Create executors for different environments +executors = { + "local": run.LocalExecutor(), + "docker": run.DockerExecutor( + container_image="nvidia/cuda:11.8-devel-ubuntu20.04", + num_gpus=2 + ), + "slurm": run.SlurmExecutor( + partition="gpu", + nodes=2, + gpus_per_node=4 + ), + "cloud": run.SkypilotExecutor( + gpus="A100-80GB", + nodes=2, + gpus_per_node=4 + ) +} + +# Launch experiments +experiments = {} +for name, executor in executors.items(): + experiments[name] = run.submit( + task_config, + executor, + 
metadata={"environment": name} + ) +``` + +### Resource Optimization + +Dynamically configure resources based on task requirements: + +```python +def create_adaptive_executor(task_config): + """Create executor with resources optimized for the task.""" + + # Analyze task requirements + model_size = task_config.model_size + batch_size = task_config.batch_size + + # Calculate optimal resources + if model_size == "small" and batch_size <= 32: + return run.SlurmExecutor( + partition="gpu", + nodes=1, + gpus_per_node=2, + memory_per_node="64G" + ) + elif model_size == "medium" and batch_size <= 128: + return run.SlurmExecutor( + partition="gpu", + nodes=2, + gpus_per_node=4, + memory_per_node="128G" + ) + else: + return run.SlurmExecutor( + partition="gpu", + nodes=4, + gpus_per_node=8, + memory_per_node="256G" + ) + +# Usage +executor = create_adaptive_executor(task_config) +experiment = run.submit(task_config, executor) +``` + +### Fault Tolerance and Recovery + +Implement robust execution with automatic recovery: + +```python +# Configure fault-tolerant launcher +fault_tolerant_launcher = FaultTolerance( + max_restarts=5, + restart_delay=120, + checkpoint_interval=500, + checkpoint_dir="/workspace/checkpoints" +) + +# Use with any executor +executor = run.SlurmExecutor( + partition="gpu", + nodes=4, + gpus_per_node=8, + launcher=fault_tolerant_launcher, + time="48:00:00" # Long time for fault tolerance +) + +# Submit with automatic recovery +experiment = run.submit(task_config, executor) +``` + +## Best Practices + +### Configuration Management + +Use environment-specific configurations: + +```python +def get_executor_for_environment(env: str): + if env == "development": + return run.LocalExecutor() + elif env == "staging": + return run.DockerExecutor(container_image="staging-image") + elif env == "production": + return run.SlurmExecutor(partition="production-gpu") + else: + raise ValueError(f"Unknown environment: {env}") +``` + +Parameterize executor creation: + 
+```python +def create_executor_factory( + base_image: str = "nvidia/cuda:11.8-devel-ubuntu20.04", + default_gpus: int = 4 +): + def create_executor(nodes: int, gpus_per_node: int = None): + if gpus_per_node is None: + gpus_per_node = default_gpus + + return run.SlurmExecutor( + container_image=base_image, + nodes=nodes, + gpus_per_node=gpus_per_node, + partition="gpu" + ) + + return create_executor + +# Usage +factory = create_executor_factory() +executor = factory(nodes=2) +``` + +### Resource Management + +Monitor resource usage: + +```python +# Check cluster status before submission +executor = run.SlurmExecutor(partition="gpu") +status = executor.get_cluster_status() +print(f"Available nodes: {status.available_nodes}") +print(f"Queue length: {status.queue_length}") + +# Only submit if resources are available +if status.available_nodes >= 2: + experiment = run.submit(task_config, executor) +else: + print("Insufficient resources, waiting...") +``` + +Use resource quotas: + +```python +executor = run.SlurmExecutor( + partition="gpu", + qos="high_priority", # Quality of service + account="ml_research", # Account/charge code + exclusive=True # Exclusive node access +) +``` + +### Error Handling + +Implement comprehensive error handling: + +```python +try: + experiment = run.submit(task_config, executor) + + # Wait for completion with timeout + experiment.wait(timeout=3600) # 1 hour timeout + + if experiment.failed: + logs = run.get_logs(experiment) + print(f"Experiment failed: {logs.stderr}") + + # Implement retry logic + if experiment.retry_count < 3: + experiment.retry() + +except Exception as e: + print(f"Execution failed: {e}") + # Implement fallback strategy +``` + +Validate configurations before execution: + +```python +def validate_executor_config(executor): + """Validate executor configuration before submission.""" + + if isinstance(executor, run.SlurmExecutor): + # Check if partition exists + partitions = executor.list_partitions() + if executor.partition 
not in partitions: + raise ValueError(f"Partition {executor.partition} not found") + + # Check resource availability + if executor.nodes > executor.get_max_nodes(): + raise ValueError(f"Requested {executor.nodes} nodes, max available: {executor.get_max_nodes()}") + + return True + +# Usage +validate_executor_config(executor) +experiment = run.submit(task_config, executor) +``` + +### Performance Optimization + +Optimize container images: + +```dockerfile +# Use multi-stage builds for smaller images +FROM nvidia/cuda:11.8-devel-ubuntu20.04 as base + +# Install only necessary dependencies +RUN apt-get update && apt-get install -y \ + python3.9 \ + python3-pip \ + && rm -rf /var/lib/apt/lists/* + +# Install Python dependencies +COPY requirements.txt . +RUN pip install -r requirements.txt + +# Final stage +FROM base as runtime +WORKDIR /workspace +CMD ["python3"] +``` + +Use efficient packaging strategies: + +```python +# For development with frequent changes +dev_packager = run.PatternPackager( + include_pattern="src/**", + relative_path=os.getcwd() +) + +# For production with version control +prod_packager = run.GitArchivePackager( + subpath="src" +) + +# Choose based on environment +packager = dev_packager if is_development else prod_packager +executor = run.SlurmExecutor(packager=packager) +``` + +This comprehensive guide covers all aspects of NeMo Run execution, from basic usage to advanced features and best practices. Use these patterns to build robust, scalable machine learning workflows across different computing environments. diff --git a/docs/guides/index.md b/docs/guides/index.md new file mode 100644 index 00000000..5847d3a0 --- /dev/null +++ b/docs/guides/index.md @@ -0,0 +1,94 @@ +--- +description: "Comprehensive guides for NeMo Run features including configuration, execution, and management." 
+tags: ["guides", "configuration", "execution", "management", "tutorials"] +categories: ["guides"] +--- + +(guides)= + +# NeMo Run Guides + +Welcome to the NeMo Run guides. These comprehensive guides will help you master the core features and capabilities of NeMo Run for ML experiment management. + +## Guides + +Explore the topics below to learn how to set up, customize, and optimize your machine learning experiments with NeMo Run. + +::::{grid} 1 1 2 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration +:link: configuration +:link-type: doc +:link-alt: Configuration guide + +Learn how to configure your ML experiments with type-safe, flexible configuration management. +::: + +:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Execution +:link: execution +:link-type: doc +:link-alt: Execution guide + +Execute your experiments across local, Docker, Slurm, Kubernetes, and cloud environments. +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Management +:link: management +:link-type: doc +:link-alt: Management guide + +Manage and monitor your experiments with comprehensive tracking and reproducibility. +::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Ray Clusters and Jobs +:link: ray +:link-type: doc +:link-alt: Deploy Ray Clusters and Jobs + +Deploy and manage Ray clusters and jobs for scalable distributed computing. +::: + +:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Packaging Strategies +:link: packaging +:link-type: doc +:link-alt: NeMo Run Packaging Strategies + +Deploy your code using Git archives, pattern matching, or hybrid packaging strategies. +::: + +:::: + +## Get Started + +If you're new to NeMo Run, we recommend following these guides in order: + +1. **Configuration** - Start here to understand how to configure your experiments +2. **Execution** - Learn how to run your configured experiments +3. **Management** - Discover how to track and manage your experiments +4. 
**Packaging Strategies** - Learn how to package your code for remote execution +5. **Ray Clusters and Jobs** - Learn distributed computing with Ray (optional) + +## What You'll Learn + +Each guide provides: + +- **Step-by-step instructions** with practical examples +- **Code samples** that you can run immediately +- **Best practices** for production use +- **Troubleshooting tips** for common issues +- **Advanced features** for power users + +## Prerequisites + +Before diving into these guides, make sure you have: + +- NeMo Run installed (see [Installation Guide](../get-started/install)) +- Basic Python knowledge +- Access to computing resources (local, cloud, or cluster) + +## Need Help? + +- Check the [FAQs](../reference/faqs) for common questions +- Explore the [About section](../about/index) for conceptual information +- Review the [tutorials](../get-started/tutorials) for hands-on examples diff --git a/docs/guides/management.md b/docs/guides/management.md new file mode 100644 index 00000000..41f44b5a --- /dev/null +++ b/docs/guides/management.md @@ -0,0 +1,655 @@ +# Manage NeMo Run Experiments + +NeMo Run provides a comprehensive experiment management system centered around the `Experiment` class. This system enables you to define, launch, monitor, and manage complex machine learning workflows with multiple interdependent tasks. This guide covers all aspects of experiment lifecycle management. + +## Experiment Management Overview + +The `Experiment` class serves as the central orchestrator for managing multi-task workflows in NeMo Run. 
It provides: + +- **Task Orchestration**: Define and manage multiple tasks with dependencies +- **Execution Control**: Launch, monitor, and control experiment execution +- **Metadata Tracking**: Automatic tracking of experiment metadata and artifacts +- **Logging and Monitoring**: Real-time log access and status monitoring +- **Reproducibility**: Complete experiment state capture for reproducibility + +## Creating and Configuring Experiments + +### Basic Experiment Creation + +Create an experiment with a descriptive title: + +```python +import nemo_run as run + +# Create a simple experiment +experiment = run.Experiment("transformer-finetuning") + +# Create with additional metadata +experiment = run.Experiment( + "llama3-pretraining", + description="Pre-training Llama3 8B model on custom dataset", + tags=["llama3", "pretraining", "8b"], + metadata={ + "dataset": "wikitext-103", + "model_size": "8B", + "framework": "NeMo" + } +) +``` + +### Experiment Configuration + +Configure experiment behavior and logging: + +```python +experiment = run.Experiment( + "distributed-training", + log_level="INFO", # DEBUG, INFO, WARNING, ERROR + max_retries=3, # Number of retry attempts for failed tasks + timeout=3600, # Global timeout in seconds + checkpoint_interval=300, # Checkpoint frequency in seconds + metadata={ + "project": "ml-research", + "team": "ai-team", + "priority": "high" + } +) +``` + +## Task Management + +### Adding Individual Tasks + +Add single tasks to your experiment: + +```python +# Define your task configurations +model_config = run.Config( + TrainModel, + model_size="8b", + learning_rate=0.001, + batch_size=32 +) + +data_config = run.Config( + PreprocessData, + input_path="/data/raw", + output_path="/data/processed" +) + +# Create experiment and add tasks +with run.Experiment("ml-pipeline") as exp: + # Add training task + training_id = exp.add( + model_config, + executor=run.SlurmExecutor( + partition="gpu", + nodes=2, + gpus_per_node=4 + ), + 
name="model-training" + ) + + # Add data preprocessing task + data_id = exp.add( + data_config, + executor=run.LocalExecutor(), + name="data-preprocessing" + ) +``` + +### Adding Task Groups + +Add multiple tasks that execute in parallel: + +```python +# Define multiple model configurations +model_configs = [ + run.Config(TrainModel, model_size="8b", learning_rate=0.001), + run.Config(TrainModel, model_size="8b", learning_rate=0.0001), + run.Config(TrainModel, model_size="70b", learning_rate=0.001) +] + +# Create experiment with parallel tasks +with run.Experiment("hyperparameter-sweep") as exp: + # Add all model configurations to run in parallel + task_ids = exp.add( + model_configs, + executor=run.SlurmExecutor( + partition="gpu", + nodes=1, + gpus_per_node=8 + ), + name="model-variants" + ) + + # Add evaluation task that depends on all training tasks + exp.add( + run.Config(EvaluateModels, model_paths="/checkpoints/*"), + executor=run.LocalExecutor(), + name="model-evaluation", + dependencies=task_ids # Wait for all training tasks to complete + ) +``` + +### Task Dependencies and Workflows + +Create complex workflows with task dependencies: + +```python +def create_ml_pipeline(): + """Create a complete ML pipeline with dependencies.""" + + with run.Experiment("complete-ml-pipeline") as exp: + # Stage 1: Data preparation + data_prep_id = exp.add( + run.Config(PrepareData, dataset="wikitext-103"), + executor=run.LocalExecutor(), + name="data-preparation" + ) + + # Stage 2: Model training (depends on data preparation) + training_id = exp.add( + run.Config(TrainModel, data_path="/data/processed"), + executor=run.SlurmExecutor(partition="gpu", nodes=4), + name="model-training", + dependencies=[data_prep_id] + ) + + # Stage 3: Model evaluation (depends on training) + eval_id = exp.add( + run.Config(EvaluateModel, model_path="/checkpoints/best"), + executor=run.LocalExecutor(), + name="model-evaluation", + dependencies=[training_id] + ) + + # Stage 4: Model deployment 
(depends on evaluation) + deploy_id = exp.add( + run.Config(DeployModel, model_path="/checkpoints/best"), + executor=run.DockerExecutor(), + name="model-deployment", + dependencies=[eval_id] + ) + + return exp + +# Usage +pipeline = create_ml_pipeline() +``` + +### Using Plugins + +Plugins allow you to modify tasks and executors together: + +```python +# Define a custom plugin +class MixedPrecisionPlugin(run.Plugin): + """Plugin to enable mixed precision training.""" + + def modify_task(self, task): + """Modify task to use mixed precision.""" + if hasattr(task, 'precision'): + task.precision = "bf16-mixed" + return task + + def modify_executor(self, executor): + """Modify executor with mixed precision environment variables.""" + if not hasattr(executor, 'env_vars'): + executor.env_vars = {} + executor.env_vars.update({ + "NCCL_P2P_DISABLE": "1", + "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128" + }) + return executor + +# Use plugin with tasks +with run.Experiment("mixed-precision-training") as exp: + exp.add( + run.Config(TrainModel, model_size="8b"), + executor=run.SlurmExecutor(partition="gpu"), + plugins=[MixedPrecisionPlugin()], + name="training-with-mixed-precision" + ) +``` + +## Experiment Execution + +### Launching Experiments + +Launch experiments with different execution modes: + +```python +with run.Experiment("my-experiment") as exp: + # Add tasks... 
+ + # Launch with different options + exp.run( + detach=False, # Stay attached to monitor progress + sequential=False, # Execute tasks in parallel where possible + tail_logs=True, # Show real-time logs + direct=False # Use remote execution + ) +``` + +### Execution Modes + +#### Attached Execution (Default) + +Monitor experiment progress in real-time: + +```python +with run.Experiment("monitored-experiment") as exp: + exp.add(task_config, executor=executor) + + # Launch and monitor + exp.run( + detach=False, # Stay attached + tail_logs=True # Show logs in real-time + ) + + # Check final status + print(f"Experiment completed with status: {exp.status()}") +``` + +#### Detached Execution + +Launch experiments and detach for long-running tasks: + +```python +with run.Experiment("long-running-experiment") as exp: + exp.add(task_config, executor=executor) + + # Launch and detach + exp.run(detach=True) + + # Save experiment ID for later monitoring + experiment_id = exp.experiment_id + print(f"Experiment launched with ID: {experiment_id}") + + # Later, retrieve and monitor + retrieved_exp = run.get_experiment(experiment_id) + print(retrieved_exp.status()) +``` + +#### Sequential Execution + +Execute tasks one after another: + +```python +with run.Experiment("sequential-experiment") as exp: + exp.add([task1, task2, task3], executor=executor) + + # Execute sequentially + exp.run(sequential=True) +``` + +#### Direct Execution + +Execute tasks directly in the current process: + +```python +with run.Experiment("direct-execution") as exp: + exp.add(task_config, executor=executor) + + # Execute directly (no remote execution) + exp.run(direct=True) +``` + +## Monitoring and Status + +### Checking Experiment Status + +Monitor experiment and task status: + +```python +# Check overall experiment status +status = experiment.status() +print(f"Experiment status: {status}") + +# Get detailed task information +for task in experiment.tasks: + print(f"Task {task.name}: {task.status}") + 
print(f" - Job ID: {task.job_id}") + print(f" - Executor: {task.executor}") + print(f" - Directory: {task.local_directory}") + print(f" - Start time: {task.start_time}") + print(f" - End time: {task.end_time}") +``` + +### Real-Time Monitoring + +Monitor experiments in real-time: + +```python +import time + +def monitor_experiment(experiment, check_interval=30): + """Monitor experiment progress in real-time.""" + + print(f"Monitoring experiment: {experiment.experiment_id}") + + while True: + status = experiment.status() + print(f"\nStatus at {time.strftime('%H:%M:%S')}:") + + for task in experiment.tasks: + print(f" {task.name}: {task.status}") + + # Check if all tasks are complete + if all(task.status in ['SUCCEEDED', 'FAILED', 'CANCELLED'] + for task in experiment.tasks): + print("\nExperiment completed!") + break + + time.sleep(check_interval) + +# Usage +with run.Experiment("monitored-experiment") as exp: + exp.add(task_config, executor=executor) + exp.run(detach=False) + monitor_experiment(exp) +``` + +### Task Control + +Control individual tasks: + +```python +# Cancel a specific task +experiment.cancel("task_id") + +# Cancel all tasks +experiment.cancel_all() + +# Retry a failed task +experiment.retry("task_id") + +# Pause/resume tasks +experiment.pause("task_id") +experiment.resume("task_id") +``` + +## Logging and Debugging + +### Accessing Task Logs + +Retrieve and analyze task logs: + +```python +# Get logs for a specific task +task_logs = experiment.get_logs("task_id") +print(f"Exit code: {task_logs.exit_code}") +print(f"Stdout: {task_logs.stdout}") +print(f"Stderr: {task_logs.stderr}") + +# Stream logs in real-time +for log_entry in experiment.stream_logs("task_id"): + print(f"{log_entry.timestamp}: {log_entry.message}") + +# Get logs for all tasks +all_logs = experiment.get_all_logs() +for task_name, logs in all_logs.items(): + print(f"\n=== {task_name} ===") + print(f"Exit code: {logs.exit_code}") + if logs.stderr: + print(f"Errors: {logs.stderr}") 
+``` + +### Log Analysis + +Analyze logs for debugging and monitoring: + +```python +import re +from typing import Dict, List + +def analyze_experiment_logs(experiment) -> Dict[str, Dict]: + """Analyze logs from all tasks in an experiment.""" + + analysis = {} + + for task in experiment.tasks: + logs = experiment.get_logs(task.id) + + # Extract metrics + metrics = {} + if logs.stdout: + # Extract loss values + loss_pattern = r"loss: (\d+\.\d+)" + losses = re.findall(loss_pattern, logs.stdout) + if losses: + metrics['losses'] = [float(l) for l in losses] + metrics['final_loss'] = float(losses[-1]) + + # Extract accuracy values + acc_pattern = r"accuracy: (\d+\.\d+)" + accuracies = re.findall(acc_pattern, logs.stdout) + if accuracies: + metrics['accuracies'] = [float(a) for a in accuracies] + metrics['final_accuracy'] = float(accuracies[-1]) + + # Extract errors + errors = [] + if logs.stderr: + error_lines = logs.stderr.split('\n') + errors = [line.strip() for line in error_lines if line.strip()] + + analysis[task.name] = { + 'status': task.status, + 'exit_code': logs.exit_code, + 'metrics': metrics, + 'errors': errors, + 'duration': task.end_time - task.start_time if task.end_time else None + } + + return analysis + +# Usage +analysis = analyze_experiment_logs(experiment) +for task_name, data in analysis.items(): + print(f"\n=== {task_name} Analysis ===") + print(f"Status: {data['status']}") + print(f"Exit code: {data['exit_code']}") + if data['metrics']: + print(f"Final loss: {data['metrics'].get('final_loss', 'N/A')}") + print(f"Final accuracy: {data['metrics'].get('final_accuracy', 'N/A')}") + if data['errors']: + print(f"Errors: {len(data['errors'])} found") +``` + +## Experiment Metadata and Artifacts + +### Metadata Management + +Track and retrieve experiment metadata: + +```python +# Add metadata during creation +experiment = run.Experiment( + "hyperparameter-sweep", + metadata={ + "dataset": "wikitext-103", + "model_family": "llama", + "sweep_type": 
"learning_rate", + "num_trials": 10 + } +) + +# Add metadata after creation +experiment.add_metadata({ + "completed_at": time.time(), + "best_accuracy": 0.95, + "best_model_path": "/checkpoints/best" +}) + +# Retrieve metadata +metadata = experiment.get_metadata() +print(f"Dataset: {metadata.get('dataset')}") +print(f"Best accuracy: {metadata.get('best_accuracy')}") +``` + +### Artifact Management + +Track and retrieve experiment artifacts: + +```python +# Add artifacts +experiment.add_artifact( + "best_model", + "/checkpoints/best_model.pt", + description="Best performing model checkpoint" +) + +experiment.add_artifact( + "training_logs", + "/logs/training.log", + description="Complete training logs" +) + +# Retrieve artifacts +artifacts = experiment.get_artifacts() +for name, artifact in artifacts.items(): + print(f"{name}: {artifact.path}") + print(f" Description: {artifact.description}") + print(f" Size: {artifact.size} bytes") +``` + +## Experiment Reproducibility + +### Experiment Snapshots + +Create reproducible experiment snapshots: + +```python +# Create a snapshot of the current experiment state +snapshot = experiment.create_snapshot() + +# Save snapshot to file +snapshot.save("/path/to/snapshot.json") + +# Load and reproduce experiment +loaded_snapshot = run.ExperimentSnapshot.load("/path/to/snapshot.json") +reproduced_experiment = loaded_snapshot.reproduce() +``` + +### Configuration Tracking + +Track configuration changes: + +```python +# Track configuration versions +experiment.track_config_version( + "model_config", + model_config, + description="Initial model configuration" +) + +# Update configuration +model_config.learning_rate = 0.0005 +experiment.track_config_version( + "model_config", + model_config, + description="Updated learning rate" +) + +# Retrieve configuration history +config_history = experiment.get_config_history("model_config") +for version, config in config_history.items(): + print(f"Version {version}: {config.description}") +``` + 
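The configuration-tracking pattern above can be sketched with plain Python, independent of NeMo Run. The `ConfigHistory` class and its methods below are illustrative stand-ins, not part of the NeMo Run API; the key idea is deep-copying each tracked version so later mutations don't rewrite history:

```python
import copy
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ConfigVersion:
    config: Any
    description: str
    timestamp: float = field(default_factory=time.time)


class ConfigHistory:
    """Minimal append-only history of named configuration versions."""

    def __init__(self):
        self._versions: Dict[str, List[ConfigVersion]] = {}

    def track(self, name: str, config: Any, description: str) -> int:
        """Record a new version and return its version number."""
        versions = self._versions.setdefault(name, [])
        # Deep-copy so mutating the live config doesn't alter stored versions
        versions.append(ConfigVersion(copy.deepcopy(config), description))
        return len(versions) - 1

    def history(self, name: str) -> List[ConfigVersion]:
        """Return all recorded versions for a config name (oldest first)."""
        return self._versions.get(name, [])


# Usage mirrors the NeMo Run snippet above
history = ConfigHistory()
cfg = {"learning_rate": 0.001, "batch_size": 32}
history.track("model_config", cfg, "Initial model configuration")

cfg["learning_rate"] = 0.0005
history.track("model_config", cfg, "Updated learning rate")

for version, entry in enumerate(history.history("model_config")):
    print(f"Version {version}: {entry.description} (lr={entry.config['learning_rate']})")
```

Because of the deep copy, version 0 still reports the original learning rate of 0.001 even though the live `cfg` dict was mutated afterwards.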
+## Best Practices + +### Experiment Organization + +```python +def create_organized_experiment( + project_name: str, + experiment_type: str, + model_size: str +) -> run.Experiment: + """Create a well-organized experiment with consistent naming.""" + + # Create descriptive name + experiment_name = f"{project_name}-{experiment_type}-{model_size}" + + # Add comprehensive metadata + metadata = { + "project": project_name, + "experiment_type": experiment_type, + "model_size": model_size, + "created_by": os.getenv("USER", "unknown"), + "created_at": time.time(), + "git_commit": get_git_commit_hash(), + "environment": os.getenv("NEMO_ENV", "development") + } + + return run.Experiment(experiment_name, metadata=metadata) + +# Usage +exp = create_organized_experiment( + project_name="llama-finetuning", + experiment_type="supervised", + model_size="8b" +) +``` + +### Error Handling and Recovery + +```python +def run_experiment_with_recovery(experiment_config, max_retries=3): + """Run experiment with automatic error recovery.""" + + for attempt in range(max_retries): + try: + with run.Experiment("recovery-experiment") as exp: + exp.add(experiment_config, executor=executor) + exp.run() + + # Check for failures + failed_tasks = [task for task in exp.tasks if task.status == 'FAILED'] + if failed_tasks: + print(f"Attempt {attempt + 1} failed with {len(failed_tasks)} failed tasks") + if attempt < max_retries - 1: + print("Retrying...") + continue + else: + raise Exception("Max retries exceeded") + + print("Experiment completed successfully!") + return exp + + except Exception as e: + print(f"Attempt {attempt + 1} failed: {e}") + if attempt == max_retries - 1: + raise +``` + +### Resource Management + +```python +def create_resource_aware_experiment(resource_constraints): + """Create experiment that respects resource constraints.""" + + with run.Experiment("resource-aware-experiment") as exp: + # Check available resources + available_gpus = get_available_gpus() + available_memory = 
get_available_memory() + + # Adjust configuration based on resources + if available_gpus < resource_constraints['min_gpus']: + raise ValueError("Insufficient GPU resources") + + # Create adaptive executor + executor = run.SlurmExecutor( + partition="gpu", + nodes=min(resource_constraints['max_nodes'], available_gpus // 8), + gpus_per_node=min(8, available_gpus) + ) + + exp.add(task_config, executor=executor) + return exp +``` + +This comprehensive guide covers all aspects of NeMo Run experiment management, from basic usage to advanced monitoring and reproducibility features. Use these patterns to build robust, maintainable, and reproducible machine learning workflows. diff --git a/docs/guides/packaging.md b/docs/guides/packaging.md new file mode 100644 index 00000000..5c73de05 --- /dev/null +++ b/docs/guides/packaging.md @@ -0,0 +1,620 @@ +--- +description: "Complete guide to NeMo Run packaging strategies including GitArchive, Pattern, and Hybrid packagers for code deployment." +tags: ["packaging", "deployment", "code", "archives", "remote-execution"] +categories: ["guides"] +--- + +(packaging)= + +# NeMo Run Packaging Strategies + +NeMo Run provides flexible packaging strategies to deploy your code to remote execution environments. Understanding these packaging options is crucial for ensuring your experiments run correctly across different computing environments. + +## Overview + +Packaging determines how your local code is transferred to remote execution environments. 
NeMo Run supports multiple packaging strategies: + +- **Base Packager**: Simple pass-through packaging +- **Git Archive Packager**: Version-controlled code packaging +- **Pattern Packager**: File pattern-based packaging +- **Hybrid Packager**: Combine multiple packaging strategies + +## Packaging Support Matrix + +| Executor | Supported Packagers | +|----------|-------------------| +| LocalExecutor | `run.Packager` | +| DockerExecutor | All packagers | +| SlurmExecutor | All packagers | +| SkypilotExecutor | All packagers | +| DGXCloudExecutor | All packagers | +| LeptonExecutor | All packagers | + +## Base Packager + +The `run.Packager` is a simple pass-through packager that doesn't perform any special packaging operations. + +```python +import nemo_run as run + +# Simple passthrough packager +packager = run.Packager() + +executor = run.DockerExecutor( + container_image="pytorch/pytorch:latest", + packager=packager +) +``` + +**Use Cases:** + +- When your code is already available in the container +- For simple scripts that don't require complex packaging +- When using pre-built images with your code + +## Git Archive Packager + +The `run.GitArchivePackager` uses `git archive` to package version-controlled code, ensuring only committed changes are deployed. + +### How It Works + +1. **Base Path Detection**: Uses `git rev-parse --show-toplevel` to find the repository root +2. **Subpath (sub-directory path) Configuration**: Optionally defines a subpath within the repository +3. **Archive Creation**: Creates a tar.gz archive of the specified code +4. 
**Working Directory**: The extracted archive becomes the working directory for your job + +### Basic Usage + +```python +import nemo_run as run + +# Package the entire repository +packager = run.GitArchivePackager() + +# Package a specific subdirectory +packager = run.GitArchivePackager(subpath="src") + +executor = run.SlurmExecutor( + account="my_account", + partition="gpu", + packager=packager +) +``` + +### Directory Structure Examples + +**Repository Structure:** + +``` +my_project/ +├── docs/ +├── src/ +│ ├── models/ +│ ├── data/ +│ └── utils/ +├── tests/ +├── configs/ +└── README.md +``` + +**With `subpath="src"`:** + +```python +packager = run.GitArchivePackager(subpath="src") +``` + +**Working directory on remote:** + +``` +models/ +data/ +utils/ +``` + +**With `subpath=""` (default):** + +```python +packager = run.GitArchivePackager() # or subpath="" +``` + +**Working directory on remote:** + +``` +docs/ +src/ +tests/ +configs/ +README.md +``` + +### Advanced Configuration + +```python +import nemo_run as run + +# Custom subpath and working directory +packager = run.GitArchivePackager( + subpath="ml_experiments", # Package from ml_experiments/ + working_dir="/workspace" # Extract to /workspace on remote +) + +# Package specific branches or commits +packager = run.GitArchivePackager( + subpath="src", + ref="feature/new-model" # Git reference (branch, tag, commit) +) +``` + +### Best Practices + +1. **Commit Your Changes** + + ```bash + # Always commit before running remote jobs + git add . + git commit -m "Update model configuration" + ``` + +2. **Use Meaningful Subpaths (sub-directory paths)** + + ```python + # Good: Clear subpath + packager = run.GitArchivePackager(subpath="experiments/transformer") + + # Avoid: Too broad + packager = run.GitArchivePackager() # Packages everything + ``` + +3. 
**Handle Large Repositories** + + ```python + # Package only necessary components + packager = run.GitArchivePackager(subpath="src/models") + ``` + +### Limitations + +- **Uncommitted Changes**: `git archive` doesn't include uncommitted changes +- **Git Dependencies**: Requires a Git repository +- **Archive Size**: Large repositories create large archives + +## Pattern Packager + +The `run.PatternPackager` uses file patterns to package code that may not be under version control or when you need fine-grained control over what gets packaged. + +### How It Works + +1. **Pattern Matching**: Uses `find` command with specified patterns +2. **File Selection**: Includes only files matching the patterns +3. **Relative Paths**: Maintains relative directory structure +4. **Archive Creation**: Creates a tar.gz archive of matched files + +### Basic Usage + +```python +import nemo_run as run +import os + +# Package all Python files in current directory +packager = run.PatternPackager( + include_pattern="*.py", + relative_path=os.getcwd() +) + +# Package specific directories +packager = run.PatternPackager( + include_pattern="src/**", + relative_path=os.getcwd() +) +``` + +### Pattern Examples + +```python +import nemo_run as run +import os + +# Package Python files only +packager = run.PatternPackager( + include_pattern="*.py", + relative_path=os.getcwd() +) + +# Package entire src directory +packager = run.PatternPackager( + include_pattern="src/**", + relative_path=os.getcwd() +) + +# Package multiple patterns +packager = run.PatternPackager( + include_pattern="src/**/*.py configs/*.yaml", + relative_path=os.getcwd() +) + +# Package with exclusions +packager = run.PatternPackager( + include_pattern="src/**/*.py", + exclude_pattern="src/**/*_test.py", + relative_path=os.getcwd() +) + +# Package from different base directory +packager = run.PatternPackager( + include_pattern="**/*.py", + relative_path="/path/to/project" +) +``` + +### Advanced Configuration + +```python +import 
nemo_run as run +import os + +# Complex pattern matching +packager = run.PatternPackager( + include_pattern="src/**/*.py models/**/*.py configs/*.yaml", + exclude_pattern="**/*_test.py **/__pycache__/**", + relative_path=os.getcwd(), + working_dir="/workspace/code" # Custom working directory +) + +# Package with custom archive name +packager = run.PatternPackager( + include_pattern="src/**", + relative_path=os.getcwd(), + archive_name="my_experiment_code.tar.gz" +) +``` + +### Use Cases + +1. **Non-Git Projects** + + ```python + # Package code not under version control + packager = run.PatternPackager( + include_pattern="experiments/**/*.py", + relative_path=os.getcwd() + ) + ``` + +2. **Selective Packaging** + + ```python + # Package only specific components + packager = run.PatternPackager( + include_pattern="models/transformer.py utils/data_loader.py", + relative_path=os.getcwd() + ) + ``` + +3. **Generated Code** + + ```python + # Package generated artifacts + packager = run.PatternPackager( + include_pattern="generated/**/*.py", + relative_path=os.getcwd() + ) + ``` + +## Hybrid Packager + +The `run.HybridPackager` allows you to combine multiple packaging strategies into a single archive, useful when you need different packaging approaches for different parts of your project. + +### How It Works + +1. **Multiple Packagers**: Combines several packagers into one +2. **Directory Organization**: Each packager's output goes to a specified directory +3. **Archive Merging**: Creates a single archive with organized structure +4. 
**Conflict Resolution**: Handles file name conflicts between packagers + +### Basic Usage + +```python +import nemo_run as run +import os + +# Combine Git archive with pattern packager +hybrid_packager = run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "configs": run.PatternPackager( + include_pattern="configs/*.yaml", + relative_path=os.getcwd() + ) + } +) + +executor = run.SlurmExecutor( + account="my_account", + packager=hybrid_packager +) +``` + +### Directory Structure + +**Local Structure:** + +``` +project/ +├── src/ +│ ├── models/ +│ └── utils/ +├── configs/ +│ ├── model.yaml +│ └── data.yaml +├── generated/ +│ └── artifacts/ +└── README.md +``` + +**Hybrid Packager Configuration:** + +```python +hybrid_packager = run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "configs": run.PatternPackager( + include_pattern="configs/*.yaml", + relative_path=os.getcwd() + ), + "artifacts": run.PatternPackager( + include_pattern="generated/**", + relative_path=os.getcwd() + ) + } +) +``` + +**Remote Working Directory:** + +``` +code/ +├── models/ +└── utils/ +configs/ +├── model.yaml +└── data.yaml +artifacts/ +└── generated/ + └── artifacts/ +``` + +### Advanced Configuration + +```python +import nemo_run as run +import os + +# Extract at root (no subdirectories) +hybrid_packager = run.HybridPackager( + sub_packagers={ + "": run.GitArchivePackager(subpath="src"), # Extract to root + "configs": run.PatternPackager( + include_pattern="configs/*.yaml", + relative_path=os.getcwd() + ) + }, + extract_at_root=True # All contents go to root +) + +# Custom working directory +hybrid_packager = run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "configs": run.PatternPackager( + include_pattern="configs/*.yaml", + relative_path=os.getcwd() + ) + }, + working_dir="/workspace/experiment" +) +``` + +### Use Cases + +1. 
**Mixed Version Control** + + ```python + # Some code in Git, some not + hybrid_packager = run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "experiments": run.PatternPackager( + include_pattern="experiments/**", + relative_path=os.getcwd() + ) + } + ) + ``` + +2. **Different Packaging Strategies** + + ```python + # Git for code, pattern for configs and data + hybrid_packager = run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "configs": run.PatternPackager( + include_pattern="configs/*.yaml", + relative_path=os.getcwd() + ), + "data": run.PatternPackager( + include_pattern="data/processed/**", + relative_path=os.getcwd() + ) + } + ) + ``` + +3. **Generated Code with Source** + + ```python + # Source code from Git, generated code from patterns + hybrid_packager = run.HybridPackager( + sub_packagers={ + "source": run.GitArchivePackager(subpath="src"), + "generated": run.PatternPackager( + include_pattern="generated/**", + relative_path=os.getcwd() + ) + } + ) + ``` + +## Best Practices + +### 1. Choose the Right Packager + +```python +# Use GitArchivePackager for version-controlled code +if is_git_repo(): + packager = run.GitArchivePackager(subpath="src") +else: + packager = run.PatternPackager(include_pattern="src/**") + +# Use HybridPackager for complex projects +if has_multiple_sources(): + packager = run.HybridPackager(sub_packagers={...}) +``` + +### 2. Optimize Package Size + +```python +# Package only necessary files +packager = run.PatternPackager( + include_pattern="src/**/*.py configs/*.yaml", + exclude_pattern="**/*_test.py **/__pycache__/** **/.git/**" +) +``` + +### 3. Handle Dependencies + +```python +# Ensure dependencies are available +packager = run.GitArchivePackager(subpath="src") +executor = run.DockerExecutor( + container_image="pytorch/pytorch:latest", # Has dependencies + packager=packager +) +``` + +### 4. 
Test Packaging Locally + +```python +# Test packaging before remote execution +packager = run.GitArchivePackager(subpath="src") +# Use LocalExecutor to test packaging +executor = run.LocalExecutor(packager=packager) +``` + +## Troubleshoot + +### Common Issues + +1. **Git Archive Issues** + + ```bash + # Error: Not a git repository + # Solution: Ensure you're in a git repository + git init + git add . + git commit -m "Initial commit" + ``` + +2. **Pattern Matching Issues** + + ```python + # Error: No files found + # Solution: Check pattern and relative path + packager = run.PatternPackager( + include_pattern="src/**/*.py", + relative_path=os.getcwd() # Ensure this is correct + ) + ``` + +3. **Large Archive Issues** + + ```python + # Solution: Use more specific patterns + packager = run.PatternPackager( + include_pattern="src/models/**/*.py", # More specific + relative_path=os.getcwd() + ) + ``` + +### Debug + +1. **Check Package Contents** + + ```python + # Use LocalExecutor to inspect packaging + executor = run.LocalExecutor(packager=packager) + ``` + +2. **Verify Patterns** + + ```bash + # Test patterns locally + find . -name "*.py" -path "src/**" + ``` + +3. 
**Check Archive Size** + + ```python + # Monitor archive size for large projects + packager = run.GitArchivePackager(subpath="src") + ``` + +## Examples + +### Complete Example: ML Experiment Packaging + +```python +import nemo_run as run +import os + +def create_experiment_packager(): + """Create a comprehensive packager for ML experiments.""" + + # Check if we're in a git repository + if os.path.exists(".git"): + # Use hybrid packager for git + generated content + return run.HybridPackager( + sub_packagers={ + "code": run.GitArchivePackager(subpath="src"), + "configs": run.PatternPackager( + include_pattern="configs/*.yaml experiments/*.yaml", + relative_path=os.getcwd() + ), + "data": run.PatternPackager( + include_pattern="data/processed/**", + relative_path=os.getcwd() + ), + "artifacts": run.PatternPackager( + include_pattern="generated/**", + relative_path=os.getcwd() + ) + } + ) + else: + # Use pattern packager for non-git projects + return run.PatternPackager( + include_pattern="src/**/*.py configs/*.yaml data/processed/**", + exclude_pattern="**/*_test.py **/__pycache__/**", + relative_path=os.getcwd() + ) + +# Usage +packager = create_experiment_packager() +executor = run.SlurmExecutor( + account="my_account", + partition="gpu", + packager=packager +) +``` + +This packaging system provides the flexibility to handle various project structures and deployment scenarios, ensuring your code is properly packaged and deployed to remote execution environments. diff --git a/docs/guides/ray.md b/docs/guides/ray.md new file mode 100644 index 00000000..c4a730c5 --- /dev/null +++ b/docs/guides/ray.md @@ -0,0 +1,970 @@ +--- +description: "Comprehensive guide to deploying and managing Ray clusters and jobs with NeMo Run for distributed computing on Kubernetes and Slurm environments." 
+tags: ["ray", "distributed", "kubernetes", "slurm", "clusters", "jobs", "distributed-computing"]
+categories: ["guides"]
+---
+
+# Deploy Ray Clusters and Jobs
+
+> **Audience**: Users familiar with NeMo Run executors who need distributed computing capabilities using Ray in Kubernetes or Slurm environments.
+>
+> **Overview**: NeMo Run provides unified abstractions for Ray cluster and job management across different execution backends, enabling seamless distributed computing workflows.
+
+## Architecture Overview
+
+NeMo Run's Ray integration provides a unified interface for distributed computing across multiple execution environments. The architecture consists of two primary abstractions:
+
+### Core Components
+
+| Component | Purpose | Supported Backends |
+|-----------|---------|-------------------|
+| `RayCluster` | Manages long-lived Ray clusters for interactive development | KubeRay (Kubernetes), Slurm |
+| `RayJob` | Submits batch jobs to Ray clusters with automatic lifecycle management | KubeRay (Kubernetes), Slurm |
+
+### Execution Model
+
+```mermaid
+graph TB
+    A[NeMo Run API] --> B[RayCluster/RayJob]
+    B --> C[KubeRay Executor]
+    B --> D[Slurm Executor]
+    C --> E[Kubernetes Cluster]
+    D --> F[Slurm Cluster]
+    E --> G[Ray Head Node]
+    E --> H[Ray Worker Nodes]
+    F --> I[Ray Head Node]
+    F --> J[Ray Worker Nodes]
+```
+
+## RayCluster vs RayJob: Choosing the Right Approach
+
+NeMo Run offers two distinct approaches for Ray-based distributed computing, each optimized for different use cases and workflows.
+
+### RayCluster: Interactive Development
+
+RayCluster provides persistent, long-lived Ray clusters ideal for interactive development and iterative workflows.
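A toy overhead model makes the main trade-off with ephemeral RayJob submissions concrete (the startup time and job count below are illustrative, not measurements):

```python
def scheduling_overhead_s(n_jobs: int, cluster_startup_s: float, persistent: bool) -> float:
    """Time spent waiting on cluster provisioning, excluding actual job runtime.

    persistent=True models a long-lived RayCluster (startup paid once);
    persistent=False models ephemeral RayJob submissions (startup paid per job).
    """
    return cluster_startup_s if persistent else n_jobs * cluster_startup_s

# Ten iterative runs against a cluster that takes ~5 minutes to provision
print(scheduling_overhead_s(10, 300.0, persistent=True))   # 300.0
print(scheduling_overhead_s(10, 300.0, persistent=False))  # 3000.0
```

For a single one-shot job the two models coincide, which is why RayJob remains the better fit for batch pipelines: the same overhead buys automatic cleanup.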
+ +**Key Characteristics:** +- **Lifetime**: Remains active until explicitly stopped via `.stop()` +- **Resource Efficiency**: Single cluster setup cost amortized across multiple jobs +- **Multi-tenancy**: Supports multiple concurrent jobs on the same cluster +- **Dashboard Access**: Full Ray dashboard access via port forwarding +- **Use Cases**: Interactive development, debugging, hyperparameter tuning, iterative experimentation + +**When to Use RayCluster:** +- Interactive development with Jupyter notebooks or Ray CLI +- Multiple sequential job submissions requiring shared state +- Long-running experiments with frequent parameter adjustments +- Development workflows requiring persistent cluster state + +### RayJob: Batch Processing + +RayJob provides ephemeral clusters optimized for batch processing and automated workflows. + +**Key Characteristics:** +- **Lifetime**: Ephemeral - automatically terminates after job completion +- **Resource Efficiency**: Resources freed immediately after job completion +- **Single-tenancy**: One job per cluster instance +- **Dashboard Access**: Limited access (cluster terminates with job) +- **Use Cases**: CI/CD pipelines, scheduled training, production inference, automated workflows + +**When to Use RayJob:** +- Automated batch processing pipelines +- CI/CD workflows requiring reproducible execution +- Production inference jobs with predictable resource requirements +- Scenarios requiring automatic cleanup and resource management + +### Decision Matrix + +| Factor | RayCluster | RayJob | +|--------|------------|--------| +| **Development Phase** | Interactive/Exploratory | Production/Batch | +| **Job Frequency** | Multiple jobs per session | Single job per submission | +| **Resource Utilization** | High (shared cluster) | Low (ephemeral) | +| **Setup Overhead** | One-time | Per submission | +| **State Persistence** | Yes | No | +| **Automation** | Manual management | Fully automated | + +## Kubernetes Integration with KubeRay + 
+KubeRay provides native Ray support on Kubernetes, enabling cloud-native distributed computing with container orchestration. + +### KubeRay Architecture + +KubeRay extends Kubernetes with custom resources for Ray cluster management: + +- **RayCluster**: Custom resource defining Ray cluster topology +- **RayJob**: Custom resource for job submission to Ray clusters +- **RayService**: Custom resource for serving Ray applications + +### KubeRay Executor Configuration + +The `KubeRayExecutor` provides comprehensive configuration options for Kubernetes-based Ray deployments. + +#### Basic Configuration + +```python +from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup + +executor = KubeRayExecutor( + namespace="ml-team", + ray_version="2.43.0", + image="anyscale/ray:2.43.0-py312-cu125", + head_cpu="4", + head_memory="12Gi", + worker_groups=[ + KubeRayWorkerGroup( + group_name="worker", + replicas=2, + gpus_per_worker=8, + cpu_per_worker="16", + memory_per_worker="64Gi", + ) + ], +) +``` + +#### Advanced Configuration + +```python +executor = KubeRayExecutor( + namespace="ml-team", + ray_version="2.43.0", + image="anyscale/ray:2.43.0-py312-cu125", + head_cpu="8", + head_memory="32Gi", + + # Worker group configuration + worker_groups=[ + KubeRayWorkerGroup( + group_name="gpu-workers", + replicas=4, + gpus_per_worker=8, + cpu_per_worker="32", + memory_per_worker="128Gi", + min_replicas=2, + max_replicas=8, + ), + KubeRayWorkerGroup( + group_name="cpu-workers", + replicas=2, + cpu_per_worker="16", + memory_per_worker="64Gi", + ) + ], + + # Volume management + volume_mounts=[ + {"name": "workspace", "mountPath": "/workspace"}, + {"name": "datasets", "mountPath": "/datasets"}, + ], + volumes=[ + { + "name": "workspace", + "persistentVolumeClaim": {"claimName": "ml-workspace-pvc"}, + }, + { + "name": "datasets", + "persistentVolumeClaim": {"claimName": "datasets-pvc"}, + } + ], + + # Environment configuration + env_vars={ + "UV_PROJECT_ENVIRONMENT": 
"/home/ray/venvs/driver", + "NEMO_RL_VENV_DIR": "/home/ray/venvs", + "HF_HOME": "/workspace/hf_cache", + "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7", + }, + + # Security and scheduling + spec_kwargs={ + "schedulerName": "runai-scheduler", + "priorityClassName": "high-priority", + }, + container_kwargs={ + "securityContext": { + "allowPrivilegeEscalation": False, + "runAsUser": 1000, + "runAsGroup": 1000, + "fsGroup": 1000, + } + }, + + # Resource management + reuse_volumes_in_worker_groups=True, + enable_in_tree_autoscaling=True, + autoscaling_mode="Default", +) +``` + +### Complete KubeRay Workflow Example + +```python +from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup +from nemo_run.run.ray.cluster import RayCluster +from nemo_run.run.ray.job import RayJob + +# 1. Configure KubeRay executor with production settings +executor = KubeRayExecutor( + namespace="ml-production", + ray_version="2.43.0", + image="anyscale/ray:2.43.0-py312-cu125", + head_cpu="8", + head_memory="32Gi", + worker_groups=[ + KubeRayWorkerGroup( + group_name="gpu-workers", + replicas=4, + gpus_per_worker=8, + cpu_per_worker="32", + memory_per_worker="128Gi", + min_replicas=2, + max_replicas=8, + ) + ], + volume_mounts=[{"name": "workspace", "mountPath": "/workspace"}], + volumes=[{ + "name": "workspace", + "persistentVolumeClaim": {"claimName": "ml-workspace-pvc"}, + }], + env_vars={ + "UV_PROJECT_ENVIRONMENT": "/home/ray/venvs/driver", + "HF_HOME": "/workspace/hf_cache", + }, +) + +# 2. Pre-start commands for environment setup +pre_ray_start = [ + "pip install uv", + "echo 'unset RAY_RUNTIME_ENV_HOOK' >> /home/ray/.bashrc", + "mkdir -p /workspace/hf_cache", +] + +# 3. Deploy persistent cluster for development +cluster = RayCluster(name="ml-dev-cluster", executor=executor) +cluster.start( + timeout=900, + pre_ray_start_commands=pre_ray_start, + wait_until_ready=True +) + +# 4. 
Expose Ray dashboard for monitoring +cluster.port_forward(port=8265, target_port=8265, wait=False) +print("Ray dashboard available at: http://localhost:8265") + +# 5. Submit training job to the cluster +job = RayJob(name="training-job-001", executor=executor) +job.start( + command="uv run python train.py --config configs/train.yaml", + workdir="/workspace/project/", + runtime_env_yaml="/workspace/project/runtime_env.yaml", + pre_ray_start_commands=pre_ray_start, +) + +# 6. Monitor job execution +job.logs(follow=True) + +# 7. Clean up resources +cluster.stop() +``` + +### KubeRay Best Practices + +#### Resource Management +- **Autoscaling**: Enable autoscaling for variable workloads +- **Resource Limits**: Set appropriate CPU/memory limits to prevent resource exhaustion +- **GPU Scheduling**: Use GPU-aware schedulers for optimal GPU utilization + +#### Volume Management +- **Persistent Storage**: Use PVCs for data persistence across job restarts +- **Code Synchronization**: Leverage automatic workdir synchronization for seamless development +- **Cache Management**: Mount dedicated volumes for model caches and datasets + +#### Security Configuration +- **RBAC**: Implement proper role-based access control +- **Network Policies**: Restrict network access between pods +- **Security Contexts**: Configure appropriate security contexts for containers + +## Slurm Integration + +Slurm integration enables Ray clusters on traditional HPC systems, leveraging existing job scheduling infrastructure and resource management. + +### Slurm Architecture + +Slurm-based Ray clusters utilize Slurm's job scheduling capabilities: + +- **Array Jobs**: Ray clusters are deployed as Slurm array jobs +- **Resource Allocation**: Leverages Slurm's native resource management +- **SSH Tunneling**: Remote access via SSH tunnels to login nodes + +### Slurm Executor Configuration + +The `SlurmExecutor` provides configuration options for HPC environments. 
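The array-job model described above reduces to a per-node role decision at launch time. A simplified sketch of that dispatch (the launch logic NeMo Run generates is more involved, also handling ports, readiness checks, and logging):

```python
def ray_start_command(node_rank: int, head_addr: str, port: int = 6379) -> str:
    """Node 0 of the Slurm allocation hosts the Ray head; all others join as workers."""
    if node_rank == 0:
        return f"ray start --head --port={port}"
    return f"ray start --address={head_addr}:{port}"

# The rank would come from an environment variable such as SLURM_NODEID inside the job
print(ray_start_command(0, "node001"))  # ray start --head --port=6379
print(ray_start_command(2, "node001"))  # ray start --address=node001:6379
```

The executor options below control the shape of this allocation: how many nodes, which partition, and how the container and tunnel wrap the launch.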
+ +#### Basic Configuration + +```python +from nemo_run.core.execution.slurm import SlurmExecutor, SSHTunnel + +# SSH tunnel configuration for remote access +ssh = SSHTunnel( + host="login.cluster.com", + user="username", + job_dir="/scratch/username/runs", + identity="~/.ssh/id_ed25519", +) + +executor = SlurmExecutor( + account="gpu-dept", + partition="a100", + nodes=2, + gpus_per_node=8, + time="04:00:00", + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + tunnel=ssh, +) +``` + +#### Advanced Configuration + +```python +from pathlib import Path +from nemo_run.core.execution.slurm import SlurmExecutor, SlurmJobDetails, SSHTunnel + +# Enhanced SSH tunnel with custom configuration +ssh = SSHTunnel( + host="login.cluster.com", + user="username", + job_dir="/scratch/username/runs", + identity="~/.ssh/id_ed25519", + port=22, + timeout=30, +) + +# Custom job details for enhanced logging +class CustomJobDetails(SlurmJobDetails): + @property + def stdout(self) -> Path: + assert self.folder + return Path(self.folder) / "slurm_stdout.log" + + @property + def stderr(self) -> Path: + assert self.folder + return Path(self.folder) / "slurm_stderr.log" + +executor = SlurmExecutor( + # Slurm job configuration + account="gpu-dept", + partition="a100", + nodes=4, + gpus_per_node=8, + gres="gpu:8", + time="08:00:00", + qos="high", + + # Container configuration + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + container_mounts=[ + "/scratch:/scratch", + "/home:/home", + "/datasets:/datasets" + ], + + # Environment variables + env_vars={ + "HF_HOME": "/scratch/hf_cache", + "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7", + "NCCL_DEBUG": "INFO", + }, + + # SSH tunnel configuration + tunnel=ssh, + + # Custom job details + job_details=CustomJobDetails(), + + # Additional Slurm options + slurm_options={ + "mail-type": "ALL", + "mail-user": "user@example.com", + "exclusive": None, # Flag without value + } +) +``` + +### Complete Slurm Workflow Example + +```python +import os +from 
pathlib import Path + +import nemo_run as run +from nemo_run.core.execution.slurm import SlurmExecutor, SSHTunnel +from nemo_run.run.ray.cluster import RayCluster +from nemo_run.run.ray.job import RayJob + +# 1. Configure SSH tunnel for remote cluster access +ssh = SSHTunnel( + host="login.hpc.cluster.com", + user="jdoe", + job_dir="/scratch/jdoe/runs", + identity="~/.ssh/id_ed25519", +) + +# 2. Configure Slurm executor for large-scale training +executor = SlurmExecutor( + account="gpu-dept", + partition="a100", + nodes=8, + gpus_per_node=8, + gres="gpu:8", + time="12:00:00", + qos="high", + container_image="nvcr.io/nvidia/pytorch:24.05-py3", + container_mounts=[ + "/scratch:/scratch", + "/home:/home", + "/datasets:/datasets" + ], + env_vars={ + "HF_HOME": "/scratch/hf_cache", + "CUDA_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7", + "NCCL_DEBUG": "INFO", + "NCCL_IB_DISABLE": "0", + }, + tunnel=ssh, +) + +# 3. Environment setup commands +pre_ray_start = [ + "pip install uv", + "mkdir -p /scratch/hf_cache", + "export NCCL_DEBUG=INFO", +] + +# 4. Deploy Ray cluster on Slurm +cluster = RayCluster(name="hpc-training-cluster", executor=executor) +cluster.start( + timeout=1800, + pre_ray_start_commands=pre_ray_start, + wait_until_ready=True +) + +# 5. Set up port forwarding for dashboard access +cluster.port_forward(port=8265, target_port=8265) +print("Ray dashboard available at: http://localhost:8265") + +# 6. Submit distributed training job +job = RayJob(name="distributed-training", executor=executor) +job.start( + command="uv run python train.py --config configs/distributed.yaml", + workdir="/scratch/jdoe/project/", + pre_ray_start_commands=pre_ray_start, +) + +# 7. Monitor training progress +job.logs(follow=True) + +# 8. 
Clean up resources +cluster.stop() +``` + +### Slurm Best Practices + +#### Resource Optimization +- **Partition Selection**: Choose appropriate partitions based on resource requirements +- **QoS Configuration**: Use appropriate quality of service levels for job priority +- **Node Allocation**: Optimize node allocation for specific workload requirements + +#### Network Configuration +- **NCCL Settings**: Configure NCCL for optimal inter-node communication +- **InfiniBand**: Enable InfiniBand for high-performance networking +- **Firewall Rules**: Ensure proper network access for Ray cluster communication + +#### Storage Management +- **Scratch Space**: Utilize scratch directories for temporary data +- **Persistent Storage**: Use home directories for code and configuration files +- **Data Locality**: Mount datasets close to compute resources + +## API Reference + +### RayCluster API + +The `RayCluster` class provides comprehensive cluster management capabilities. + +#### Core Methods + +```python +# Cluster lifecycle management +cluster = RayCluster(name="my-cluster", executor=executor) + +# Start cluster with configuration +cluster.start( + timeout=600, # Maximum wait time in seconds + wait_until_ready=True, # Block until cluster is ready + pre_ray_start_commands=[ # Commands to run before Ray starts + "pip install -r requirements.txt", + "mkdir -p /workspace/data" + ] +) + +# Check cluster status +status = cluster.status(display=True) # Display status and return info + +# Access Ray dashboard +cluster.port_forward( + port=8265, # Local port + target_port=8265, # Remote port + wait=False # Don't block on port forwarding +) + +# Stop and clean up cluster +cluster.stop() +``` + +#### Advanced Methods + +```python +# Get cluster configuration +config = cluster.get_config() + +# Scale worker groups +cluster.scale_worker_group("worker", replicas=4) + +# Get cluster logs +logs = cluster.get_logs() + +# Check cluster health +health = cluster.health_check() +``` + +### 
RayJob API + +The `RayJob` class provides job submission and management capabilities. + +#### Core Methods + +```python +# Job lifecycle management +job = RayJob(name="my-job", executor=executor) + +# Submit job to cluster +job.start( + command="python train.py --config config.yaml", # Job command + workdir="/workspace/project/", # Working directory + runtime_env_yaml="/path/to/runtime_env.yaml", # Runtime environment + pre_ray_start_commands=[ # Pre-start commands + "pip install -r requirements.txt" + ] +) + +# Check job status +status = job.status() + +# Stream job logs +job.logs(follow=True, tail=100) # Follow logs, show last 100 lines + +# Stop job execution +job.stop() +``` + +#### Advanced Methods + +```python +# Get job configuration +config = job.get_config() + +# Get job metrics +metrics = job.get_metrics() + +# Submit multiple jobs +jobs = [] +for i in range(5): + job = RayJob(name=f"job-{i}", executor=executor) + job.start(command=f"python script.py --seed {i}") + jobs.append(job) + +# Wait for all jobs to complete +for job in jobs: + job.logs(follow=True) +``` + +## Advanced Configuration + +### Runtime Environment Management + +Ray runtime environments provide isolated execution contexts for jobs. + +```python +# Runtime environment configuration +runtime_env = { + "working_dir": "/workspace/project", + "pip": { + "packages": ["torch", "transformers", "datasets"] + }, + "env_vars": { + "CUDA_VISIBLE_DEVICES": "0,1,2,3", + "HF_HOME": "/workspace/hf_cache" + }, + "container": { + "image": "anyscale/ray:2.43.0-py312-cu125" + } +} + +# Save runtime environment to file +import yaml +with open("runtime_env.yaml", "w") as f: + yaml.dump(runtime_env, f) + +# Use in job submission +job.start( + command="python train.py", + runtime_env_yaml="runtime_env.yaml" +) +``` + +### Custom Resource Scheduling + +Configure custom resource requirements for specialized workloads. + +```python +# Custom resource configuration for KubeRay +executor = KubeRayExecutor( + # ... 
other configuration ... + worker_groups=[ + KubeRayWorkerGroup( + group_name="specialized-workers", + replicas=2, + gpus_per_worker=8, + custom_resources={ + "nvidia.com/mig-1g.5gb": 1, + "nvidia.com/mig-3g.20gb": 2, + } + ) + ] +) +``` + +### Monitoring and Observability + +Implement comprehensive monitoring for Ray clusters and jobs. + +```python +# Enable detailed logging +import logging +logging.basicConfig(level=logging.DEBUG) + +# Monitor cluster metrics +cluster = RayCluster(name="monitored-cluster", executor=executor) +cluster.start() + +# Access Ray dashboard metrics +cluster.port_forward(port=8265, target_port=8265) + +# Monitor job progress +job = RayJob(name="monitored-job", executor=executor) +job.start(command="python train.py") + +# Stream logs with custom formatting +for line in job.logs(follow=True): + if "loss" in line: + print(f"Training loss: {line}") +``` + +## Troubleshooting + +### Common Issues and Solutions + +#### Cluster Startup Failures + +**Issue**: Cluster fails to start within timeout period +**Solutions**: +- Increase timeout value in `cluster.start(timeout=1800)` +- Check resource availability in target partition/namespace +- Verify network connectivity and firewall rules +- Review pre-start commands for errors + +#### Job Submission Failures + +**Issue**: Jobs fail to submit or execute +**Solutions**: +- Verify cluster is in ready state before job submission +- Check runtime environment configuration +- Ensure working directory exists and is accessible +- Review job command syntax and dependencies + +#### Performance Issues + +**Issue**: Poor distributed training performance +**Solutions**: +- Configure NCCL settings for optimal communication +- Verify GPU topology and network configuration +- Use appropriate batch sizes and gradient accumulation +- Monitor resource utilization and bottlenecks + +#### Network Connectivity + +**Issue**: Ray dashboard or job communication failures +**Solutions**: +- Verify port forwarding 
configuration +- Check firewall rules and network policies +- Ensure proper DNS resolution +- Review SSH tunnel configuration for Slurm deployments + +### Debugging Techniques + +#### Log Analysis + +```python +# Enable verbose logging +import logging +logging.getLogger("nemo_run").setLevel(logging.DEBUG) + +# Collect detailed logs +cluster_logs = cluster.get_logs() +job_logs = job.get_logs() + +# Analyze log patterns +for log in cluster_logs: + if "ERROR" in log: + print(f"Cluster error: {log}") +``` + +#### Health Checks + +```python +# Perform comprehensive health check +health_status = cluster.health_check() +print(f"Cluster health: {health_status}") + +# Check individual components +head_health = cluster.check_head_node() +worker_health = cluster.check_worker_nodes() +``` + +## Integration Patterns + +### CI/CD Integration + +Integrate Ray jobs into continuous integration and deployment pipelines. + +```python +# GitHub Actions workflow example +def run_training_job(): + executor = KubeRayExecutor( + namespace="ci-cd", + worker_groups=[KubeRayWorkerGroup(group_name="worker", replicas=1, gpus_per_worker=4)] + ) + + job = RayJob(name="ci-training", executor=executor) + job.start( + command="python train.py --config configs/ci.yaml", + workdir="./", + ) + + # Wait for completion and check exit code + job.logs(follow=True) + if job.status() != "SUCCEEDED": + raise Exception("Training job failed") +``` + +### Multi-Environment Deployment + +Deploy Ray applications across different environments with consistent configuration. 
+ +```python +# Environment-specific configuration +environments = { + "dev": { + "namespace": "ml-dev", + "replicas": 1, + "gpus_per_worker": 2, + }, + "staging": { + "namespace": "ml-staging", + "replicas": 2, + "gpus_per_worker": 4, + }, + "production": { + "namespace": "ml-prod", + "replicas": 4, + "gpus_per_worker": 8, + } +} + +def deploy_to_environment(env_name): + config = environments[env_name] + executor = KubeRayExecutor( + namespace=config["namespace"], + worker_groups=[KubeRayWorkerGroup( + group_name="worker", + replicas=config["replicas"], + gpus_per_worker=config["gpus_per_worker"] + )] + ) + + cluster = RayCluster(name=f"{env_name}-cluster", executor=executor) + cluster.start() + return cluster +``` + +### Custom CLI Applications + +Build custom command-line interfaces for Ray cluster and job management. + +```python +import argparse +import sys +from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup +from nemo_run.run.ray.cluster import RayCluster +from nemo_run.run.ray.job import RayJob + +def create_cluster(args): + executor = KubeRayExecutor( + namespace=args.namespace, + worker_groups=[KubeRayWorkerGroup( + group_name="worker", + replicas=args.replicas, + gpus_per_worker=args.gpus + )] + ) + + cluster = RayCluster(name=args.name, executor=executor) + cluster.start() + print(f"Cluster {args.name} started successfully") + +def submit_job(args): + executor = KubeRayExecutor(namespace=args.namespace) + job = RayJob(name=args.name, executor=executor) + job.start(command=args.command, workdir=args.workdir) + print(f"Job {args.name} submitted successfully") + +def main(): + parser = argparse.ArgumentParser(description="Ray Cluster and Job Manager") + subparsers = parser.add_subparsers(dest="command") + + # Cluster management + cluster_parser = subparsers.add_parser("cluster", help="Manage clusters") + cluster_parser.add_argument("--name", required=True, help="Cluster name") + cluster_parser.add_argument("--namespace", 
default="default", help="Kubernetes namespace") + cluster_parser.add_argument("--replicas", type=int, default=1, help="Number of workers") + cluster_parser.add_argument("--gpus", type=int, default=1, help="GPUs per worker") + + # Job management + job_parser = subparsers.add_parser("job", help="Submit jobs") + job_parser.add_argument("--name", required=True, help="Job name") + job_parser.add_argument("--command", required=True, help="Job command") + job_parser.add_argument("--workdir", default="./", help="Working directory") + job_parser.add_argument("--namespace", default="default", help="Kubernetes namespace") + + args = parser.parse_args() + + if args.command == "cluster": + create_cluster(args) + elif args.command == "job": + submit_job(args) + else: + parser.print_help() + sys.exit(1) + +if __name__ == "__main__": + main() +``` + +## Performance Optimization + +### Resource Allocation Strategies + +Optimize resource allocation for different workload types. + +```python +# CPU-intensive workloads +cpu_executor = KubeRayExecutor( + worker_groups=[KubeRayWorkerGroup( + group_name="cpu-workers", + replicas=8, + cpu_per_worker="16", + memory_per_worker="64Gi" + )] +) + +# GPU-intensive workloads +gpu_executor = KubeRayExecutor( + worker_groups=[KubeRayWorkerGroup( + group_name="gpu-workers", + replicas=4, + gpus_per_worker=8, + cpu_per_worker="32", + memory_per_worker="128Gi" + )] +) + +# Mixed workloads +mixed_executor = KubeRayExecutor( + worker_groups=[ + KubeRayWorkerGroup( + group_name="gpu-workers", + replicas=2, + gpus_per_worker=8, + ), + KubeRayWorkerGroup( + group_name="cpu-workers", + replicas=4, + cpu_per_worker="16", + ) + ] +) +``` + +### Network Optimization + +Configure network settings for optimal distributed training performance. 
+ +```python +# Optimize NCCL settings for high-performance networking +env_vars = { + "NCCL_DEBUG": "INFO", + "NCCL_IB_DISABLE": "0", + "NCCL_IB_HCA": "mlx5_0", + "NCCL_IB_SL": "0", + "NCCL_IB_TC": "41", + "NCCL_IB_QPS_PER_CONNECTION": "4", + "NCCL_IB_TIMEOUT": "23", + "NCCL_IB_RETRY_CNT": "7", + "NCCL_IB_PKEY": "0xffff", + "NCCL_IB_USE_INLINE": "1", + "NCCL_IB_ADAPTIVE_ROUTING": "1", +} + +executor = KubeRayExecutor( + env_vars=env_vars, + # ... other configuration +) +``` + +### Memory Management + +Optimize memory usage for large-scale training workloads. + +```python +# Configure memory-efficient settings +env_vars = { + "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:128", + "CUDA_LAUNCH_BLOCKING": "1", + "TORCH_USE_CUDA_DSA": "1", +} + +# Use gradient checkpointing for memory efficiency +training_command = """ +python train.py \ + --config configs/train.yaml \ + --gradient_checkpointing \ + --max_memory_MB 8192 +""" +``` + +--- + +This comprehensive guide covers all aspects of Ray distributed computing with NeMo Run, from basic concepts to advanced optimization techniques. The unified API across Kubernetes and Slurm environments enables seamless distributed computing workflows regardless of the underlying infrastructure. diff --git a/docs/index.md b/docs/index.md deleted file mode 100644 index ae906b9b..00000000 --- a/docs/index.md +++ /dev/null @@ -1,152 +0,0 @@ ---- -description: "Explore comprehensive documentation for our software platform, including tutorials, feature guides, and deployment instructions." -tags: ["overview", "quickstart", "getting-started"] -categories: ["getting-started"] ---- - -(template-home)= - -# {{ product_name }} Documentation - -Welcome to the {{ product_name_short }} documentation. - -## Introduction to {{ product_name_short }} - -Learn about the {{ product_name_short }}, how it works at a high level, and its key features. 
- -## Featureset Workflows - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Feature Set A -:link: feature-set-a -:link-type: ref -:link-alt: Feature Set A documentation home - -Comprehensive tools and workflows for data processing and analysis. -Get started with our core feature set. -::: - -:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Feature Set B -:link: feature-set-b -:only: not ga -:link-type: ref -:link-alt: Feature Set B documentation home - -Advanced integration capabilities and specialized processing tools. -Available in Early Access. -::: - -:::: - -## Tutorial Highlights - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Feature Set A Tutorials -:link: feature-set-a-tutorials -:link-type: ref -:link-alt: Feature Set A tutorial collection - -Step-by-step guides for getting the most out of Feature Set A -::: - -:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Feature Set B Tutorials -:link: feature-set-b-tutorials -:only: not ga -:link-type: ref -:link-alt: Feature Set B tutorial collection - -Hands-on tutorials for Feature Set B workflows -::: - -:::: - -## Install & Deploy Guides - -::::{grid} 1 1 1 2 -:gutter: 1 1 1 2 - -:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Deployment Patterns -:link: admin-deployment -:link-type: ref -:link-alt: Deployment and configuration guides - -Learn how to deploy and configure your environment -::: - -:::{grid-item-card} {octicon}`plug;1.5em;sd-mr-1` Integration Patterns -:link: admin-integrations -:link-type: ref -:link-alt: Integration and connection guides - -Connect with external systems and services -::: - -:::: - ---- - -::::{toctree} -:hidden: -Home -:::: - -::::{toctree} -:hidden: -:caption: About -:maxdepth: 1 -about/index.md -about/key-features.md -about/concepts/index.md -about/release-notes/index.md -:::: - -::::{toctree} -:hidden: -:caption: Get Started -:maxdepth: 2 - -get-started/index.md -Feature Set A 
Quickstart
-Feature Set B Quickstart :only: not ga
-::::
-
-::::{toctree}
-:hidden:
-:caption: (GA) Feature Set A
-:maxdepth: 2
-feature-set-a/index.md
-Tutorials
-feature-set-a/category-a/index.md
-::::
-
-::::{toctree}
-:hidden:
-:caption: (EA) Feature Set B
-:maxdepth: 2
-:only: not ga
-
-feature-set-b/index.md
-Tutorials
-feature-set-b/category-a/index.md
-::::
-
-::::{toctree}
-:hidden:
-:caption: Admin
-:maxdepth: 2
-admin/index.md
-Deployment
-Integrations
-CI/CD
-::::
-
-::::{toctree}
-:hidden:
-:caption: Reference
-:maxdepth: 2
-reference/index.md
-::::
diff --git a/docs/nemo-run-index.md b/docs/nemo-run-index.md
new file mode 100644
index 00000000..b6f5aded
--- /dev/null
+++ b/docs/nemo-run-index.md
@@ -0,0 +1,254 @@
+---
+description: "NeMo Run documentation - Streamline ML experiment configuration, execution and management"
+tags: ["nemo-run", "ml", "experiments", "configuration", "execution", "management"]
+categories: ["documentation"]
+---
+
+(nemo-run-home)=
+
+# NeMo Run Documentation
+
+NeMo Run is a powerful tool designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments. NeMo Run has three core responsibilities:
+
+::::{grid} 1 1 1 3
+:gutter: 1 1 1 2
+
+:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration
+:link: guides/configuration
+:link-type: doc
+:link-alt: Configuration guide
+
+Learn how to configure your ML experiments and environments.
+:::
+
+:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Execution
+:link: guides/execution
+:link-type: doc
+:link-alt: Execution guide
+
+Execute your configured experiments across various computing environments.
+:::
+
+:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Management
+:link: guides/management
+:link-type: doc
+:link-alt: Management guide
+
+Manage and monitor your running experiments and results.
+:::
+
+::::
+
+This is the typical order NeMo Run users follow to set up and launch experiments.
+ +--- + +## About + +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`info;1.5em;sd-mr-1` About NeMo Run +:link: about/index +:link-type: doc +:link-alt: About NeMo Run + +Overview of NeMo Run's core concepts and architecture. +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Key Features +:link: about/key-features +:link-type: doc +:link-alt: Key features + +Explore the technical capabilities and implementation details. +::: + +:::{grid-item-card} {octicon}`star;1.5em;sd-mr-1` Why Choose NeMo Run +:link: about/why-nemo-run +:link-type: doc +:link-alt: Why choose NeMo Run + +Learn why NeMo Run is the preferred choice for ML experiment management. +::: + +:::: + +--- + +## Get Started + +::::{grid} 1 1 1 2 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`rocket;1.5em;sd-mr-1` Get Started with NeMo Run +:link: get-started/index +:link-type: doc +:link-alt: Get Started with NeMo Run + +Overview and quick start options for NeMo Run +::: + +:::{grid-item-card} {octicon}`download;1.5em;sd-mr-1` Install NeMo Run +:link: get-started/install +:link-type: doc +:link-alt: Installation guide + +Install NeMo Run and optional dependencies for your environment +::: + +:::{grid-item-card} {octicon}`zap;1.5em;sd-mr-1` Quickstart +:link: get-started/quickstart +:link-type: doc +:link-alt: Quickstart Guide + +Complete guide to install and run your first ML experiment in minutes +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Tutorials and Learning Resources +:link: get-started/tutorials +:link-type: doc +:link-alt: Tutorial collection + +Learn NeMo Run with hands-on tutorials and examples +::: +:::: + +--- + +## Guides + +::::{grid} 1 1 1 2 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`gear;1.5em;sd-mr-1` Configuration +:link: guides/configuration +:link-type: doc +:link-alt: Configuration guide + +Learn how to configure your ML experiments with type-safe, flexible configuration management. 
+::: + +:::{grid-item-card} {octicon}`play;1.5em;sd-mr-1` Execution +:link: guides/execution +:link-type: doc +:link-alt: Execution guide + +Execute your experiments across local, Docker, Slurm, Kubernetes, and cloud environments. +::: + +:::{grid-item-card} {octicon}`graph;1.5em;sd-mr-1` Management +:link: guides/management +:link-type: doc +:link-alt: Management guide + +Manage and monitor your experiments with comprehensive tracking and reproducibility. +::: + +:::{grid-item-card} {octicon}`server;1.5em;sd-mr-1` Ray Clusters and Jobs +:link: guides/ray +:link-type: doc +:link-alt: Ray Clusters and Jobs + +Deploy and manage Ray clusters and jobs for scalable distributed computing. +::: + +:::{grid-item-card} {octicon}`package;1.5em;sd-mr-1` Packaging Strategies +:link: guides/packaging +:link-type: doc +:link-alt: Packaging Strategies + +Deploy your code using Git archives, pattern matching, or hybrid packaging strategies. +::: +:::: + +--- + + + +--- + +## References + +::::{grid} 1 1 1 2 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` CLI Reference +:link: reference/cli +:link-type: doc +:link-alt: CLI Reference + +Complete command-line interface documentation and usage examples. +::: + +:::{grid-item-card} {octicon}`question;1.5em;sd-mr-1` FAQs +:link: reference/faqs +:link-type: doc +:link-alt: Frequently Asked Questions + +Find answers to common questions about NeMo Run. +::: + +:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Troubleshooting +:link: reference/troubleshooting +:link-type: doc +:link-alt: Troubleshooting Guide + +Solutions for common issues and error messages. +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Glossary +:link: reference/glossary +:link-type: doc +:link-alt: NeMo Run Glossary + +Technical glossary of NeMo Run-specific concepts and terminology. 
+::: + +:::: + +--- + +::::{toctree} +:hidden: +:caption: About +:maxdepth: 2 +about/index +about/key-features +about/why-nemo-run +:::: + +::::{toctree} +:hidden: +:caption: Get Started +:maxdepth: 2 +get-started/index +get-started/install +get-started/quickstart +get-started/tutorials +:::: + +::::{toctree} +:hidden: +:caption: Guides +:maxdepth: 2 +guides/index +guides/configuration +guides/execution +guides/management +guides/packaging +guides/ray +:::: + + + +::::{toctree} +:hidden: +:caption: References +:maxdepth: 2 +reference/index +reference/cli +reference/faqs +reference/troubleshooting +reference/glossary +:::: diff --git a/docs/reference/cli.md b/docs/reference/cli.md new file mode 100644 index 00000000..7e06e4db --- /dev/null +++ b/docs/reference/cli.md @@ -0,0 +1,798 @@ +--- +description: "Complete guide to NeMo Run CLI, including entry point creation, factory functions, and CLI argument parsing." +tags: ["cli", "command-line", "entry points", "factories", "arguments"] +categories: ["guides"] +--- + +(cli)= + +# NeMo Run CLI + +NeMo Run provides a powerful command-line interface that transforms Python functions into sophisticated CLI tools with rich argument parsing, type safety, and seamless integration with execution backends. This system enables AI researchers to create reproducible, configurable experiments that can be executed across diverse computing environments. + +## Overview + +The CLI system transforms Python functions into command-line tools with: + +- **Rich Argument Parsing**: Support for complex Python types, nested configurations, and operations +- **Factory Functions**: Reusable configuration components for complex objects +- **Executor Integration**: Seamless integration with execution backends (Docker, Slurm, Kubernetes, etc.) 
+- **Interactive Mode**: REPL-style interaction for configuration exploration +- **Configuration Export**: Export configurations to YAML, TOML, or JSON formats +- **Type Safety**: Full type checking and validation with intelligent error correction +- **Error Correction**: Intelligent suggestions for typos and parameter names + +## Core Concepts + +For detailed definitions of CLI terms and concepts, see the [Glossary](glossary.md). + +### Entry Points + +Entry points are Python functions decorated with `@run.cli.entrypoint` that become accessible as CLI commands. They support: + +- **Parameter Discovery**: Automatic exposure of function parameters as CLI arguments +- **Type Safety**: Type hints are used for validation and parsing +- **Default Values**: Sensible defaults for rapid prototyping +- **Help Text**: Rich documentation and usage information + +### Factory Functions + +Factory functions (decorated with `@run.cli.factory`) create reusable configuration components: + +- **Object Creation**: Instantiate complex objects from CLI arguments +- **Type Registration**: Register factories for specific types or parameters +- **Default Factories**: Provide sensible defaults for complex configurations +- **Composition**: Chain and nest factories for complex workflows + +### Run Context + +The `RunContext` manages execution settings and provides: + +- **Executor Configuration**: Specify execution environments +- **Plugin Management**: Configure and manage execution plugins +- **Execution Control**: Dry run, detached execution, and log management +- **Configuration Export**: Export configurations in various formats + +## Basic CLI Usage + +### Create Entry Points + +Use the `@run.cli.entrypoint` decorator to expose Python functions as CLI commands: + +```python +import nemo_run as run + +@run.cli.entrypoint +def train_model( + model_name: str = "gpt2", + learning_rate: float = 0.001, + batch_size: int = 32, + epochs: int = 10 +): + """Train a machine learning model with 
specified parameters.""" + try: + print(f"Training {model_name} with lr={learning_rate}, batch_size={batch_size}") + # Your training logic here + return {"accuracy": 0.95, "loss": 0.1} + except Exception as e: + print(f"Training failed: {e}") + return {"accuracy": 0.0, "loss": float('inf')} + +@run.cli.entrypoint +def evaluate_model( + model_path: str, + test_data: str, + metrics: list[str] = ["accuracy", "precision", "recall"] +): + """Evaluate a trained model on test data.""" + try: + print(f"Evaluating {model_path} on {test_data}") + print(f"Metrics: {metrics}") + # Your evaluation logic here + return {"accuracy": 0.92, "precision": 0.89, "recall": 0.91} + except Exception as e: + print(f"Evaluation failed: {e}") + return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0} +``` + +### CLI Argument Syntax + +NeMo Run supports rich Python-like argument syntax: + +```bash +# Basic arguments +python script.py model_name=gpt2 learning_rate=0.001 + +# Nested attribute setting +python script.py model.hidden_size=512 data.batch_size=64 + +# List and dictionary arguments +python script.py layers=[128,256,512] config={'dropout': 0.1} + +# Operations on arguments +python script.py counter+=1 rate*=2 flags|=0x1 + +# Type casting +python script.py int_arg=42 float_arg=3.14 bool_arg=true + +# None values +python script.py optional_arg=None + +# Factory function usage +python script.py model=create_model(hidden_size=256) +``` + +## Factory Functions + +Factory functions allow you to create reusable configuration components: + +```python +import nemo_run as run +from dataclasses import dataclass +from typing import List + +@dataclass +class OptimizerConfig: + type: str + lr: float + betas: List[float] + weight_decay: float + +@run.cli.factory +def create_optimizer(optimizer_type: str = "adam", lr: float = 0.001) -> OptimizerConfig: + """Create an optimizer configuration.""" + if optimizer_type == "adam": + return OptimizerConfig( + type="adam", + lr=lr, + betas=[0.9, 0.999], + 
weight_decay=1e-5 + ) + elif optimizer_type == "sgd": + return OptimizerConfig( + type="sgd", + lr=lr, + betas=[0.0, 0.0], + weight_decay=1e-4 + ) + else: + raise ValueError(f"Unknown optimizer: {optimizer_type}") + +@run.cli.entrypoint +def train_with_optimizer( + model: str, + optimizer: OptimizerConfig = create_optimizer(optimizer_type="adam", lr=0.001) +): + """Train a model with a configurable optimizer.""" + print(f"Training {model} with {optimizer.type} optimizer") + print(f"Learning rate: {optimizer.lr}") +``` + +### Factory Registration Patterns + +#### Type-Based Registration + +Register factories for specific types: + +```python +@run.cli.factory +def create_transformer_model() -> run.Config[TransformerModel]: + """Create a default transformer model configuration.""" + return run.Config( + TransformerModel, + hidden_size=512, + num_layers=6, + num_attention_heads=8 + ) + +@run.cli.factory +def create_cnn_model() -> run.Config[CNNModel]: + """Create a default CNN model configuration.""" + return run.Config( + CNNModel, + channels=[64, 128, 256], + kernel_sizes=[3, 3, 3] + ) +``` + +#### Parameter-Specific Registration + +Register factories for specific parameters: + +```python +@run.cli.factory(target=train_model, target_arg="model") +def create_default_model() -> run.Config[BaseModel]: + """Default model factory for train_model function.""" + return create_transformer_model() + +@run.cli.factory(target=train_model, target_arg="optimizer") +def create_default_optimizer() -> OptimizerConfig: + """Default optimizer factory for train_model function.""" + return create_optimizer(optimizer_type="adam", lr=0.001) +``` + +### Use Factories in CLI + +```bash +# Use default factory +python script.py train_with_optimizer model=resnet50 + +# Override factory parameters +python script.py train_with_optimizer model=resnet50 optimizer=create_optimizer(optimizer_type=sgd,lr=0.01) + +# Nested factory usage +python script.py train_with_optimizer model=resnet50 
optimizer.lr=0.005 + +# Use type-based factories +python script.py train_model model=create_transformer_model optimizer=create_optimizer +``` + +## Executor Integration + +### Default Executors + +Set default executors for your entry points: + +```python +import nemo_run as run + +@run.cli.entrypoint( + default_executor=run.DockerExecutor( + container_image="pytorch/pytorch:latest", + num_gpus=1 + ) +) +def train_model(model: str, epochs: int = 10): + """Train a model using Docker executor by default.""" + print(f"Training {model} for {epochs} epochs") +``` + +### CLI Executor Override + +```bash +# Use default executor +python script.py train_model model=resnet50 + +# Override with different executor +python script.py train_model model=resnet50 executor=run.LocalExecutor() + +# Configure executor parameters +python script.py train_model model=resnet50 executor=run.SlurmExecutor(partition=gpu,time=02:00:00) + +# Override executor settings +python script.py train_model model=resnet50 executor.num_gpus=4 executor.memory=32g +``` + +## Advanced CLI Features + +### Interactive Mode (REPL) + +Start an interactive session to explore configurations: + +```bash +python script.py train_model --repl +``` + +This opens an interactive Python shell where you can: + +```python +>>> model_config = create_optimizer(optimizer_type="adam", lr=0.001) +>>> print(model_config) +OptimizerConfig(type='adam', lr=0.001, betas=[0.9, 0.999], weight_decay=1e-05) + +>>> # Modify and test configurations +>>> model_config.lr = 0.0001 +>>> print(model_config) +OptimizerConfig(type='adam', lr=0.0001, betas=[0.9, 0.999], weight_decay=1e-05) + +>>> # Test complex configurations +>>> training_config = run.Config( +... TrainingJob, +... model=create_transformer_model(), +... optimizer=model_config, +... epochs=100 +... 
) +>>> print(training_config) +``` + +### Configuration Export + +Export your configurations to various formats: + +```bash +# Export to YAML +python script.py train_model model=resnet50 --to-yaml config.yaml + +# Export to TOML +python script.py train_model model=resnet50 --to-toml config.toml + +# Export to JSON +python script.py train_model model=resnet50 --to-json config.json + +# Export specific sections +python script.py train_model model=resnet50 --to-yaml config.yaml --section model +``` + +### Dry Run Mode + +Preview what would be executed without actually running: + +```bash +python script.py train_model model=resnet50 --dryrun +``` + +### Detached Execution + +Run tasks in the background: + +```bash +python script.py train_model model=resnet50 --detach +``` + +### Tail Logs + +Follow logs in real-time: + +```bash +python script.py train_model model=resnet50 --tail-logs +``` + +## CLI Options Reference + +### Global Options + +| Option | Description | Example | +|--------|-------------|---------| +| `--name, -n` | Name of the run | `--name my_experiment` | +| `--direct` | Execute directly (no executor) | `--direct` | +| `--dryrun` | Preview without execution | `--dryrun` | +| `--factory, -f` | Use predefined factory | `--factory my_factory` | +| `--load, -l` | Load factory from directory | `--load ./configs/` | +| `--yaml, -y` | Load from YAML file | `--yaml config.yaml` | +| `--repl, -r` | Enter interactive mode | `--repl` | +| `--detach` | Run in background | `--detach` | +| `--yes, -y` | Skip confirmation | `--yes` | +| `--tail-logs` | Follow logs | `--tail-logs` | +| `--verbose, -v` | Enable verbose logging | `--verbose` | + +### Output Options + +| Option | Description | Example | +|--------|-------------|---------| +| `--to-yaml` | Export to YAML | `--to-yaml output.yaml` | +| `--to-toml` | Export to TOML | `--to-toml output.toml` | +| `--to-json` | Export to JSON | `--to-json output.json` | + +### Rich Output Options + +| Option | Description | 
Example | +|--------|-------------|---------| +| `--rich-exceptions` | Enable rich exception formatting | `--rich-exceptions` | +| `--rich-traceback-short` | Short traceback format | `--rich-traceback-short` | +| `--rich-traceback-full` | Full traceback format | `--rich-traceback-full` | +| `--rich-show-locals` | Show local variables in exceptions | `--rich-show-locals` | +| `--rich-hide-locals` | Hide local variables in exceptions | `--rich-hide-locals` | +| `--rich-theme` | Color theme (dark/light/monochrome) | `--rich-theme dark` | + +## Advanced Patterns + +### Complex Configuration Management + +```python +from dataclasses import dataclass +from typing import Optional, List, Dict, Any +import nemo_run as run + +@dataclass +class ModelConfig: + architecture: str + hidden_size: int + num_layers: int + dropout: float = 0.1 + +@dataclass +class DataConfig: + batch_size: int + num_workers: int + data_path: str + augmentation: Dict[str, Any] = None + +@dataclass +class TrainingConfig: + learning_rate: float + epochs: int + optimizer: str + scheduler: Optional[str] = None + +@run.cli.factory +def create_model_config( + architecture: str = "transformer", + hidden_size: int = 512, + num_layers: int = 6 +) -> ModelConfig: + """Create a standardized model configuration.""" + return ModelConfig( + architecture=architecture, + hidden_size=hidden_size, + num_layers=num_layers + ) + +@run.cli.factory +def create_data_config( + batch_size: int = 32, + data_path: str = "./data" +) -> DataConfig: + """Create a standardized data configuration.""" + return DataConfig( + batch_size=batch_size, + num_workers=4, + data_path=data_path + ) + +@run.cli.factory +def create_training_config( + learning_rate: float = 0.001, + epochs: int = 100 +) -> TrainingConfig: + """Create a standardized training configuration.""" + return TrainingConfig( + learning_rate=learning_rate, + epochs=epochs, + optimizer="adam" + ) + +@run.cli.entrypoint( + help="Complete training pipeline with comprehensive 
configuration", + default_executor=run.DockerExecutor(container_image="pytorch/pytorch:latest") +) +def train_pipeline( + model: ModelConfig = create_model_config(), + data: DataConfig = create_data_config(), + training: TrainingConfig = create_training_config(), + experiment_name: str = "default_experiment", + seed: int = 42 +): + """Complete training pipeline with comprehensive configuration.""" + print(f"Training {model.architecture} model") + print(f"Hidden size: {model.hidden_size}, Layers: {model.num_layers}") + print(f"Batch size: {data.batch_size}, Data path: {data.data_path}") + print(f"Learning rate: {training.learning_rate}, Epochs: {training.epochs}") + print(f"Experiment: {experiment_name}, Seed: {seed}") + + # Your training logic here + return {"status": "completed", "accuracy": 0.95} +``` + +### Experiment Entry Points + +Create entry points for multi-task experiments: + +```python +@run.cli.entrypoint(type="experiment") +def multi_stage_training( + ctx: run.cli.RunContext, + pretrain: run.Partial[train_pipeline] = run.Partial( + train_pipeline, + model=create_model_config(architecture="transformer", hidden_size=768), + training=create_training_config(epochs=50) + ), + finetune: run.Partial[train_pipeline] = run.Partial( + train_pipeline, + model=create_model_config(architecture="transformer", hidden_size=768), + training=create_training_config(epochs=10, learning_rate=1e-5) + ) +): + """Multi-stage training experiment.""" + # Pretrain stage + pretrain_result = ctx.run(pretrain) + + # Finetune stage + finetune_result = ctx.run(finetune) + + return { + "pretrain": pretrain_result, + "finetune": finetune_result + } +``` + +## Best Practices + +### 1. 
Use Descriptive Help Text + +```python +@run.cli.entrypoint( + help="Train a machine learning model with configurable hyperparameters and advanced features" +) +def train_model(model: str, epochs: int = 10): + """Train a machine learning model with comprehensive logging and validation.""" + pass +``` + +### 2. Provide Sensible Defaults + +```python +@run.cli.entrypoint +def train_model( + model: str, + learning_rate: float = 0.001, # Sensible default for most models + batch_size: int = 32, # Good balance of memory and speed + epochs: int = 10, # Reasonable training duration + seed: int = 42 # Reproducible default +): + pass +``` + +### 3. Use Type Hints Consistently + +```python +from typing import Optional, List, Dict, Any + +@run.cli.entrypoint +def process_data( + input_path: str, + output_path: str, + batch_size: int = 32, + num_workers: int = 4, + config: Optional[Dict[str, Any]] = None +): + pass +``` + +### 4. Create Reusable Factories + +```python +@run.cli.factory +def create_model_config( + model_type: str = "transformer", + hidden_size: int = 512, + num_layers: int = 6, + dropout: float = 0.1 +) -> ModelConfig: + """Create a reusable model configuration with validation.""" + if hidden_size <= 0: + raise ValueError("hidden_size must be positive") + if num_layers <= 0: + raise ValueError("num_layers must be positive") + if not 0 <= dropout <= 1: + raise ValueError("dropout must be between 0 and 1") + + return ModelConfig( + model_type=model_type, + hidden_size=hidden_size, + num_layers=num_layers, + dropout=dropout + ) +``` + +### 5. 
Handle Complex Configurations
+
+```python
+@run.cli.entrypoint
+def complex_training(
+    model_config: ModelConfig = create_model_config(),
+    optimizer_config: OptimizerConfig = create_optimizer(),
+    data_config: DataConfig = create_data_config(),
+    training_config: TrainingConfig = create_training_config()
+):
+    """Handle complex nested configurations with validation."""
+    # Validate configuration values before training starts
+    if data_config.batch_size <= 0:
+        raise ValueError("batch_size must be positive")
+    if not 0 <= model_config.dropout <= 1:
+        raise ValueError("dropout must be between 0 and 1")
+
+    print(f"Model: {model_config}")
+    print(f"Optimizer: {optimizer_config}")
+    print(f"Data: {data_config}")
+    print(f"Training: {training_config}")
+```
+
+### 6. Use Configuration Export for Reproducibility
+
+```bash
+# Export configuration for reproducibility
+python script.py train_pipeline --to-yaml experiment_config.yaml
+
+# Load and modify configuration
+python script.py train_pipeline --yaml experiment_config.yaml model.hidden_size=1024
+```
+
+## Troubleshooting
+
+### Common Issues
+
+1. **Type Conversion Errors**
+
+   ```bash
+   # Error: Cannot convert string to int
+   python script.py batch_size=32.5  # Should be int
+
+   # Fix: Use explicit type
+   python script.py batch_size=32
+   ```
+
+2. **Nested Configuration Issues**
+
+   ```bash
+   # Error: Cannot set nested attribute
+   python script.py model.config.hidden_size=512
+
+   # Fix: Use factory or direct assignment
+   python script.py model=create_model(hidden_size=512)
+   ```
+
+3. **Factory Resolution Issues**
+
+   ```bash
+   # Error: Factory not found
+   python script.py optimizer=unknown_factory()
+
+   # Fix: Use registered factory
+   python script.py optimizer=create_optimizer()
+   ```
+
+4.
**Executor Configuration Issues** + + ```bash + # Error: Invalid executor parameter + python script.py executor=run.SlurmExecutor(invalid_param=value) + + # Fix: Check executor documentation for valid parameters + python script.py executor=run.SlurmExecutor(partition=gpu,time=02:00:00) + ``` + +### Debug Strategies + +1. **Use `--verbose` for detailed output** + + ```bash + python script.py train_model --verbose + ``` + +2. **Use `--dryrun` to preview execution** + + ```bash + python script.py train_model --dryrun + ``` + +3. **Use `--repl` for interactive debugging** + + ```bash + python script.py train_model --repl + ``` + +4. **Export configurations to inspect them** + + ```bash + python script.py train_model --to-yaml debug_config.yaml + ``` + +5. **Check available factories** + + ```python + import nemo_run as run + factories = run.cli.list_factories() + print(factories) + ``` + +## Examples + +### Complete Example: Advanced Training Pipeline + +```python +import nemo_run as run +from dataclasses import dataclass +from typing import Optional, List, Dict, Any + +@dataclass +class ModelConfig: + name: str + hidden_size: int + num_layers: int + dropout: float = 0.1 + +@dataclass +class OptimizerConfig: + type: str + lr: float + weight_decay: float = 1e-5 + betas: List[float] = None + +@dataclass +class DataConfig: + path: str + batch_size: int + num_workers: int = 4 + +@run.cli.factory +def create_model(name: str, hidden_size: int = 512) -> ModelConfig: + return ModelConfig(name=name, hidden_size=hidden_size, num_layers=6) + +@run.cli.factory +def create_optimizer(optimizer: str = "adam", lr: float = 0.001) -> OptimizerConfig: + betas = [0.9, 0.999] if optimizer == "adam" else [0.0, 0.0] + return OptimizerConfig(type=optimizer, lr=lr, betas=betas) + +@run.cli.factory +def create_data(data_path: str, batch_size: int = 32) -> DataConfig: + return DataConfig(path=data_path, batch_size=batch_size) + +@run.cli.entrypoint( + help="Advanced training pipeline with 
comprehensive configuration and validation", + default_executor=run.DockerExecutor(container_image="pytorch/pytorch:latest") +) +def advanced_training_pipeline( + model: ModelConfig = create_model(name="transformer"), + optimizer: OptimizerConfig = create_optimizer(optimizer="adam", lr=0.001), + data: DataConfig = create_data(data_path="./data", batch_size=32), + epochs: int = 10, + save_path: str = "./models", + experiment_name: str = "default_experiment", + seed: int = 42, + debug: bool = False +): + """Advanced training pipeline with comprehensive configuration.""" + print(f"=== Training Configuration ===") + print(f"Model: {model.name} (hidden_size={model.hidden_size}, layers={model.num_layers})") + print(f"Optimizer: {optimizer.type} (lr={optimizer.lr}, weight_decay={optimizer.weight_decay})") + print(f"Data: {data.path} (batch_size={data.batch_size}, workers={data.num_workers})") + print(f"Training: {epochs} epochs, save_path={save_path}") + print(f"Experiment: {experiment_name}, Seed: {seed}, Debug: {debug}") + + # Validation + if model.hidden_size <= 0: + raise ValueError("hidden_size must be positive") + if optimizer.lr <= 0: + raise ValueError("learning_rate must be positive") + if data.batch_size <= 0: + raise ValueError("batch_size must be positive") + + # Your training logic here + return { + "status": "completed", + "accuracy": 0.95, + "loss": 0.1, + "config": { + "model": model, + "optimizer": optimizer, + "data": data, + "training": { + "epochs": epochs, + "save_path": save_path, + "experiment_name": experiment_name, + "seed": seed + } + } + } +``` + +### CLI Usage Examples + +```bash +# Use defaults +python script.py advanced_training_pipeline + +# Customize components +python script.py advanced_training_pipeline \ + model=create_model(name=resnet50,hidden_size=1024) \ + optimizer=create_optimizer(optimizer=sgd,lr=0.01) \ + data=create_data(data_path=/path/to/data,batch_size=64) \ + epochs=20 \ + save_path=/path/to/save \ + 
experiment_name=resnet_experiment + +# Export configuration +python script.py advanced_training_pipeline --to-yaml config.yaml + +# Dry run +python script.py advanced_training_pipeline --dryrun + +# Interactive mode +python script.py advanced_training_pipeline --repl + +# Detached execution +python script.py advanced_training_pipeline --detach + +# Follow logs +python script.py advanced_training_pipeline --tail-logs +``` + +This CLI system provides a powerful and flexible way to interact with NeMo Run, making it easy to create command-line tools for your ML workflows while maintaining the full power of Python's type system and configuration capabilities. The system is designed to be intuitive for AI researchers while providing the robustness and reproducibility needed for serious research workflows. diff --git a/docs/reference/faqs.md b/docs/reference/faqs.md new file mode 100644 index 00000000..a7fdc4a9 --- /dev/null +++ b/docs/reference/faqs.md @@ -0,0 +1,475 @@ +--- +description: "Frequently asked questions about NeMo Run" +tags: ["FAQs", "troubleshooting", "help", "configuration", "execution", "management"] +categories: ["help"] +--- + +(faqs)= + +# Frequently Asked Questions + +This section provides comprehensive answers to common questions about NeMo Run, organized by functionality and complexity. + +## Getting Started + +### **Q:** What is NeMo Run and when should I use it? + +**A:** NeMo Run is a Python framework designed for distributed machine learning experimentation and execution. 
It provides: + +- **Unified Configuration Management**: Use `run.Config` and `run.Partial` for type-safe, serializable configurations +- **Multi-Platform Execution**: Support for local, Slurm, Kubernetes, Docker, and cloud platforms +- **Automatic Code Packaging**: Git-based packaging for reproducible experiments +- **Built-in Logging and Monitoring**: Centralized experiment tracking and log retrieval + +Use NeMo Run when you need to: + +- Run ML experiments across different compute environments +- Ensure reproducibility through configuration management +- Scale experiments from local development to production clusters +- Maintain consistent logging and monitoring across platforms + +### **Q:** How do I install and set up NeMo Run? + +**A:** Install NeMo Run using pip from the GitHub repository: + +```bash +pip install git+https://github.com/NVIDIA-NeMo/Run.git +``` + +Basic setup involves: + +1. **Configure your environment**: + + ```bash + export NEMORUN_HOME=~/.nemo_run # Optional: customize home directory + ``` + +2. **Initialize a project**: + + ```python + import nemo_run as run + + # Basic configuration + config = run.Config(YourModel, learning_rate=0.001, batch_size=32) + ``` + +3. **Choose an executor**: + + ```python + # Local execution + executor = run.LocalExecutor() + + # Remote execution + executor = run.SlurmExecutor( + partition="gpu", + nodes=1, + gpus_per_node=4 + ) + ``` + +## Configuration and Serialization + +### **Q:** What's the difference between `run.Config` and `run.Partial`? 
+ +**A:** Both are configuration primitives, but they serve different purposes: + +- **`run.Config`**: Creates a complete configuration for a class or function + + ```python + model_config = run.Config( + MyModel, + hidden_size=512, + num_layers=6, + dropout=0.1 + ) + ``` + +- **`run.Partial`**: Creates a partially applied function with some arguments fixed + + ```python + train_fn = run.Partial( + train_model, + optimizer="adam", + learning_rate=0.001 + ) + # Can be called later with additional arguments + ``` + +### **Q:** How do I handle serialization errors with complex objects? + +**A:** NeMo Run uses Fiddle's serialization system, which requires all configuration values to be JSON-serializable. Common issues and solutions: + +**Problem**: Non-serializable objects like `pathlib.Path`: + +```python +# ❌ This will fail +config = run.Config(MyClass, data_path=Path("/tmp/data")) +``` + +**Solution**: Wrap non-serializable objects in `run.Config`: + +```python +# ✅ This works +config = run.Config(MyClass, data_path=run.Config(Path, "/tmp/data")) +``` + +**Problem**: Custom classes or complex objects: + +```python +# ❌ This will fail +config = run.Config(MyClass, custom_obj=MyCustomObject()) +``` + +**Solution**: Create factory functions or use `run.Partial`: + +```python +# ✅ Using a factory function +def create_custom_obj(param1, param2): + return MyCustomObject(param1, param2) + +config = run.Config(MyClass, custom_obj=run.Config(create_custom_obj, "value1", "value2")) +``` + +### **Q:** How do I validate my configuration before execution? 
+ +**A:** Use the serialization round-trip test to validate configurations: + +```python +from nemo_run.config import ZlibJSONSerializer + +def validate_config(config): + """Validate that a configuration can be serialized and deserialized.""" + serializer = ZlibJSONSerializer() + + try: + # Serialize and deserialize + serialized = serializer.serialize(config) + deserialized = serializer.deserialize(serialized) + + # Verify equality + assert config == deserialized, "Configuration changed during serialization" + print("✅ Configuration is valid") + return True + + except Exception as e: + print(f"❌ Configuration validation failed: {e}") + return False + +# Usage +config = run.Config(MyModel, param1="value1", param2=run.Config(Path, "/tmp")) +validate_config(config) +``` + +### **Q:** How do I handle control flow in configurations? + +**A:** NeMo Run's `@run.autoconvert` decorator doesn't support control flow constructs like list comprehensions or loops. Here are the recommended approaches: + +**Problem**: Control flow in `@run.autoconvert`: + +```python +# ❌ This will fail +@run.autoconvert +def create_dataset(): + return Dataset( + paths=[Path(f"data_{i}.txt") for i in range(10)], # List comprehension + weights=[1.0 for _ in range(10)] + ) +``` + +**Solution 1**: Use `run.Config` directly: + +```python +# ✅ Direct configuration +def create_dataset_config(): + return run.Config( + Dataset, + paths=[run.Config(Path, f"data_{i}.txt") for i in range(10)], + weights=[1.0 for _ in range(10)] + ) +``` + +**Solution 2**: Use factory functions: + +```python +# ✅ Factory function approach +def create_paths(num_files): + return [run.Config(Path, f"data_{i}.txt") for i in range(num_files)] + +def create_dataset_config(): + return run.Config( + Dataset, + paths=create_paths(10), + weights=[1.0 for _ in range(10)] + ) +``` + +## Execution and Backends + +### **Q:** How does NeMo Run package my code for remote execution? 
+ +**A:** NeMo Run uses packagers to bundle your code and dependencies for remote execution: + +- **`run.Packager`**: Pass-through packager (no modification) +- **`run.GitArchivePackager`**: Packages Git repository using `git archive` +- **`run.PatternPackager`**: Packages files based on pattern matching +- **`run.HybridPackager`**: Combines multiple packagers + +Example: + +```python +# Git-based packaging +packager = run.GitArchivePackager(subpath="src") +executor = run.SlurmExecutor(packager=packager) + +# Pattern-based packaging +packager = run.PatternPackager( + include_pattern="src/**", + relative_path=os.getcwd() +) +executor = run.DockerExecutor(packager=packager) +``` + +### **Q:** What execution backends does NeMo Run support? + +**A:** NeMo Run supports multiple execution backends: + +- **`run.LocalExecutor`**: Local process execution +- **`run.DockerExecutor`**: Docker container execution +- **`run.SlurmExecutor`**: HPC cluster execution via Slurm +- **`run.SkypilotExecutor`**: Multi-cloud execution with cost optimization +- **`run.DGXCloudExecutor`**: NVIDIA DGX Cloud execution +- **`run.LeptonExecutor`**: Lepton cloud execution + +Each executor supports different packaging strategies and resource configurations. + +### **Q:** How do I configure Slurm execution? + +**A:** Configure Slurm execution with the `run.SlurmExecutor`: + +```python +executor = run.SlurmExecutor( + partition="gpu", + nodes=2, + gpus_per_node=4, + time="02:00:00", + job_name="my_experiment", + account="my_account" +) +``` + +For SSH tunnel access from your local machine: + +```python +from nemo_run.core.execution.slurm import SSHTunnel + +tunnel = SSHTunnel( + host="cluster.example.com", + username="your_username", + port=22 +) + +executor = run.SlurmExecutor( + partition="gpu", + tunnel=tunnel +) +``` + +### **Q:** How does NeMo Run handle logging and experiment tracking? 
+ +**A:** NeMo Run provides centralized logging and experiment management: + +- **Automatic Log Capture**: All stdout/stderr is captured and stored +- **Experiment Metadata**: Configuration, timestamps, and status are tracked +- **Log Retrieval**: Access logs through the experiment interface +- **Centralized Storage**: All data stored in `NEMORUN_HOME` + +Example: + +```python +# Launch experiment +experiment = run.submit(task_config, executor) + +# Monitor progress +print(f"Status: {experiment.status}") +print(f"Logs: {experiment.logs}") + +# Retrieve logs +logs = run.get_logs(experiment) +print(f"Exit code: {logs.exit_code}") +print(f"Output: {logs.stdout}") +``` + +## Troubleshooting + +### **Q:** How do I debug configuration issues? + +**A:** Use these debugging techniques: + +1. **Validate Configuration**: + ```python + from nemo_run.config import ZlibJSONSerializer + serializer = ZlibJSONSerializer() + + try: + serialized = serializer.serialize(config) + print("✅ Configuration is serializable") + except Exception as e: + print(f"❌ Serialization failed: {e}") + ``` + +2. **Check Type Hints**: + ```python + import inspect + sig = inspect.signature(MyFunction) + print(sig.parameters) + ``` + +3. **Test CLI Parsing**: + ```bash + python script.py --help + python script.py --dryrun param1=value1 + ``` + +### **Q:** How do I resolve dependency conflicts? + +**A:** Resolve dependency conflicts with these approaches: + +1. **Use Virtual Environment**: + ```bash + python -m venv nemo-run-env + source nemo-run-env/bin/activate + pip install git+https://github.com/NVIDIA-NeMo/Run.git + ``` + +2. **Install with --no-deps**: + ```bash + pip install git+https://github.com/NVIDIA-NeMo/Run.git --no-deps + pip install inquirerpy catalogue fabric fiddle torchx typer rich jinja2 cryptography networkx omegaconf leptonai packaging toml + ``` + +3. 
**Use Compatible Versions**: + ```bash + pip install "torchx>=0.7.0" "fiddle>=0.3.0" "omegaconf>=2.3.0" + ``` + +### **Q:** How do I recover from experiment failures? + +**A:** NeMo Run provides several recovery mechanisms: + +1. **Check Experiment Status**: + ```python + experiment = run.get_experiment(experiment_id) + print(f"Status: {experiment.status}") + print(f"Error: {experiment.error}") + ``` + +2. **Retrieve Logs**: + ```python + logs = run.get_logs(experiment) + print(f"Exit code: {logs.exit_code}") + print(f"Error output: {logs.stderr}") + ``` + +3. **Restart with Modified Config**: + ```python + # Modify configuration based on error + new_config = config.clone() + new_config.learning_rate = 0.0001 + + # Restart experiment + new_experiment = run.submit(new_config, executor) + ``` + +### **Q:** How do I manage NeMo Run home directory issues? + +**A:** NeMo Run home directory issues can be resolved by: + +1. **Check Current Home**: + ```bash + echo $NEMORUN_HOME + ls ~/.nemo_run/experiments/ + ``` + +2. **Reset Home Directory**: + ```bash + export NEMORUN_HOME=~/.nemo_run + mkdir -p ~/.nemo_run + ``` + +3. **Recover from Backup**: + ```bash + export NEMORUN_HOME=/path/to/original/home + cp -r ~/.nemo_run.backup ~/.nemo_run + ``` + +## Advanced Topics + +### **Q:** How do I create custom executors? + +**A:** Create custom executors by inheriting from `run.Executor`: + +```python +from nemo_run.core.execution.base import Executor + +class CustomExecutor(Executor): + def __init__(self, custom_param: str): + self.custom_param = custom_param + + def submit(self, task_config, **kwargs): + # Custom submission logic + pass + + def get_status(self, job_id): + # Custom status checking + pass +``` + +### **Q:** How do I integrate with external experiment tracking? 
+ +**A:** Integrate with external tracking systems using plugins: + +```python +from nemo_run.run.plugin import ExperimentPlugin + +class WandBPlugin(ExperimentPlugin): + def on_experiment_start(self, experiment): + import wandb + wandb.init(project="my_project") + + def on_experiment_end(self, experiment): + import wandb + wandb.finish() + +# Use plugin +executor = run.LocalExecutor(plugins=[WandBPlugin()]) +``` + +### **Q:** How do I optimize performance for large-scale experiments? + +**A:** Optimize performance with these strategies: + +1. **Use Efficient Packagers**: + ```python + # Use Git packager for large codebases + packager = run.GitArchivePackager(subpath="src") + ``` + +2. **Configure Resource Limits**: + ```python + executor = run.SlurmExecutor( + partition="gpu", + nodes=4, + gpus_per_node=8, + memory="64GB" + ) + ``` + +3. **Use Parallel Execution**: + ```python + experiment = run.Experiment() + for config in configs: + experiment.add_task(config, executor) + experiment.launch(sequential=False) + ``` + +This FAQ covers the most common questions about NeMo Run. For more detailed information, refer to the specific guides for [Configuration](../guides/configuration), [CLI Reference](cli.md), [Execution](../guides/execution), and [Management](../guides/management). diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md new file mode 100644 index 00000000..ede80c7b --- /dev/null +++ b/docs/reference/glossary.md @@ -0,0 +1,179 @@ +--- +description: "Technical glossary of NeMo Run-specific concepts, advanced ML infrastructure terms, and implementation details for experienced AI developers." 
+tags: ["glossary", "terminology", "definitions", "concepts", "reference", "technical", "infrastructure"] +categories: ["reference"] +--- + +(glossary)= + +# NeMo Run Technical Glossary + +This glossary defines NeMo Run-specific technical concepts, advanced ML infrastructure terminology, and implementation details for experienced AI developers and ML engineers. + +## A + +### AppDef (Application Definition) + +A TorchX specification that defines distributed ML application topology, including role definitions, resource specifications, and execution parameters. NeMo Run uses AppDef internally to represent packaged training tasks and inference jobs. + +### Auto-config + +A Fiddle feature that automatically generates configurations for ML models, training functions, and data pipelines based on their signatures and type hints. Simplifies experiment setup by inferring configuration parameters. + +## C + +### Config + +**`run.Config`** is a NeMo Run primitive that creates type-safe configurations for ML models, training functions, and data pipelines using Fiddle. Ensures reproducibility and validation of experiment parameters. + +### Context Manager + +NeMo Run's `Experiment` class implements Python's context manager protocol, requiring `with Experiment() as exp:` syntax. This ensures proper resource management and experiment lifecycle control. + +## D + +### DGXCloudExecutor + +An executor that submits ML workloads to NVIDIA DGX Cloud clusters via REST API. Supports multi-node distributed training with automatic authentication, project/cluster discovery, and PVC-based storage management. + +### Direct Execution + +A NeMo Run execution mode where tasks run in the same process without packaging or remote execution. Used for debugging and local development with `direct=True` parameter. + +### Dryrun + +A NeMo Run execution mode that shows what would be executed without actually running the task. Useful for debugging configurations and understanding execution plans. 
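The *Context Manager* entry above can be sketched with a plain-Python toy class that implements the same protocol (illustrative only — this is not the real `Experiment` implementation):

```python
# Toy analogy of `with Experiment() as exp:` -- not NeMo Run code.
class ToyExperiment:
    def __init__(self, title: str):
        self.title = title
        self.tasks = []
        self.closed = False

    def __enter__(self):
        # Setup happens here; NeMo Run would create experiment state on disk.
        return self

    def add(self, task: str) -> str:
        """Register a task and hand back a simple identifier."""
        self.tasks.append(task)
        return f"{self.title}.{len(self.tasks) - 1}"

    def __exit__(self, exc_type, exc, tb):
        # Cleanup runs even if the body raised, which is why the
        # context-manager form is required for lifecycle control.
        self.closed = True
        return False  # do not suppress exceptions

with ToyExperiment("demo") as exp:
    task_id = exp.add("train")

print(task_id)     # demo.0
print(exp.closed)  # True
```

The real class additionally sets up experiment directories and metadata on entry; the point here is only that `__enter__`/`__exit__` guarantee setup and cleanup around the `with` body.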
+
+## E
+
+### Execution Unit
+
+A NeMo Run concept consisting of a task configuration paired with an executor. This separation allows running the same task on different platforms and mixing tasks and executors.
+
+### Executor
+
+A NeMo Run component that defines how and where ML workloads execute. Handles resource allocation, environment setup, and job submission to different compute backends.
+
+### Experiment
+
+A **`run.Experiment`** is a NeMo Run object that manages multiple related ML tasks, hyperparameter sweeps, or model variants. Provides experiment-level coordination and metadata tracking.
+
+### Experiment ID
+
+A unique identifier for each ML experiment. Used for organizing checkpoints, logs, metrics, and artifacts across distributed training runs.
+
+## F
+
+### Fault Tolerance
+
+A launcher that provides automatic restart capabilities for distributed training. Handles node failures, network issues, and other transient errors common in large-scale ML training.
+
+### Fiddle
+
+A Python library for configuration management that provides type-safe, composable configurations. NeMo Run uses Fiddle as the foundation for ML experiment configuration.
+
+## G
+
+### GitArchivePackager
+
+A packager that uses `git archive` to package version-controlled ML code for remote execution. Ensures only committed changes are deployed and maintains repository structure.
+
+## H
+
+### HybridPackager
+
+A packager that combines multiple packaging strategies for complex ML codebases. Allows different packaging approaches for models, data processing, and utilities.
+
+## L
+
+### Launcher
+
+A component that determines how ML tasks execute within their environment. Common launchers include `torchrun` for distributed PyTorch training and `FaultTolerance` for resilient execution.
+
+### LeptonExecutor
+
+An executor that submits ML workloads to NVIDIA DGX Cloud Lepton clusters via Python SDK.
Supports resource shape-based scheduling, node group affinity, and automatic data movement between job storage and persistent volumes. + +## M + +### Metadata + +Information about ML experiments, jobs, and tasks automatically captured by NeMo Run. Includes hyperparameters, training metrics, environment details, and results. + +## N + +### NEMORUN_HOME + +The root directory where NeMo Run stores experiment metadata, logs, and artifacts. Defaults to `~/.nemo_run` and can be configured via environment variable. + +### NeMo Run + +A comprehensive Python framework for configuring, executing, and managing machine learning experiments across diverse computing environments. Built for AI developers with a focus on reproducibility and scalability. + +## P + +### Packager + +A component responsible for bundling ML code, models, and dependencies for remote execution. Supports various strategies for code deployment across different environments. + +### Partial + +**`run.Partial`** is a NeMo Run primitive that creates partially applied ML functions with fixed hyperparameters. Enables reusable training configurations with default parameters. + +### PatternPackager + +A packager that uses file patterns to selectively package ML code. Useful for large codebases where you need fine-grained control over what gets deployed. + +### Plugin + +An **ExperimentPlugin** that extends NeMo Run functionality for custom ML workflows. Can add monitoring, logging, or custom execution behavior. + +## R + +### Ray + +A distributed computing framework for ML workloads. NeMo Run integrates with Ray for scalable distributed training and hyperparameter tuning. + +### RayCluster + +A persistent Ray cluster for interactive ML development. Provides long-lived compute resources for iterative experimentation and model development. + +### RayJob + +An ephemeral Ray job for batch ML processing. Automatically terminates after completion, ideal for automated training pipelines and inference jobs. 
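The *Partial* entry above is conceptually close to the standard library's `functools.partial`, with the key difference that `run.Partial` records the target and its arguments as serializable configuration rather than binding them into a live callable. A plain-Python analogy:

```python
from functools import partial

def train(model: str, lr: float, epochs: int) -> str:
    return f"train {model} lr={lr} epochs={epochs}"

# functools.partial fixes some arguments now and defers the rest...
stdlib_style = partial(train, lr=0.001, epochs=10)
print(stdlib_style("gpt"))  # train gpt lr=0.001 epochs=10

# ...whereas a run.Partial-style object would instead record the target
# function and its arguments as data, so the configuration can be
# serialized, inspected, and shipped to a remote executor before the
# call is finally made.
```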
+ +### Reproducibility + +The ability to recreate exact ML experiment conditions and results. NeMo Run ensures reproducibility through comprehensive configuration management and metadata capture. + +### RunContext + +A NeMo Run CLI concept that manages execution settings, including executor configurations, plugins, and execution parameters for command-line interfaces. + +### run.run() + +A NeMo Run function for single task execution. Provides a simple interface for running configured functions with optional executors and plugins. + +## S + +### Script + +**`run.Script`** is a NeMo Run primitive for executing custom ML scripts and commands. Provides flexibility for legacy workflows or custom training pipelines. + +### SlurmExecutor + +An executor that submits ML jobs to Slurm clusters. Supports containerized execution and integration with high-performance computing environments. + +## T + +### Torchrun + +A launcher that uses PyTorch's `torchrun` command for distributed training. Handles process coordination, rendezvous, and distributed communication for multi-GPU training. + +### Tunnel + +A secure communication channel between the local NeMo Run client and remote execution environments. Supports both SSH tunnels and local tunnels for secure ML job submission. + +## U + +### UV + +A fast Python package manager that NeMo Run can use for dependency management. Provides reliable package installation for ML environments with complex dependency requirements. diff --git a/docs/reference/index.md b/docs/reference/index.md index 99bf6b77..7e0cb818 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -1,10 +1,76 @@ --- -author: lawrence lane -description: "Access comprehensive reference documentation including API specifications, configuration options, and technical details." 
-tags: ["reference", "api", "configuration", "specifications"] -categories: ["reference", "onboarding"] +description: "Reference documentation for NeMo Run including CLI reference, FAQs, and troubleshooting guides." +tags: ["reference", "cli", "faqs", "troubleshooting", "api"] +categories: ["reference"] --- -(ref-overview)= -# Overview +(reference)= +# NeMo Run References + +This section contains comprehensive reference documentation for NeMo Run. + +## Overview + +The reference section provides detailed documentation for NeMo Run, including command-line interface reference, frequently asked questions, troubleshooting guides, and other reference materials. + +## Reference Materials + +::::{grid} 1 1 1 3 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`terminal;1.5em;sd-mr-1` CLI Reference +:link: cli +:link-type: doc +:link-alt: CLI Reference + +Transform Python functions into CLI tools with rich argument parsing, factory functions, and executor integration. +::: + +:::{grid-item-card} {octicon}`code;1.5em;sd-mr-1` API Reference +:link: ../guides/ray#api-reference +:link-type: doc +:link-alt: API Reference + +Comprehensive API documentation for Ray clusters, jobs, and distributed computing components. +::: + +:::{grid-item-card} {octicon}`question;1.5em;sd-mr-1` FAQs +:link: faqs +:link-type: doc +:link-alt: Frequently Asked Questions + +Common questions about NeMo Run usage, configuration, execution, and troubleshooting. +::: + +:::: + +::::{grid} 1 1 1 2 +:gutter: 1 1 1 2 + +:::{grid-item-card} {octicon}`tools;1.5em;sd-mr-1` Troubleshooting +:link: troubleshooting +:link-type: doc +:link-alt: Troubleshooting Guide + +Solutions for common issues, error messages, debugging techniques, and performance optimization. +::: + +:::{grid-item-card} {octicon}`book;1.5em;sd-mr-1` Glossary +:link: glossary +:link-type: doc +:link-alt: NeMo Run Glossary + +Technical glossary of NeMo Run-specific concepts, ML infrastructure terms, and implementation details. 
+::: + +:::: + +## What You'll Find + +- **CLI Commands**: Complete reference for all command-line tools and options +- **Common Issues**: Solutions for frequently encountered problems +- **Error Messages**: Explanations and resolutions for error codes +- **Best Practices**: Recommended approaches for various use cases + +For comprehensive guides on configuration, execution, and management, see the [NeMo Run Guides](../guides/index). diff --git a/docs/reference/troubleshooting.md b/docs/reference/troubleshooting.md new file mode 100644 index 00000000..1bd52281 --- /dev/null +++ b/docs/reference/troubleshooting.md @@ -0,0 +1,501 @@ +--- +description: "Comprehensive troubleshooting guide for NeMo Run covering common issues, error messages, debugging techniques, and solutions." +tags: ["troubleshooting", "debugging", "errors", "solutions", "help", "support"] +categories: ["help"] +--- + +(troubleshooting)= + +# Troubleshooting NeMo Run + +This guide helps you diagnose and resolve common issues when using NeMo Run. It covers error messages, debugging techniques, and solutions for various scenarios. 
+ +## Quick Diagnostic Commands + +### Check NeMo Run Status + +Run these commands to quickly assess your NeMo Run installation: + +```bash +# Check NeMo Run installation +python -c "import nemo_run; print(nemo_run.__version__ if hasattr(nemo_run, '__version__') else 'Version not available')" + +# Check environment variables +echo $NEMORUN_HOME + +# Check Python environment +python -c "import nemo_run as run; print(dir(run))" +``` + +## Common Issues and Solutions + +### Installation Issues + +#### Package Installation Problems + +**Problem**: Unable to install NeMo Run from GitHub + +**Solution**: Use the correct installation method: + +```bash +# ✅ Correct installation +pip install git+https://github.com/NVIDIA-NeMo/Run.git + +# ❌ Incorrect (this package doesn't exist) +pip install nemo-run +``` + +**Problem**: Git installation fails + +**Solution**: Ensure Git is available and use HTTPS: + +```bash +# Check Git installation +git --version + +# Use HTTPS instead of SSH +pip install git+https://github.com/NVIDIA-NeMo/Run.git + +# Or install manually +git clone https://github.com/NVIDIA-NeMo/Run.git +cd Run +pip install . 
+``` + +#### Dependency Conflicts + +**Problem**: Version conflicts with dependencies + +**Solution**: Install with compatible versions: + +```bash +# Install with --no-deps and resolve manually +pip install git+https://github.com/NVIDIA-NeMo/Run.git --no-deps + +# Install core dependencies +pip install inquirerpy catalogue fabric fiddle torchx typer rich jinja2 cryptography networkx omegaconf leptonai packaging toml + +# Install optional dependencies +pip install "skypilot[kubernetes]>=0.9.2" +pip install "ray[kubernetes]" +``` + +### Configuration Issues + +#### Serialization Errors + +**Problem**: Configuration serialization fails + +**Solution**: Wrap non-serializable objects in `run.Config`: + +```python +# ❌ This will fail +partial = run.Partial(some_function, something=Path("/tmp")) + +# ✅ Correct: Wrap in run.Config +partial = run.Partial(some_function, something=run.Config(Path, "/tmp")) +``` + +**Problem**: Complex object serialization + +**Solution**: Use factory functions or `run.Partial`: + +```python +from nemo_run.config import ZlibJSONSerializer + +# Test serialization +serializer = ZlibJSONSerializer() +partial = run.Partial(some_function, something=run.Config(Path, "/tmp")) + +try: + serialized = serializer.serialize(partial) + print("✅ Configuration serializes successfully") +except Exception as e: + print(f"❌ Serialization failed: {e}") +``` + +#### Control Flow Issues + +**Problem**: Control flow constructs in `@run.autoconvert` + +**Solution**: Use `run.Config` directly or factory functions: + +```python +# ❌ This will fail +@run.autoconvert +def control_flow_config() -> run.Config[llm.PreTrainingDataModule]: + return run.Config( + llm.PreTrainingDataModule, + paths=[Path(f"some_doc_{i}") for i in range(10)], # List comprehension + weights=[1.0 for _ in range(10)] + ) + +# ✅ Correct: Use run.Config directly +def control_flow_config() -> run.Config[llm.PreTrainingDataModule]: + return run.Config( + llm.PreTrainingDataModule, + 
paths=[run.Config(Path, f"some_doc_{i}") for i in range(10)],
+        weights=[1.0 for _ in range(10)]
+    )
+```
+
+### Execution Issues
+
+#### Packager Problems
+
+**Problem**: Code not packaged correctly
+
+**Solution**: Check packager configuration:
+
+```python
+# Test packager
+packager = run.GitArchivePackager(subpath="src")
+executor = run.LocalExecutor(packager=packager)
+```
+
+Then verify the Git repository state from the shell:
+
+```bash
+git status
+git add .
+git commit -m "Test commit"
+```
+
+**Problem**: Files missing from package
+
+**Solution**: Use an appropriate packager:
+
+```python
+# For non-Git repositories
+packager = run.PatternPackager(
+    include_pattern="src/**",
+    relative_path=os.getcwd()
+)
+executor = run.DockerExecutor(packager=packager)
+```
+
+#### Executor Configuration Issues
+
+**Problem**: Slurm executor fails
+
+**Solution**: Check Slurm configuration:
+
+```python
+executor = run.SlurmExecutor(
+    partition="gpu",
+    nodes=1,
+    gpus_per_node=4,
+    time="02:00:00"
+)
+
+# Test with dry run
+experiment = run.submit(config, executor)
+experiment.dryrun = True
+```
+
+**Problem**: Docker executor fails
+
+**Solution**: Check Docker configuration:
+
+```python
+executor = run.DockerExecutor(
+    container_image="nvidia/pytorch:24.05-py3",
+    gpus="all"
+)
+
+# Test Docker daemon
+import docker
+client = docker.from_env()
+client.ping()
+```
+
+**Problem**: SkyPilot executor fails
+
+**Solution**: Check SkyPilot configuration:
+
+```python
+executor = run.SkypilotExecutor(
+    cluster_name="my-cluster",
+    region="us-west1"
+)
+```
+
+Verify the SkyPilot installation from the shell:
+
+```bash
+pip list | grep skypilot
+```
+
+### Logging and Monitoring Issues
+
+#### Log Retrieval Problems
+
+**Problem**: Cannot retrieve experiment logs
+
+**Solution**: Check experiment status and home directory:
+
+```python
+# Check experiment status
+experiment = run.get_experiment(experiment_id)
+print(f"Status: {experiment.status}")
+
+# Check logs
+logs = run.get_logs(experiment)
+print(f"Exit code: {logs.exit_code}")
+print(f"Output: {logs.stdout}") +``` + +**Problem**: NeMo Run home directory issues + +**Solution**: Check and fix home directory: + +```bash +# Check current home +echo $NEMORUN_HOME + +# Set correct home +export NEMORUN_HOME=~/.nemo_run + +# Create directory if missing +mkdir -p ~/.nemo_run +``` + +### Network and Connectivity Issues + +#### SSH Tunnel Problems + +**Problem**: SSH tunnel connection fails + +**Solution**: Check SSH configuration: + +```python +from nemo_run.core.execution.slurm import SSHTunnel + +tunnel = SSHTunnel( + host="cluster.example.com", + username="your_username", + port=22 +) + +# Test SSH connection +ssh -T your_username@cluster.example.com +``` + +**Problem**: Network timeout issues + +**Solution**: Configure network timeouts: + +```bash +# Set network timeouts +export NEMORUN_NETWORK_TIMEOUT=60 +export NEMORUN_MAX_CONNECTIONS=50 +``` + +## Debugging Techniques + +### Enable Debug Mode + +Enable comprehensive debugging: + +```bash +# Enable debug logging +export NEMORUN_DEBUG=true +export NEMORUN_LOG_LEVEL=DEBUG + +# Run with verbose output +python -c "import nemo_run; print('Debug mode enabled')" +``` + +### Configuration Validation + +Validate configurations before execution: + +```python +from nemo_run.config import ZlibJSONSerializer + +def validate_config(config): + """Validate configuration serialization.""" + serializer = ZlibJSONSerializer() + + try: + serialized = serializer.serialize(config) + deserialized = serializer.deserialize(serialized) + + if config == deserialized: + print("✅ Configuration is valid") + return True + else: + print("❌ Configuration changed during serialization") + return False + + except Exception as e: + print(f"❌ Configuration validation failed: {e}") + return False + +# Usage +validate_config(my_config) +``` + +### CLI Debugging + +Debug CLI issues: + +```bash +# Test CLI help +python script.py --help + +# Test with dry run +python script.py --dryrun param1=value1 + +# Test with verbose output 
+python script.py --verbose param1=value1 +``` + +### Executor Testing + +Test executor configurations: + +```python +# Test local executor +executor = run.LocalExecutor() +print("✅ Local executor created") + +# Test Docker executor +executor = run.DockerExecutor(container_image="python:3.9") +print("✅ Docker executor created") + +# Test Slurm executor +executor = run.SlurmExecutor(partition="cpu", time="00:10:00") +print("✅ Slurm executor created") +``` + +## Performance Issues + +### Resource Optimization + +**Problem**: High memory usage + +**Solution**: Configure memory limits: + +```bash +# Set memory limits +export NEMORUN_MAX_MEMORY=8GB +export NEMORUN_MEMORY_POOL_SIZE=2GB + +# Monitor memory usage +free -h +``` + +**Problem**: Slow execution + +**Solution**: Optimize configuration: + +```python +# Use efficient packager +packager = run.GitArchivePackager(subpath="src") + +# Configure resource limits +executor = run.SlurmExecutor( + partition="gpu", + nodes=2, + gpus_per_node=4, + memory="64GB" +) +``` + +### Network Optimization + +**Problem**: Slow network transfers + +**Solution**: Configure network settings: + +```bash +# Enable compression +export NEMORUN_COMPRESSION=true +export NEMORUN_CHUNK_SIZE=1MB + +# Configure timeouts +export NEMORUN_NETWORK_TIMEOUT=30 +export NEMORUN_KEEPALIVE=true +``` + +## Error Message Reference + +### Common Error Messages + +#### Import Errors + +``` +ModuleNotFoundError: No module named 'nemo_run' +``` +**Solution**: Install NeMo Run correctly: +```bash +pip install git+https://github.com/NVIDIA-NeMo/Run.git +``` + +#### Serialization Errors + +``` +TypeError: Object of type Path is not JSON serializable +``` +**Solution**: Wrap in `run.Config`: +```python +config = run.Config(MyClass, path=run.Config(Path, "/tmp")) +``` + +#### Executor Errors + +``` +ExecutorError: Failed to submit job +``` +**Solution**: Check executor configuration and connectivity. 
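For transient submission failures such as the `ExecutorError` above, a small retry wrapper can help separate flaky connectivity from genuine misconfiguration. This is an illustrative sketch — `submit_with_retry` and its parameters are not part of NeMo Run; pass in whatever submission callable you use (for example, the `run.submit` call shown earlier in this guide):

```python
import time

def submit_with_retry(submit_fn, config, executor, attempts=3, delay=5.0):
    """Call a submission function, retrying transient failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return submit_fn(config, executor)
        except Exception as exc:  # e.g. a failed-to-submit error
            last_error = exc
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(f"Submission failed after {attempts} attempts") from last_error
```

If all attempts fail with the same error, the problem is almost certainly configuration rather than connectivity.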
+ +#### Configuration Errors + +``` +ConfigurationError: Invalid configuration +``` +**Solution**: Validate configuration before execution. + +## Getting Help + +### Diagnostic Information + +When reporting issues, include this diagnostic information: + +```bash +# System information +python --version +pip --version +echo $NEMORUN_HOME + +# NeMo Run information +python -c "import nemo_run; print(f'Version: {nemo_run.__version__ if hasattr(nemo_run, \"__version__\") else \"Version not available\"}')" + +# Environment information +env | grep NEMORUN +``` + +### Reporting Issues + +When reporting issues to the NeMo Run team: + +1. **Include diagnostic information** (see above) +2. **Provide error messages** and stack traces +3. **Describe the steps** to reproduce the issue +4. **Include configuration files** (if applicable) +5. **Specify your environment** (OS, Python version, etc.) + +### Example Issue Report + +``` +NeMo Run Version: 1.0.0 +Python Version: 3.9.7 +OS: Ubuntu 20.04 +NEMORUN_HOME: ~/.nemo_run + +Error: Configuration serialization fails +Steps to reproduce: +1. Create configuration with Path object +2. Attempt to serialize +3. Get TypeError + +Error message: +TypeError: Object of type Path is not JSON serializable +``` + +This troubleshooting guide should help you resolve most common issues with NeMo Run. If you continue to experience problems, please report them with the information requested above. 
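The diagnostic shell commands above can also be collected by a small helper script (a sketch; the `nemo_run` import is guarded so the script degrades gracefully when the package is absent):

```python
import os
import platform

def collect_diagnostics() -> dict:
    """Gather the fields requested when reporting a NeMo Run issue."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
        "nemorun_home": os.environ.get("NEMORUN_HOME", "~/.nemo_run (default)"),
    }
    try:
        import nemo_run
        info["nemo_run"] = getattr(nemo_run, "__version__", "Version not available")
    except ImportError:
        info["nemo_run"] = "not installed"
    return info

if __name__ == "__main__":
    for key, value in collect_diagnostics().items():
        print(f"{key}: {value}")
```

Paste the script's output into your issue report alongside the error message and reproduction steps.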
diff --git a/examples/docker/hello_docker.py b/examples/docker/hello_docker.py deleted file mode 100644 index 3e97d482..00000000 --- a/examples/docker/hello_docker.py +++ /dev/null @@ -1,36 +0,0 @@ -import nemo_run as run - -if __name__ == "__main__": - inline_script = run.Script( - inline=""" -echo "Hello 1" -nvidia-smi -sleep 5 -""" - ) - inline_script_sleep = run.Script( - inline=""" -echo "Hello sleep" -sleep infinity -""" - ) - executor = run.DockerExecutor( - container_image="python:3.12", - num_gpus=-1, - runtime="nvidia", - ipc_mode="host", - shm_size="30g", - env_vars={"PYTHONUNBUFFERED": "1"}, - packager=run.Packager(), - ) - with run.Experiment("docker-experiment", executor=executor, log_level="INFO") as exp: - id1 = exp.add([inline_script, inline_script_sleep], tail_logs=False, name="task-1") - id2 = exp.add([inline_script, inline_script_sleep], tail_logs=False, name="task-2") - id3 = exp.add( - [inline_script, inline_script_sleep], - tail_logs=False, - name="task-3", - dependencies=[id1, id2], - ) - - exp.run(detach=False, tail_logs=True, sequential=False) diff --git a/examples/entrypoint/README.md b/examples/entrypoint/README.md deleted file mode 100644 index db853adc..00000000 --- a/examples/entrypoint/README.md +++ /dev/null @@ -1,403 +0,0 @@ -# NeMo Run CLI Entrypoints Tutorial - -## Introduction - -NeMo Run provides a powerful and pythonic Command-Line Interface (CLI) system that allows you to create entrypoints for both individual tasks and sequential experiments. This tutorial will guide you through the process of creating and using CLI entrypoints, demonstrating how to leverage NeMo Run's features to streamline your machine learning workflows. - -## Key Concepts - -Before diving into the examples, let's familiarize ourselves with some key concepts: - -1. **Entrypoints**: Functions decorated with `@run.cli.entrypoint` that serve as the main entry point for your CLI commands. -2. 
**Factories**: Functions decorated with `@run.cli.factory` that create and configure objects used in your entrypoints. They are registered for specific types and provide a way to create complex objects with default or customized configurations. (See [Step 2](#step-2-create-factory-functions) for more details) -3. **Partials**: Partially configured functions that allow for flexible argument passing and configuration. -4. **Experiments**: A collection of tasks that can be executed sequentially or in parallel. -5. **RunContext**: An object that manages the execution context for experiments, including executor and plugin configurations. - -## Single Task Entrypoint - -Let's start by creating a simple task entrypoint for training a model. - -```python -from dataclasses import dataclass -from typing import List - -@dataclass -class Model: - """Dummy model config""" - hidden_size: int - num_layers: int - activation: str - -@dataclass -class Optimizer: - """Dummy optimizer config""" - learning_rate: float - weight_decay: float - betas: List[float] -``` - -### Step 2: Create Factory Functions - -Next, we'll create factory functions to generate instances of our configuration classes. We'll demonstrate two approaches: one using the `@run.autoconvert` decorator, and one without. 
- -Here's an example of how to create and use factories: - -```python -import nemo_run as run - -@run.cli.factory -@run.autoconvert -def my_model( - hidden_size: int = 256, - num_layers: int = 3, - activation: str = 'relu' -) -> Model: - """Create a model configuration.""" - return Model(hidden_size=hidden_size, num_layers=num_layers, activation=activation) - -@run.cli.factory -def my_optimizer( - learning_rate: float = 0.001, - weight_decay: float = 1e-5, - betas: Sequence[float] = (0.9, 0.999,) -) -> run.Config[Optimizer]: - """Create an optimizer configuration.""" - return run.Config(Optimizer, learning_rate=learning_rate, weight_decay=weight_decay, betas=list(betas)) -``` - -In this example, we've created two factory functions: `my_model` and `my_optimizer`. Let's break down the two approaches: - -1. Using `@run.autoconvert` (my_model): - - The function is decorated with both `@run.cli.factory` and `@run.autoconvert`. - - The function returns a regular `Model` instance. - - `@run.autoconvert` automatically converts the return value to a `run.Config` object. - - This approach is more concise and allows you to write the function as if you were creating a regular instance. - -2. Without `@run.autoconvert` (my_optimizer): - - The function is only decorated with `@run.cli.factory`. - - The function explicitly returns a `run.Config[Optimizer]` object. - - You have more control over the creation of the `run.Config` object, but it requires more explicit code. - -Both approaches achieve the same result: they create factory functions that return `run.Config` objects. The choice between them depends on your preference and specific use case: - -- Use `@run.autoconvert` when you want to write your factory function in a more natural style, especially for complex objects. -- Use the explicit `run.Config` approach when you need more control over the configuration process or when you're dealing with more complex configuration scenarios. 
- -Key points about factories: -- They are registered for a specific type (e.g., Model, Optimizer) using `@run.cli.factory`. -- They can have default values and accept custom parameters. -- They return a `run.Config` object, which is then used to create the actual instance. -- Multiple factories can be registered for the same type, allowing for different configuration presets. - -These factory functions can now be used in our entrypoint function to provide default configurations, which can be overridden via CLI arguments. - -### Step 3: Define the task - -Now, let's create our main entrypoint for training the model: - -```python -@run.cli.entrypoint -def train_model( - model: Model = my_model(), - optimizer: Optimizer = my_optimizer(), - epochs: int = 10, - batch_size: int = 32 -) -> None: - """ - Train a model using the specified configuration. - - Args: - model: Configuration for the model. - optimizer: Configuration for the optimizer. - epochs: Number of training epochs. Defaults to 10. - batch_size: Batch size for training. Defaults to 32. - """ - print(f"Training model with the following configuration:") - print(f"Model: {model}") - print(f"Optimizer: {optimizer}") - print(f"Epochs: {epochs}") - print(f"Batch size: {batch_size}") - - # Simulating model training - for epoch in range(epochs): - print(f"Epoch {epoch + 1}/{epochs}") - - print("Training completed!") - -if __name__ == "__main__": - run.cli.main(train_model) -``` - -Let's break down this entrypoint function: - -1. `@run.cli.entrypoint`: This decorator marks the function as a CLI entrypoint, allowing it to be called directly from the command line. - -2. Function arguments: - - `model` and `optimizer` use our factory functions as default values. This means if no values are provided via CLI, these defaults will be used. - - `epochs` and `batch_size` have simple default values. - -3. Type annotations: Each argument has a type annotation, which NeMo Run uses to validate and convert CLI inputs. - -4. 
Docstring: The function includes a detailed docstring, which will be used to generate CLI help messages. - -5. Function body: This is where you would typically put your actual training logic. In this example, we're just printing the configuration and simulating a training loop. - -6. `if __name__ == "__main__":`: This block ensures the CLI is only run when the script is executed directly. - -7. `run.cli.main(train_model)`: This function call sets up and runs the CLI for our entrypoint. - -By structuring our entrypoint this way, we've created a flexible CLI that can accept various configurations for our model training task. Users can override any of these parameters from the command line, and the factory functions we defined earlier will be used to create the appropriate configurations. - -### Using the Single Task Entrypoint - -You can now use this entrypoint from the command line with various configurations. The CLI system supports a rich, Pythonic syntax that allows for complex configurations directly from the command line. - -1. Print help message: - ``` - python task.py --help - ``` - - ![task-help](./img/task-help.png) - -2. Basic usage with default values: - ``` - python task.py - ``` - - ![task-2](./img/task-2.png) - -3. Modifying specific parameters: - ``` - python task.py model.hidden_size=512 optimizer.learning_rate=0.01 epochs=20 - ``` - - ![task-3](./img/task-3.png) - -4. Using factory functions with custom arguments: - ``` - python task.py model="my_model(hidden_size=1024,activation='tanh')" optimizer="my_optimizer(learning_rate=0.005)" - ``` - - ![task-4](./img/task-4.png) - -5. Combining factory functions and direct parameter modifications: - ``` - python task.py model="my_model(hidden_size=1024)" model.num_layers=5 optimizer.weight_decay=1e-4 - ``` - - ![task-5](./img/task-5.png) - -6. 
Using Python-like operations on arguments: - ``` - python task.py "model.hidden_size*=2" optimizer.learning_rate/=10 batch_size+=16 - ``` - - ![task-6](./img/task-6.png) - -7. Setting list and dictionary values: - ``` - python task.py optimizer.betas=[0.9,0.999] - ``` - - ![task-7](./img/task-7.png) -8. Automatically open an IPython shell to modify the task configuration: - ``` - python task.py model=my_model optimizer=my_optimizer --repl - ``` - - ![task-repl](./img/task-repl.gif) - -These examples demonstrate the flexibility and Pythonic nature of the CLI system. You can: - -- Use dot notation to access nested attributes -- Call factory functions with custom arguments -- Perform arithmetic operations on numeric values -- Set list and dictionary values directly -- Interactively modify the task configuration using an IPython shell - -This powerful syntax allows you to create complex configurations directly from the command line, making it easy to experiment with different settings without modifying the source code. - -## Experiment Entrypoint - -Now, let's create a more complex entrypoint for an experiment that includes multiple tasks. - -### Step 1: Define the Experiment Entrypoint - -```python -import nemo_run as run -from typing import Sequence - -@run.cli.entrypoint(type="experiment") -def train_models_experiment( - ctx: run.cli.RunContext, - models: Sequence[Model] = (my_model(), my_model(hidden_size=512)), - optimizers: Sequence[Optimizer] = (my_optimizer(), my_optimizer(learning_rate=0.01)), - epochs: int = 10, - batch_size: int = 32, - sequential: bool = False, -) -> None: - """ - Run an experiment to train multiple models with different configurations. - - Args: - ctx: The run context for the experiment. - models: List of model configurations to train. - optimizers: List of optimizer configurations to use. - epochs: Number of training epochs for each model. - batch_size: Batch size for training. - sequential: Whether to run tasks sequentially or in parallel.
- """ - with run.Experiment("train_models_experiment") as exp: - for i, (model, optimizer) in enumerate(zip(models, optimizers)): - train = run.Partial( - train_model, model=model, optimizer=optimizer, epochs=epochs, batch_size=batch_size - ) - - exp.add(train, name=f"train_model_{i}", executor=ctx.executor) - - ctx.launch(exp, sequential=sequential) - -if __name__ == "__main__": - run.cli.main(train_models_experiment) -``` - -Let's break down this experiment entrypoint: - -1. `@run.cli.entrypoint(type="experiment")`: This decorator specifies that this is an experiment entrypoint. - -2. `ctx: run.cli.RunContext`: The first parameter is the run context, which manages the experiment execution. - -3. Function arguments: Include model configurations, optimizer configurations, and training parameters. - -4. `with run.Experiment("train_models_experiment") as exp:`: This context manager creates an experiment object. - -5. Inside the experiment context: - - We iterate over models and optimizers, creating a `run.Partial` object for each training task. - - `exp.add()` adds each task to the experiment, specifying a name and executor. - -6. `ctx.launch(exp, sequential=sequential)`: This launches the experiment, with the option to run tasks sequentially or in parallel. - -7. `run.cli.main(train_models_experiment)`: Sets up and runs the CLI for our experiment entrypoint. - -Key benefits of this approach: -- Flexibility: Easily add or modify models and optimizers to be tested. -- Reusability: The `train_model` function is reused for each configuration. -- Scalability: This structure can handle any number of model/optimizer combinations. -- Execution Control: The `sequential` parameter allows control over parallel or sequential execution. -- CLI Integration: All parameters can be adjusted via command-line arguments. - -### Using the Experiment Entrypoint - -You can use this experiment entrypoint from the command line with various configurations. Here are some examples: - -1. 
Run the experiment with default configurations: - ``` - python experiment.py - ``` - -2. Modify configurations for specific models or optimizers: - ``` - python experiment.py models[0].hidden_size=1024 optimizers[1].learning_rate=0.001 - ``` - -3. Add an additional model to the experiment: - ``` - python experiment.py "models+=[my_model(hidden_size=2048)]" - ``` - -4. Run the experiment with a specific executor: - ``` - python experiment.py ctx.executor=local_executor - ``` - -5. Run the experiment sequentially: - ``` - python experiment.py sequential=True - ``` - -These examples showcase how you can use the CLI to modify the experiment configuration, add or modify tasks, and control the execution environment. The experiment entrypoint provides a powerful way to manage complex workflows with multiple related tasks. - -## Advanced CLI Features - -NeMo Run's CLI system offers several advanced features to enhance your workflow: - -1. **Nested Configurations**: You can modify nested attributes using dot notation: - ``` - python experiment.py model.hidden_size=1024 optimizer.betas=[0.95,0.999] - ``` - -2. **Operations on Arguments**: You can perform operations on existing values: - ``` - python experiment.py model.hidden_size*=2 optimizer.learning_rate/=10 - ``` - -3. **Type Inference**: The CLI automatically infers and converts types based on the function signatures. - -4. **Help and Documentation**: Use the `--help` flag to see detailed information about the entrypoint and its arguments: - ``` - python experiment.py --help - ``` - -5. **Dry Runs**: Use the `--dryrun` flag to see what would be executed without actually running the experiment: - ``` - python experiment.py --dryrun - ``` - -6. **Interactive Mode**: Use the `--repl` flag to enter an interactive Python shell where you can modify the configuration before running: - ``` - python experiment.py --repl - ``` - -7. 
**Executor Configuration**: Specify different executors and their configurations: - ``` - python experiment.py executor=skypilot_executor executor.instance_type=p3.2xlarge - ``` - -8. **Plugin Support**: Add plugins to extend functionality: - ``` - python experiment.py plugins=wandb_logger plugins.project_name=my_experiment - ``` - -9. **Factory Functions**: Use factory functions to create complex objects with default configurations: - ``` - python experiment.py model=my_model optimizer=my_optimizer - ``` - -10. **Partial Functions**: Create partially configured functions for reuse in experiments: - ```python - train = run.Partial(train_model, model=model, optimizer=optimizer, epochs=train_epochs) - ``` - -## Error Handling - -NeMo Run provides robust error handling to help you identify and fix issues in your CLI usage: - -- `ArgumentParsingError`: Raised when there's an error parsing the initial argument structure. -- `TypeParsingError`: Raised when there's an error parsing the type of an argument. -- `OperationError`: Raised when there's an error performing an operation on an argument. -- `ArgumentValueError`: Raised when the value of a CLI argument is invalid. -- `UndefinedVariableError`: Raised when an operation is attempted on an undefined variable. -- `LiteralParseError`: Raised when parsing a Literal type fails. -- `ListParseError`: Raised when parsing a list fails. -- `DictParseError`: Raised when parsing a dict fails. -- `UnknownTypeError`: Raised when attempting to parse an unknown or unsupported type. - -These exceptions provide detailed error messages to help you quickly identify and resolve issues in your CLI usage. - -## Best Practices - -1. Use descriptive names for your entrypoints and factory functions. -2. Provide default values for arguments to make your CLI more user-friendly. -3. Use type annotations to ensure proper type checking and conversion. -4. 
Write clear docstrings for your entrypoints and factory functions to generate helpful CLI documentation. -5. Consider creating reusable factory functions for common configurations. -6. Use the `run.Partial` class to create flexible, reusable task configurations. -7. Leverage the `RunContext` object in experiments to manage execution settings and add tasks. -8. Use the `@run.autoconvert` decorator with factory functions to automatically convert returned objects to `run.Config` instances. -9. Take advantage of the `PythonicParser` for handling complex Python-like expressions in CLI arguments. -10. Implement custom parsers for specific types using the `TypeParser.register_parser` method when needed. - -## Conclusion - -NeMo Run's CLI system provides a powerful and flexible way to create and manage machine learning experiments. By leveraging entrypoints, factory functions, and the various CLI features, you can create intuitive and efficient command-line interfaces for your ML workflows. Experiment with different configurations, executors, and plugins to find the best setup for your projects. diff --git a/examples/entrypoint/experiment.py b/examples/entrypoint/experiment.py deleted file mode 100644 index 3e341957..00000000 --- a/examples/entrypoint/experiment.py +++ /dev/null @@ -1,107 +0,0 @@ -from dataclasses import dataclass -from typing import List - -import nemo_run as run - - -@dataclass -class Model: - """Dummy model config""" - - hidden_size: int - num_layers: int - activation: str - - -@dataclass -class Optimizer: - """Dummy optimizer config""" - - learning_rate: float - weight_decay: float - betas: List[float] - - -@run.cli.entrypoint -def train_model(model: Model, optimizer: Optimizer, epochs: int = 10, batch_size: int = 32): - """ - Train a model using the specified configuration. - - Args: - model (Model): Configuration for the model. - optimizer (Optimizer): Configuration for the optimizer. - epochs (int, optional): Number of training epochs. 
Defaults to 10. - batch_size (int, optional): Batch size for training. Defaults to 32. - """ - print("Training model with the following configuration:") - print(f"Model: {model}") - print(f"Optimizer: {optimizer}") - print(f"Epochs: {epochs}") - print(f"Batch size: {batch_size}") - - # Simulating model training - for epoch in range(epochs): - print(f"Epoch {epoch + 1}/{epochs}") - - print("Training completed!") - - -@run.cli.factory -@run.autoconvert -def my_model(hidden_size: int = 256, num_layers: int = 3, activation: str = "relu") -> Model: - """ - Create a model configuration. - """ - return Model(hidden_size=hidden_size, num_layers=num_layers, activation=activation) - - -@run.cli.factory -@run.autoconvert -def my_optimizer( - learning_rate: float = 0.001, weight_decay: float = 1e-5, betas: List[float] = [0.9, 0.999] -) -> Optimizer: - """ - Create an optimizer configuration. - """ - return Optimizer(learning_rate=learning_rate, weight_decay=weight_decay, betas=betas) - - -@run.cli.factory -@run.autoconvert -def local_executor() -> run.LocalExecutor: - return run.LocalExecutor() - - -@run.cli.entrypoint(type="experiment") -def train_models_experiment( - ctx: run.cli.RunContext, - models: List[Model] = [my_model(), my_model(hidden_size=512)], - optimizers: List[Optimizer] = [my_optimizer(), my_optimizer(learning_rate=0.01)], - epochs: int = 10, - batch_size: int = 32, - sequential: bool = False, -): - """ - Run an experiment to train multiple models with different configurations. - - Args: - ctx (run.RunContext): The run context for the experiment. - models (List[Model]): List of model configurations to train. - optimizers (List[Optimizer]): List of optimizer configurations to use. - epochs (int): Number of training epochs for each model. - batch_size (int): Batch size for training. 
- """ - - with run.Experiment("train_models_experiment") as exp: - for i, (model, optimizer) in enumerate(zip(models, optimizers)): - train = run.Partial( - train_model, model=model, optimizer=optimizer, epochs=epochs, batch_size=batch_size - ) - - exp.add(train, name=f"train_model_{i}", executor=ctx.executor) - - ctx.launch(exp, sequential=sequential) - - -if __name__ == "__main__": - run.cli.main(train_models_experiment, default_executor=local_executor()) diff --git a/examples/entrypoint/img/experiment-2.png b/examples/entrypoint/img/experiment-2.png deleted file mode 100644 index 6f7ef2cd..00000000 Binary files a/examples/entrypoint/img/experiment-2.png and /dev/null differ diff --git a/examples/entrypoint/img/experiment-3.png b/examples/entrypoint/img/experiment-3.png deleted file mode 100644 index 3e4bb635..00000000 Binary files a/examples/entrypoint/img/experiment-3.png and /dev/null differ diff --git a/examples/entrypoint/img/experiment-4.png b/examples/entrypoint/img/experiment-4.png deleted file mode 100644 index 10316908..00000000 Binary files a/examples/entrypoint/img/experiment-4.png and /dev/null differ diff --git a/examples/entrypoint/img/experiment-5.png b/examples/entrypoint/img/experiment-5.png deleted file mode 100644 index ed8babfd..00000000 Binary files a/examples/entrypoint/img/experiment-5.png and /dev/null differ diff --git a/examples/entrypoint/img/experiment-6.png b/examples/entrypoint/img/experiment-6.png deleted file mode 100644 index 11c5c3ed..00000000 Binary files a/examples/entrypoint/img/experiment-6.png and /dev/null differ diff --git a/examples/entrypoint/img/experiment-help.png b/examples/entrypoint/img/experiment-help.png deleted file mode 100644 index fbea542c..00000000 Binary files a/examples/entrypoint/img/experiment-help.png and /dev/null differ diff --git a/examples/entrypoint/img/task-2.png b/examples/entrypoint/img/task-2.png deleted file mode 100644 index 256622d9..00000000 Binary files a/examples/entrypoint/img/task-2.png 
and /dev/null differ diff --git a/examples/entrypoint/img/task-3.png b/examples/entrypoint/img/task-3.png deleted file mode 100644 index e574a609..00000000 Binary files a/examples/entrypoint/img/task-3.png and /dev/null differ diff --git a/examples/entrypoint/img/task-4.png b/examples/entrypoint/img/task-4.png deleted file mode 100644 index 521940f3..00000000 Binary files a/examples/entrypoint/img/task-4.png and /dev/null differ diff --git a/examples/entrypoint/img/task-5.png b/examples/entrypoint/img/task-5.png deleted file mode 100644 index 355b10ea..00000000 Binary files a/examples/entrypoint/img/task-5.png and /dev/null differ diff --git a/examples/entrypoint/img/task-6.png b/examples/entrypoint/img/task-6.png deleted file mode 100644 index 2c11ef28..00000000 Binary files a/examples/entrypoint/img/task-6.png and /dev/null differ diff --git a/examples/entrypoint/img/task-7.png b/examples/entrypoint/img/task-7.png deleted file mode 100644 index 7a959fbe..00000000 Binary files a/examples/entrypoint/img/task-7.png and /dev/null differ diff --git a/examples/entrypoint/img/task-help.png b/examples/entrypoint/img/task-help.png deleted file mode 100644 index 8f00f6e7..00000000 Binary files a/examples/entrypoint/img/task-help.png and /dev/null differ diff --git a/examples/entrypoint/img/task-repl.gif b/examples/entrypoint/img/task-repl.gif deleted file mode 100644 index 1b49847e..00000000 Binary files a/examples/entrypoint/img/task-repl.gif and /dev/null differ diff --git a/examples/entrypoint/task.py b/examples/entrypoint/task.py deleted file mode 100644 index ac89c713..00000000 --- a/examples/entrypoint/task.py +++ /dev/null @@ -1,84 +0,0 @@ -from dataclasses import dataclass -from typing import List - -import nemo_run as run - - -@dataclass -class Model: - """Dummy model config""" - - hidden_size: int - num_layers: int - activation: str - - -@dataclass -class Optimizer: - """Dummy optimizer config""" - - learning_rate: float - weight_decay: float - betas: List[float] 
- - -@run.cli.factory -@run.autoconvert -def my_model(hidden_size: int = 256, num_layers: int = 3, activation: str = "relu") -> Model: - """ - Create a model configuration. - """ - return Model(hidden_size=hidden_size, num_layers=num_layers, activation=activation) - - -@run.cli.factory -def my_optimizer( - learning_rate: float = 0.001, weight_decay: float = 1e-5, betas: List[float] = [0.9, 0.999] -) -> run.Config[Optimizer]: - """Create an optimizer configuration.""" - return run.Config( - Optimizer, learning_rate=learning_rate, weight_decay=weight_decay, betas=betas - ) - - -def train_model( - model: Model, - optimizer: Optimizer, - epochs: int = 10, - batch_size: int = 32, -): - """ - Train a model using the specified configuration. - - Args: - model (Model): Configuration for the model. - optimizer (Optimizer): Configuration for the optimizer. - epochs (int, optional): Number of training epochs. Defaults to 10. - batch_size (int, optional): Batch size for training. Defaults to 32. - """ - print("Training model with the following configuration:") - print(f"Model: {model}") - print(f"Optimizer: {optimizer}") - print(f"Epochs: {epochs}") - print(f"Batch size: {batch_size}") - - # Simulating model training - for epoch in range(epochs): - print(f"Epoch {epoch + 1}/{epochs}") - - print("Training completed!") - - -@run.cli.factory(target=train_model) -def train_recipe() -> run.Partial[train_model]: - return run.Partial( - train_model, - model=my_model(hidden_size=512), - optimizer=my_optimizer(learning_rate=0.0005), - epochs=50, - batch_size=2048, - ) - - -if __name__ == "__main__": - run.cli.main(train_model, cmd_defaults={"skip_confirmation": True}) diff --git a/examples/entrypoint/task.yaml b/examples/entrypoint/task.yaml deleted file mode 100644 index 93726d64..00000000 --- a/examples/entrypoint/task.yaml +++ /dev/null @@ -1,14 +0,0 @@ -model: - _factory_: "my_model" - hidden_size: 256 - num_layers: 3 - activation: "relu" - -optimizer: - _factory_: "my_optimizer" - 
learning_rate: 0.001 - weight_decay: 1e-5 - betas: [0.9, 0.999] - -epochs: 10 -batch_size: 32 diff --git a/examples/entrypoint/task_with_defaults.py b/examples/entrypoint/task_with_defaults.py deleted file mode 100644 index aaebabf3..00000000 --- a/examples/entrypoint/task_with_defaults.py +++ /dev/null @@ -1,98 +0,0 @@ -from dataclasses import dataclass -from typing import List - -import nemo_run as run - - -@dataclass -class Model: - """Dummy model config""" - - hidden_size: int - num_layers: int - activation: str - - -@dataclass -class Optimizer: - """Dummy optimizer config""" - - learning_rate: float - weight_decay: float - betas: List[float] - - -@run.cli.factory -@run.autoconvert -def my_model(hidden_size: int = 256, num_layers: int = 3, activation: str = "relu") -> Model: - """ - Create a model configuration. - """ - return Model(hidden_size=hidden_size, num_layers=num_layers, activation=activation) - - -@run.cli.factory -def my_optimizer( - learning_rate: float = 0.001, weight_decay: float = 1e-5, betas: List[float] = [0.9, 0.999] -) -> run.Config[Optimizer]: - """Create an optimizer configuration.""" - return run.Config( - Optimizer, learning_rate=learning_rate, weight_decay=weight_decay, betas=betas - ) - - -def train_model( - model: Model, - optimizer: Optimizer, - epochs: int = 10, - batch_size: int = 32, -): - """ - Train a model using the specified configuration. - - Args: - model (Model): Configuration for the model. - optimizer (Optimizer): Configuration for the optimizer. - epochs (int, optional): Number of training epochs. Defaults to 10. - batch_size (int, optional): Batch size for training. Defaults to 32. 
- """ - print("Training model with the following configuration:") - print(f"Model: {model}") - print(f"Optimizer: {optimizer}") - print(f"Epochs: {epochs}") - print(f"Batch size: {batch_size}") - - # Simulating model training - for epoch in range(epochs): - print(f"Epoch {epoch + 1}/{epochs}") - - print("Training completed!") - - -def custom_defaults() -> run.Partial[train_model]: - return run.Partial( - train_model, - model=my_model(hidden_size=512), - optimizer=my_optimizer(learning_rate=0.0005), - epochs=50, - batch_size=2048, - ) - - -@run.autoconvert -def local_executor() -> run.Executor: - return run.LocalExecutor() - - -class DummyPlugin(run.Plugin): - def setup(self, task: run.Partial[train_model], executor: run.Executor): - task.epochs *= 2 - - -if __name__ == "__main__": - run.cli.main( - train_model, - default_factory=custom_defaults, - default_executor=local_executor(), - default_plugins=run.Config(DummyPlugin), - ) diff --git a/examples/entrypoint/test.toml b/examples/entrypoint/test.toml deleted file mode 100644 index 99a1bc25..00000000 --- a/examples/entrypoint/test.toml +++ /dev/null @@ -1,16 +0,0 @@ -_partial_ = true -_target_ = "__main__.train_model" -batch_size = 32 -epochs = 10 - -[model] -_target_ = "__main__.Model" -activation = "relu" -hidden_size = 256 -num_layers = 5 - -[optimizer] -_target_ = "__main__.Optimizer" -betas = [ 0.9, 0.999,] -learning_rate = 0.001 -weight_decay = 1e-5 diff --git a/examples/hello-world/hello_experiments.ipynb b/examples/hello-world/hello_experiments.ipynb deleted file mode 100644 index 24d68ab7..00000000 --- a/examples/hello-world/hello_experiments.ipynb +++ /dev/null @@ -1,281 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Hello NeMo-Run Experiments!\n", - "\n", - "This is the second part of our hello world tutorial series for NeMo-Run. 
Please make sure that you have gone through the [first part](hello_world.ipynb) beforehand, since this tutorial builds heavily on it.\n", - "\n", - "A key component of NeMo-Run is `run.Experiment`. For an introduction to `run.Experiment`, refer to its docstring, which is also posted below:\n", - "\n", - "`run.Experiment` is a context manager to launch and manage multiple runs using pure Python. It offers researchers with a simple and flexible way to create and manage their ML experiments. Building on the core components of NeMo-Run, `run.Experiment` can be used as an umbrella under which users can launch different configured functions across multiple remote clusters.\n", - "\n", - "The `run.Experiment` context manager takes care of storing the run metadata, launching it on the specified cluster, and syncing the logs and artifacts. Additionally, `run.Experiment` also provides management tools to easily inspect and reproduce past experiments.\n", - "Some of the use cases that it enables are listed below:\n", - "\n", - "1. Check the status and logs of a past experiment.\n", - "2. Reproduce a past experiment and rerun it.\n", - "3. Reconstruct a past experiment and relaunch it after some changes.\n", - "4. Compare different runs of the same experiment.\n", - "\n", - "This API allows users to programmatically define their experiments entirely in Python. To illustrate the flexibility it provides, here are some use cases that can be supported by `run.Experiment` with just a few lines of code.\n", - "\n", - "1. Launch a benchmarking run on different GPUs at the same time in parallel.\n", - "2. Launch a sequential data processing pipeline on a CPU heavy cluster.\n", - "3. Launch hyperparameter grid search runs on a single cluster in parallel.\n", - "4. Launch hyperparameter search runs distributed across all available clusters.\n", - "\n", - "The docstring also includes some code examples. 
In this tutorial, we build on `add_object` from the previous tutorial to define a simple experiment and show its capabilities.\n", - "\n", - "Let's get into it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", - "# SPDX-License-Identifier: Apache-2.0\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# http://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License.\n", - "\n", - "# Set up and imports\n", - "import nemo_run as run\n", - "from simple.add import SomeObject, add_object, commonly_used_object" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Configure the Python Functions\n", - "\n", - "First, let's configure the functions we want to run in our experiments. You can configure multiple functions under an experiment. Here, we will configure two functions, which will be partials of `add_object` but with different parameters." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fn_1 = run.Partial(\n", - " add_object,\n", - " obj_1=commonly_used_object(),\n", - " obj_2=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - ")\n", - "fn_1" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "fn_2 = run.Partial(\n", - " add_object,\n", - " # You can also pass in the argument directly instead of as a Config.\n", - " # However, this will run any code inside the `__init__` or `__post_init__` methods of the classes (if its a class).\n", - " obj_1=SomeObject(value_1=1000, value_2=1000, value_3=1000),\n", - " obj_2=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - ")\n", - "\n", - "fn_2" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Define and Run the Experiment\n", - "\n", - "Now, let's say we want to run these two configured functions together and manage them under an experiment. We can do so with just a few lines of code shown below. Try running it: it will launch the two tasks sequentially and wait for them to complete.\n", - "Notice that we set `sequential=True`, this is because parallel execution mode is not supported on the local executor as of now. This is intentional as launching parallel processes on your local workstation can quickly eat up your limited resources.\n", - "However, our `SlurmExecutor` supports parallel mode, (and is set to `True` by default). This will allow you to run both of your configured functions in parallel. 
An example is shown below:\n", - "\n", - "```python\n", - "with run.Experiment(\"add_object\", executor=run.LocalExecutor()) as exp:\n", - " exp.add(fn_1, tail_logs=True)\n", - " exp.add(fn_2, tail_logs=True)\n", - " exp.run()\n", - "```\n", - "\n", - "Additionally, you can also launch the functions on separate executors as shown below:\n", - "\n", - "```python\n", - "with run.Experiment(\"add_object\", executor=run.LocalExecutor()) as exp:\n", - " exp.add(fn_1, tail_logs=True)\n", - "\n", - " exp.add(fn_2, executor=your_slurm_executor(), tail_logs=True)\n", - " exp.run()\n", - "```\n", - "\n", - "The executor and configured functions are cloned in `exp.add` so you can mutate them as needed. This allows you to overwrite some parameters quickly. See the example below:\n", - "```python\n", - "with run.Experiment(\"add_object\", executor=run.LocalExecutor()) as exp:\n", - " exp.add(fn_1, tail_logs=True)\n", - "\n", - " fn_1.obj_1.value_1 = 0\n", - " exp.add(fn_1, executor=your_slurm_executor(), tail_logs=True)\n", - " exp.run()\n", - "```\n", - "\n", - ">📝 Currently, we only support sequential and parallel execution in an experiment. Directed Acyclic Graph (DAG) based execution is not yet supported.\n", - ">📝 To run the tasks in an experiment in parallel, all executors should support parallel mode as of now. We will relax this restriction soon.\n", - "\n", - ">📝 By default, the experiment metadata is stored in your home folder `~` inside the `.nemo_run` folder. 
However, you can also store it in a separate dir by setting the `NEMORUN_HOME` environment variable.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "with run.Experiment(\"add_object\", executor=run.LocalExecutor()) as exp:\n", - " exp.add(fn_1, tail_logs=True)\n", - " exp.add(fn_2, tail_logs=True)\n", - " exp.run()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Inspect the Experiment\n", - "\n", - "Additionally, you can also reconstruct and inspect an old experiment. There are a few utilities which allow you to list and inspect an experiment run. \n", - "Run the cells below to see the current management capabilities." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# List all runs of an experiment\n", - "# The last suffix is the timestamp and results are sorted in ascending order of timestamps\n", - "run.Experiment.catalog(\"add_object\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Reconstruct an experiment and inspect its status, logs, etc\n", - "# if id is None, it will take the latest run.\n", - "# if id is provided, it will use that particular run.\n", - "# status and logs can be used outside the context manager too\n", - "with run.Experiment.from_title(\"add_object\") as exp:\n", - " exp.status()\n", - " exp.logs(job_id=\"simple.add.add_object\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a new run of an old experiment\n", - "exp = run.Experiment.from_title(\"add_object\")\n", - "with exp.reset():\n", - " exp.tasks[0].obj_1 = exp.tasks[1].obj_1.clone()\n", - " exp.run(sequential=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For more information on how to inspect and reproduce experiments, please refer to 
the [inspect experiment tutorial](../experiments/inspect-experiment.ipynb)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize the experiment configuration\n", - "exp" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Visualize tasks within the experiment\n", - "exp.tasks" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Diff two experiments\n", - "old_exp = run.Experiment.from_id(run.Experiment.catalog(\"add_object\")[-2])\n", - "exp.diff(old_exp, trim=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "exp.tasks[0].diff(old_exp.tasks[0], trim=False)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/examples/hello-world/hello_scripts.py b/examples/hello-world/hello_scripts.py deleted file mode 100644 index 4c7b622a..00000000 --- a/examples/hello-world/hello_scripts.py +++ /dev/null @@ -1,42 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from simple.add import SomeObject, add_object, commonly_used_object - -import nemo_run as run - -# This script defines an experiment that invokes three tasks in parallel, two scripts and a run.Partial. -# The example demonstrates how you can use scripts and run.Partial. -if __name__ == "__main__": - script = run.Script("./scripts/echo.sh") - inline_script = run.Script( - inline=""" -env -echo "Hello 1" -echo "Hello 2" -""" - ) - fn = run.Partial( - add_object, - obj_1=commonly_used_object(), - obj_2=run.Config(SomeObject, value_1=10, value_2=20, value_3=30), - ) - executor = run.LocalExecutor() - - with run.Experiment("experiment_with_scripts", executor=executor, log_level="WARN") as exp: - exp.add(script, tail_logs=True) - exp.add(inline_script, tail_logs=True) - exp.add(fn, tail_logs=True) - exp.run(detach=False) diff --git a/examples/hello-world/hello_world.ipynb b/examples/hello-world/hello_world.ipynb deleted file mode 100644 index 7f31de02..00000000 --- a/examples/hello-world/hello_world.ipynb +++ /dev/null @@ -1,1069 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Hello World! NeMo-Run Style!\n", - "\n", - "Let's start with a simple notebook that demonstrates how to use NeMo-Run to configure and launch your Python functions. In this example, we take simple addition functions that look like:\n", - "\n", - "```python\n", - "def add(a: int, b: int) -> int:\n", - " print(f\"Adding {a} to {b} returns {a + b}\")\n", - " return a + b\n", - "```\n", - "\n", - "and use NeMo-Run to configure and launch it. 
This basic notebook demonstrates many of the building blocks in our NeMo-Run library.\n", - "\n", - "As described in the introduction, NeMo-Run is a tool that allows you to:\n", - "1. Configure your functions or scripts in a Pythonic way.\n", - "2. Launch them on any supported remote cluster directly from your local workstation.\n", - "3. Manage them using `run.Experiment`.\n", - "\n", - "Let's get into it.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.\n", - "# SPDX-License-Identifier: Apache-2.0\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# http://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License.\n", - "\n", - "# Set up and imports\n", - "import logging\n", - "\n", - "import nemo_run as run\n", - "from simple.add import add\n", - "\n", - "logging.basicConfig(level=logging.INFO, format=\"%(asctime)s %(message)s\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Configure the Python Function\n", - "\n", - "The first step in using NeMo-Run is to configure your Python function. As mentioned above, we're trying to configure the `add` function. Configuration is similar to Python's native [functools.partial](https://docs.python.org/3/library/functools.html#functools.partial). 
In fact, for most functions, you can replace `functools.partial` with `run.Partial` and it should still work. Configuration just ties your function and your arguments together to create a `run.Partial` object which can be executed at a later time. This is done by building the `run.Partial` object which will recursively build any configured objects and then return a [`functools.partial` object](https://docs.python.org/3/library/functools.html#partial-objects).\n", - "\n", - "This brings us to the first building blocks of NeMo-Run: `Partial` and `Config`. These are buildables that allow you to configure functions, arguments, and objects in a Pythonic way. Under the hood, we use [fiddle](https://fiddle.readthedocs.io/en/latest/) to manage the configuration. Our `Partial` and `Config` classes are subclasses of `fiddle.Partial` and `fiddle.Config`, respectively, with additional features to enhance UX and programmability. We take some inspiration from [Praxis](https://github.com/google/praxis/blob/main/praxis/pax_fiddle.py#L72), but also have custom add-ons that should improve the user experience.\n", - "\n", - "We already discussed `run.Partial`. Similarly, `run.Config` takes a `fn` or `class` as the first argument, followed by the `fn`'s or `class`\' `*args`, and `**kwargs` as the subsequent arguments. However, on building `run.Config`, it actually calls the underlying `fn` or, in case of a `class`, its `__init__` method with the arguments tied into the `Config`. For example:\n", - "\n", - "\n", - "```python\n", - "import fiddle as fdl\n", - "import nemo_run as run\n", - "def hello_world(msg: str):\n", - " print(f\"Hello World! {msg}\")\n", - "\n", - "cfg = run.Config(hello_world, msg=\"How are you?\")\n", - "partial = run.Partial(hello_world, msg=\"How are you?\")\n", - "\n", - "fn = fdl.build(partial)\n", - "fn()\n", - "#>>> Hello World! How are you?\n", - "\n", - "built_cfg = fdl.build(cfg)\n", - "#>>> Hello World! 
How are you?\n", - "\n", - "print(built_cfg is None)\n", - "#>>> True\n", - "```\n", - "\n", - "The `Partial` and `Config` classes also come with utilities that provide a visual representation of the object (if you have `graphviz` installed). Try running the cell below to configure the `add` function and visualize the configured `Partial`." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "\n", - "Partial:\n", - " add\n", - "\n", - "\n", - "a\n", - "\n", - "5\n", - "\n", - "\n", - "b\n", - "\n", - "10\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fn = run.Partial(add, a=5, b=10)\n", - "fn" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The configured function is now ready for execution. Before we proceed, let's look at configuring complex functions that take non-primitive types. Additionally, we will look at another utility that NeMo-Run provides: `run.autoconvert`. This decorator allows you to automatically convert functions that return regular Python objects into functions that return a `run.Config` of the underlying object. To demonstrate this, let's consider an `add_object` function that looks like:\n", - "\n", - "```python\n", - "@dataclass\n", - "class SomeObject:\n", - " value_1: int\n", - " value_2: int\n", - " value_3: int\n", - "\n", - "\n", - "def add_object(obj_1: SomeObject, obj_2: SomeObject) -> SomeObject:\n", - " result = SomeObject(\n", - " value_1=obj_1.value_1 + obj_2.value_1,\n", - " value_2=obj_1.value_2 + obj_2.value_2,\n", - " value_3=obj_1.value_3 + obj_2.value_3,\n", - " )\n", - " print(f\"{result = }\")\n", - "\n", - " return result\n", - "```\n", - "\n", - "To configure this function, you can use `run.Partial`. 
Note, however, that you need to ensure the arguments to the function are `run.Config` or `run.Partial` instances. This is necessary for serialization and remote execution, but more on that later. For now, a basic configuration for `add_object` looks like:\n", - "\n", - "```python\n", - "run.Partial(\n", - " add_object,\n", - " obj_1=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - " obj_2=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - ")\n", - "```\n", - "\n", - "Now, let's say you have a regular Python function that returns `SomeObject` and you want to use it as one of the arguments to `add_object`. You can decorate it using `run.autoconvert` as follows:\n", - "```python\n", - "@run.autoconvert\n", - "def commonly_used_object() -> SomeObject:\n", - " return SomeObject(\n", - " value_1=5,\n", - " value_2=10,\n", - " value_3=15,\n", - " )\n", - "```\n", - "\n", - "The `run.autoconvert` decorator uses fiddle's autoconfig and parses the AST so that you get the following:\n", - "\n", - "```python\n", - "commonly_used_object() == run.Config(SomeObject, value_1=5, value_2=10, value_3=15)\n", - "```\n", - "\n", - "\n", - "You can then use the decorated function as follows:\n", - "```python\n", - "run.Partial(\n", - " add_object,\n", - " obj_1=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - " obj_2=commonly_used_object(),\n", - ")\n", - "```\n", - "\n", - "You can also use arguments in the function. Note, however, that `run.autoconvert` currently doesn't support control flow and may be unreliable for complex code. 
In that case, you can define a function to return the `run.Config` directly, as follows:\n", - "\n", - "```python\n", - "def commonly_used_config() -> run.Config[SomeObject]:\n", - " config = run.Config(\n", - " SomeObject,\n", - " value_1=5,\n", - " value_2=10,\n", - " value_3=15,\n", - " )\n", - "\n", - " for i in range(10):\n", - " config.value_1 *= i\n", - " config.value_2 += i\n", - " config.value_3 -= i\n", - "\n", - " return config\n", - "```\n", - "\n", - "> 📝 `autoconvert` returns a `run.Config` of the underlying object by default, whereas `@autoconvert(partial=True)` returns a `run.Partial` of the underlying object.\n", - "\n", - "Finally, the arguments of a `run.Partial` or `run.Config` can be mutated via dot access, as shown below:\n", - "```python\n", - "fn.obj_1.value_1 = 100\n", - "fn.obj_2.value_2 *= 2\n", - "...\n", - "```\n", - "\n", - "As you can see, our tool is designed to provide a lot of flexibility. Run the cells shown below to experiment with these APIs and learn about the different ways to configure your task or function." 
- ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "1\n", - "\n", - "\n", - "Config:\n", - " SomeObject\n", - "\n", - "\n", - "value_1\n", - "\n", - "10\n", - "\n", - "\n", - "value_2\n", - "\n", - "20\n", - "\n", - "\n", - "value_3\n", - "\n", - "30\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "\n", - "Partial:\n", - " add_object\n", - "\n", - "\n", - "obj_1\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "obj_2\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "0:c--1:c\n", - "\n", - "\n", - "\n", - "\n", - "2\n", - "\n", - "\n", - "Config:\n", - " SomeObject\n", - "\n", - "\n", - "value_1\n", - "\n", - "10\n", - "\n", - "\n", - "value_2\n", - "\n", - "20\n", - "\n", - "\n", - "value_3\n", - "\n", - "30\n", - "\n", - "\n", - "\n", - "0:c--2:c\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - ",\n", - " obj_2=)]>" - ] - }, - "execution_count": 12, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from simple.add import add_object, commonly_used_object, commonly_used_object_2, SomeObject\n", - "\n", - "fn = run.Partial(\n", - " add_object,\n", - " obj_1=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - " obj_2=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - ")\n", - "fn" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "1\n", - "\n", - "\n", - "Config:\n", - " SomeObject\n", - "\n", - "\n", - "value_1\n", - "\n", - "100\n", - "\n", - "\n", - "value_2\n", - "\n", - "10\n", - "\n", - "\n", - "value_3\n", - "\n", - "15\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "\n", - "Partial:\n", - " add_object\n", - "\n", - "\n", - "obj_1\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "obj_2\n", - 
"\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "0:c--1:c\n", - "\n", - "\n", - "\n", - "\n", - "2\n", - "\n", - "\n", - "Config:\n", - " SomeObject\n", - "\n", - "\n", - "value_1\n", - "\n", - "500\n", - "\n", - "\n", - "value_2\n", - "\n", - "1000\n", - "\n", - "\n", - "value_3\n", - "\n", - "1500\n", - "\n", - "\n", - "\n", - "0:c--2:c\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - ",\n", - " obj_2=)]>" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fn = run.Partial(add_object, obj_1=commonly_used_object(), obj_2=commonly_used_object_2())\n", - "fn.obj_1.value_1 = 100\n", - "fn" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "1\n", - "\n", - "\n", - "Config:\n", - " SomeObject\n", - "\n", - "\n", - "value_1\n", - "\n", - "5\n", - "\n", - "\n", - "value_2\n", - "\n", - "10\n", - "\n", - "\n", - "value_3\n", - "\n", - "15\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "\n", - "Partial:\n", - " add_object\n", - "\n", - "\n", - "obj_1\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "obj_2\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "0:c--1:c\n", - "\n", - "\n", - "\n", - "\n", - "2\n", - "\n", - "\n", - "Config:\n", - " SomeObject\n", - "\n", - "\n", - "value_1\n", - "\n", - "10\n", - "\n", - "\n", - "value_2\n", - "\n", - "20\n", - "\n", - "\n", - "value_3\n", - "\n", - "30\n", - "\n", - "\n", - "\n", - "0:c--2:c\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - ",\n", - " obj_2=)]>" - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "fn = run.Partial(\n", - " add_object,\n", - " obj_1=commonly_used_object(),\n", - " obj_2=run.Config(SomeObject, value_1=10, value_2=20, value_3=30),\n", - ")\n", - "fn" - ] - }, - { - "cell_type": "markdown", - "metadata": 
{}, - "source": [ - "## Execute the Configured Function\n", - "\n", - "The previous section concludes the core features of the NeMo-Run configuration. Now, let's move on to execution. The core building block to execute a single function is `run.run`.\n", - "\n", - "The easiest way to execute a configured function is to execute it directly, just like you would execute a normal Python function. You can do this using the `direct` option in `run.run` as shown below:\n", - "```python\n", - "# You can also do a dryrun to see what's getting executed where.\n", - "run.run(fn, direct=True, dryrun=True)\n", - "run.run(fn, direct=True)\n", - "```\n", - "\n", - "> 📝 As of now, the `run` function doesn't recreate the return value, so it is the user's responsibility to manage the artifacts from the function. For example, if you are launching a function to train an ML model, make sure you configure a job directory that you can access and inspect later. We are working on improving the management capabilities of a run and open to feedback, so please reach out to us if you have any ideas.\n", - "\n", - "Try it out in the cell below." - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
Dry run for task simple.add:add_object\n",
-       "
\n" - ], - "text/plain": [ - "\u001b[1;36mDry run for task simple.\u001b[0m\u001b[1;36madd:add\u001b[0m\u001b[1;36m_object\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
Resolved Arguments\n",
-       "
\n" - ], - "text/plain": [ - "\u001b[1;32mResolved Arguments\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n",
-       "┃ Argument Name         Resolved Value                                               ┃\n",
-       "┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n",
-       "│ obj_1                SomeObject(value_1=5, value_2=10, value_3=15)                │\n",
-       "│ obj_2                SomeObject(value_1=10, value_2=20, value_3=30)               │\n",
-       "└──────────────────────┴──────────────────────────────────────────────────────────────┘\n",
-       "
\n" - ], - "text/plain": [ - "┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓\n", - "┃\u001b[1;35m \u001b[0m\u001b[1;35mArgument Name \u001b[0m\u001b[1;35m \u001b[0m┃\u001b[1;35m \u001b[0m\u001b[1;35mResolved Value \u001b[0m\u001b[1;35m \u001b[0m┃\n", - "┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩\n", - "│\u001b[2m \u001b[0m\u001b[2mobj_1 \u001b[0m\u001b[2m \u001b[0m│ \u001b[1;35mSomeObject\u001b[0m\u001b[1m(\u001b[0m\u001b[33mvalue_1\u001b[0m=\u001b[1;36m5\u001b[0m, \u001b[33mvalue_2\u001b[0m=\u001b[1;36m10\u001b[0m, \u001b[33mvalue_3\u001b[0m=\u001b[1;36m15\u001b[0m\u001b[1m)\u001b[0m │\n", - "│\u001b[2m \u001b[0m\u001b[2mobj_2 \u001b[0m\u001b[2m \u001b[0m│ \u001b[1;35mSomeObject\u001b[0m\u001b[1m(\u001b[0m\u001b[33mvalue_1\u001b[0m=\u001b[1;36m10\u001b[0m, \u001b[33mvalue_2\u001b[0m=\u001b[1;36m20\u001b[0m, \u001b[33mvalue_3\u001b[0m=\u001b[1;36m30\u001b[0m\u001b[1m)\u001b[0m │\n", - "└──────────────────────┴──────────────────────────────────────────────────────────────┘\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Now real run\n", - "result = SomeObject(value_1=15, value_2=30, value_3=45)\n" - ] - } - ], - "source": [ - "# You can also do a dryrun to see what's getting executed where.\n", - "run.run(fn, direct=True, dryrun=True)\n", - "print(\"Now real run\")\n", - "run.run(fn, direct=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Direct runs are not that interesting, as you wouldn't need to use `run` if you were only running the function directly. Beyond direct runs, we support a few executors out of the box:\n", - "1. Local executor: This will run the function locally, but in a separate process. It is useful when you are using components like `torchrun` to launch your job/run.\n", - "1. Slurm executor: This will run the function on a remote Slurm cluster. 
Currently, only Slurm clusters that have [Pyxis](https://github.com/NVIDIA/pyxis) are supported, but we plan to add support for other types of Slurm clusters in the future. Reach out to us if you have a specific request.\n", - "1. Skypilot executor: This will run the function on any cloud supported by [Skypilot](https://skypilot.readthedocs.io/en/latest/).\n", - "\n", - "To learn more about executors, go to the execution guide linked in [README](../../README.md).\n", - "\n", - "> 📝 Currently, we do not pickle the function, but rely on Python module references for serializing the configured function. This means that executing a function defined in this notebook (or in the same script calling `run.run`) will probably not work. During packaging, we currently use `git archive` for our remote Slurm executor. You can specify a subpath which will be set as the working directory for the execution. To execute `add_object` function in this example, you need to provide a subpath of `examples/hello-world`, assuming `examples` is at the root of the repository. This will ensure that, during remote execution, imports of the style `from simple.add import add_object` work properly. We are looking to improve this user experience and open to suggestions, so please reach out to us if you have any ideas. We also plan to add `cloudpickle` support for arbitrarily defined functions.\n", - "\n", - "Under the hood, we are using [TorchX](https://pytorch.org/torchx/latest/) to manage the execution. However, this is fairly abstracted away from the user and we can potentially support more standalone executor libraries or add custom schedulers for TorchX in the future.\n", - "\n", - "Let's see the local executor in action. The local executor is the simplest executor we have and can be initialized without any arguments. 
Once we have an instance of `run.LocalExecutor`, we can pass it to `run.run` to execute the same configured function on the local executor, which will run the function in a separate process. This demonstrates the ease of use provided by NeMo-Run. You can configure a function once and execute it on any supported remote cluster seamlessly. Later, we will also explore `run.Experiment` which allows you to combine and manage multiple runs, providing additional flexibility to the user.\n", - "\n", - "Execute the cells below to run your function on the local executor." - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "image/svg+xml": [ - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "1\n", - "\n", - "\n", - "Config:\n", - " Packager\n", - "\n", - "\n", - "debug\n", - "\n", - "False\n", - "\n", - "\n", - "\n", - "0\n", - "\n", - "\n", - "Config:\n", - " LocalExecutor\n", - "\n", - "\n", - "packager\n", - "\n", - "\n", - "\n", - "\n", - "\n", - "launcher\n", - "\n", - "None\n", - "\n", - "\n", - "env_vars\n", - "\n", - "\n", - "\n", - "dict\n", - "\n", - "\n", - "retries\n", - "\n", - "0\n", - "\n", - "\n", - "experiment_id\n", - "\n", - "None\n", - "\n", - "\n", - "job_dir\n", - "\n", - "''\n", - "\n", - "\n", - "ntasks_per_node\n", - "\n", - "1\n", - "\n", - "\n", - "\n", - "0:c--1:c\n", - "\n", - "\n", - "\n", - "\n" - ], - "text/plain": [ - "LocalExecutor(packager=Packager(debug=False), launcher=None, env_vars={}, retries=0, experiment_id=None, job_dir='', experiment_dir='', _launcher_setup=False, ntasks_per_node=1)" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "executor = run.LocalExecutor()\n", - "executor" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
─────────────── Entering Experiment simple.add.add_object with id: simple.add.add_object_1730126761 ───────────────\n",
-       "
\n" - ], - "text/plain": [ - "\u001b[92m─────────────── \u001b[0m\u001b[1;35mEntering Experiment simple.add.add_object with id: simple.add.add_object_1730126761\u001b[0m\u001b[92m ───────────────\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Log directory is: /Users/andreyma/.nemo_run/experiments/simple.add.add_object/simple.add.add_object_1730126761/simple.add.add_object\n" - ] - }, - { - "data": { - "text/html": [ - "
[15:46:01] Launching job simple.add.add_object for experiment simple.add.add_object               experiment.py:660\n",
-       "
\n" - ], - "text/plain": [ - "\u001b[2;36m[15:46:01]\u001b[0m\u001b[2;36m \u001b[0m\u001b[1;36mLaunching job simple.add.add_object for experiment simple.add.add_object\u001b[0m \u001b]8;id=741146;file:///Users/andreyma/workspace/nvidia/NeMo-Run/src/nemo_run/run/experiment.py\u001b\\\u001b[2mexperiment.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=572721;file:///Users/andreyma/workspace/nvidia/NeMo-Run/src/nemo_run/run/experiment.py#660\u001b\\\u001b[2m660\u001b[0m\u001b]8;;\u001b\\\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Log directory is: /Users/andreyma/.nemo_run/experiments/simple.add.add_object/simple.add.add_object_1730126761/simple.add.add_object\n", - "Launched app: local_persistent://nemo_run/simple.add.add_object-ccgx567ghm3nwd\n", - "AppStatus:\n", - " State: RUNNING\n", - " Num Restarts: 0\n", - " Roles: \n", - " Msg: \n", - " Structured Error Msg: \n", - " UI URL: file:///Users/andreyma/.nemo_run/experiments/simple.add.add_object/simple.add.add_object_1730126761/simple.add.add_object/nemo_run/simple.add.add_object-ccgx567ghm3nwd\n", - " \n" - ] - }, - { - "data": { - "text/html": [ - "
──────────────────────── Waiting for Experiment simple.add.add_object_1730126761 to finish ────────────────────────\n",
-       "
\n" - ], - "text/plain": [ - "\u001b[92m──────────────────────── \u001b[0m\u001b[1;35mWaiting for Experiment simple.add.add_object_1730126761 to finish\u001b[0m\u001b[92m ────────────────────────\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-       "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
Experiment Status for simple.add.add_object_1730126761\n",
-       "
\n" - ], - "text/plain": [ - "\u001b[1;32mExperiment Status for\u001b[0m \u001b[1;38;5;214msimple.add.add_object_1730126761\u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-       "Task 0: simple.add.add_object\n",
-       "- Status: RUNNING\n",
-       "- Executor: LocalExecutor\n",
-       "- Job id: simple.add.add_object-ccgx567ghm3nwd\n",
-       "- Local Directory: /Users/andreyma/.nemo_run/experiments/simple.add.add_object/simple.add.add_object_1730126761/simple.add.add_object\n",
-       "
\n" - ], - "text/plain": [ - "\n", - "\u001b[1;32mTask 0\u001b[0m: \u001b[1;38;5;214msimple.add.add_object\u001b[0m\n", - "- \u001b[1;32mStatus\u001b[0m: RUNNING\n", - "- \u001b[1;32mExecutor\u001b[0m: LocalExecutor\n", - "- \u001b[1;32mJob id\u001b[0m: simple.add.add_object-ccgx567ghm3nwd\n", - "- \u001b[1;32mLocal Directory\u001b[0m: /Users/andreyma/.nemo_run/experiments/simple.add.add_object/simple.add.add_object_1730126761/simple.add.add_object\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
\n",
-       "
\n" - ], - "text/plain": [ - "\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Waiting for job simple.add.add_object-ccgx567ghm3nwd to finish [log=True]...\n" - ] - }, - { - "name": "stdout", - "output_type": "stream", - "text": [ - "add_object/0 result = SomeObject(value_1=15, value_2=30, value_3=45)\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Job simple.add.add_object-ccgx567ghm3nwd finished: SUCCEEDED\n" - ] - }, - { - "data": { - "text/html": [ - "
                                                                                                                   \n",
-       "# The experiment was run with the following tasks: ['simple.add.add_object']                                       \n",
-       "# You can inspect and reconstruct this experiment at a later point in time using:                                  \n",
-       "experiment = run.Experiment.from_id(\"simple.add.add_object_1730126761\")                                            \n",
-       "experiment.status() # Gets the overall status                                                                      \n",
-       "experiment.logs(\"simple.add.add_object\") # Gets the log for the provided task                                      \n",
-       "experiment.cancel(\"simple.add.add_object\") # Cancels the provided task if still running                            \n",
-       "                                                                                                                   \n",
-       "
\n" - ], - "text/plain": [ - "\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;149;144;119;48;2;39;40;34m# The experiment was run with the following tasks: ['simple.add.add_object']\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect and reconstruct this experiment at a later point in time using:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m=\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mrun\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mExperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mfrom_id\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msimple.add.add_object_1730126761\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the overall status\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - 
"\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msimple.add.add_object\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Gets the log for the provided task\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;255;70;137;48;2;39;40;34m.\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m(\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34msimple.add.add_object\u001b[0m\u001b[38;2;230;219;116;48;2;39;40;34m\"\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m)\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;149;144;119;48;2;39;40;34m# Cancels the provided task if still running\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[48;2;39;40;34m \u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "
                                                                                                                   \n",
-       "# You can inspect this experiment at a later point in time using the CLI as well:                                  \n",
-       "nemo experiment status simple.add.add_object_1730126761                                                            \n",
-       "nemo experiment logs simple.add.add_object_1730126761 0                                                            \n",
-       "nemo experiment cancel simple.add.add_object_1730126761 0                                                          \n",
-       "                                                                                                                   \n",
-       "
\n" - ], - "text/plain": [ - "\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;149;144;119;48;2;39;40;34m# You can inspect this experiment at a later point in time using the CLI as well:\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mstatus\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msimple.add.add_object_1730126761\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mlogs\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msimple.add.add_object_1730126761\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[38;2;248;248;242;48;2;39;40;34mnemo\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mexperiment\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34mcancel\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34msimple.add.add_object_1730126761\u001b[0m\u001b[38;2;248;248;242;48;2;39;40;34m \u001b[0m\u001b[38;2;174;129;255;48;2;39;40;34m0\u001b[0m\u001b[48;2;39;40;34m \u001b[0m\n", - "\u001b[48;2;39;40;34m \u001b[0m\n" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "run.run(fn, executor)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you notice the logs, it has mentions of Experiment. 
Each `run.run` internally creates an experiment with a single task to provide management capabilities for a run out of the box. We also make the `run.Experiment` API available publicly to create custom experiments and workflows. Check out the next tutorial in this series, [`hello_experiments.ipynb`](hello_experiments.ipynb), to learn more about experiments." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In the same way, you can execute your configured function on a remote Slurm cluster or a SkyPilot cluster. \n", - "\n", - "Below, we show an example of defining a `run.SlurmExecutor`. NeMo-Run automatically sets up an `ssh` tunnel to connect to the Slurm cluster and handles packaging of the code, so you never have to leave your local workstation. You can configure the Slurm executor as shown below:\n", - "\n", - "```python\n", - "tunnel_cfg = run.Config(\n", - " TunnelConfig,\n", - " host=os.environ[\"SLURM_HOST\"],\n", - " user=os.environ[\"SLURM_USER\"],\n", - " job_dir=os.environ[\"SLURM_JOBDIR\"],\n", - ")\n", - "packager = run.Config(\n", - " GitArchivePackager,\n", - " use_torchrun=False,\n", - " subpath=\"examples/hello-world\"\n", - ")\n", - "\n", - "executor = run.Config(\n", - " SlurmExecutorConfig,\n", - " account=os.environ[\"SLURM_ACCT\"],\n", - " partition=os.environ[\"SLURM_PARTITION\"],\n", - " nodes=1,\n", - " ntasks_per_node=1,\n", - " tunnel=tunnel_cfg,\n", - " container_image=os.environ[\"BASE_IMAGE\"],\n", - " time=\"00:30:00\",\n", - " packager=packager,\n", - ")\n", - "```\n", - "\n", - "More details about configuring executors can be found in the `API Reference` portion of our docs."
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": ".venv", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/examples/hello-world/scripts/echo.sh b/examples/hello-world/scripts/echo.sh deleted file mode 100644 index d32f85cd..00000000 --- a/examples/hello-world/scripts/echo.sh +++ /dev/null @@ -1,18 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -#!/bin/bash - -echo "Hello" diff --git a/examples/hello-world/simple/__init__.py b/examples/hello-world/simple/__init__.py deleted file mode 100644 index 47f1c65a..00000000 --- a/examples/hello-world/simple/__init__.py +++ /dev/null @@ -1,15 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - diff --git a/examples/hello-world/simple/add.py b/examples/hello-world/simple/add.py deleted file mode 100644 index 264630de..00000000 --- a/examples/hello-world/simple/add.py +++ /dev/null @@ -1,59 +0,0 @@ -# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: Apache-2.0 -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -from dataclasses import dataclass - -import nemo_run as run - - -def add(a: int, b: int) -> int: - print(f"Adding {a} to {b} returns {a + b}") - return a + b - - -@dataclass -class SomeObject: - value_1: int - value_2: int - value_3: int - - -def add_object(obj_1: SomeObject, obj_2: SomeObject) -> SomeObject: - result = SomeObject( - value_1=obj_1.value_1 + obj_2.value_1, - value_2=obj_1.value_2 + obj_2.value_2, - value_3=obj_1.value_3 + obj_2.value_3, - ) - print(f"{result = }") - - return result - - -@run.autoconvert -def commonly_used_object() -> SomeObject: - return SomeObject( - value_1=5, - value_2=10, - value_3=15, - ) - - -@run.autoconvert -def commonly_used_object_2() -> SomeObject: - return SomeObject( - value_1=500, - value_2=1000, - value_3=1500, - )