diff --git a/torchci/lib/bot/README.md b/torchci/lib/bot/README.md
new file mode 100644
index 0000000000..a173682597
--- /dev/null
+++ b/torchci/lib/bot/README.md
@@ -0,0 +1,362 @@
+# PyTorch Bot Architecture Analysis
+
+## Overview
+
+The PyTorch bot is a GitHub webhook automation system built with **Probot** that manages CI/CD workflows, code reviews, and development operations for the PyTorch ecosystem. It's deployed as a Next.js application on Vercel and integrates with multiple external services.
+
+## Core Architecture
+
+### Entry Points
+
+- **Main Entry**: `lib/bot/index.ts:17` - Registers all bot modules with Probot
+- **Command Handler**: `lib/bot/pytorchBot.ts:6` - Handles `@pytorchbot` commands via comments and reviews
+- **Command Parser**: `lib/bot/cliParser.ts:15` - Parses bot commands using argparse-style CLI interface
+
+### Command System
+
+The bot supports these primary commands:
+
+- **`merge`** - Merges PRs with approval validation and force-merge capabilities
+- **`revert`** - Reverts merged PRs with classification tracking
+- **`rebase`** - Rebases PRs onto target branches
+- **`label`** - Adds labels with permission validation
+- **`cherry-pick`** - Cherry-picks PRs to release branches
+- **`drci`** - Updates Dr. CI status comments
+
+### Permission System (`lib/bot/utils.ts:248`)
+
+- **Write Permissions**: Admin/write collaborators can use force-merge, ignore-current flags
+- **Rebase Permissions**: Write permissions OR non-first-time contributors
+- **Workflow Permissions**: Write permissions OR users with approved pull runs
+- **Authorization Tracking**: Uses GitHub's collaborator permission API
+
+## Bot Modules
+
+### Core Command Bots
+
+1. **pytorchBotHandler** (`lib/bot/pytorchBotHandler.ts:41`) - Central command processor
+2. **cliParser** (`lib/bot/cliParser.ts:7`) - Command-line interface parser
+
+### Automation Bots
+
+3. **autoLabelBot** - Smart labeling based on file changes and patterns
+4.
**autoCcBot** - Auto-CC users based on label subscriptions
+5. **retryBot** - Intelligent CI retry using flakiness analytics
+6. **ciflowPushTrigger** - Git tag management for CI flow triggers
+7. **cancelWorkflowsOnCloseBot** - Resource cleanup on PR closure
+
+### CI Integration Bots
+
+8. **triggerCircleCIWorkflows** - CircleCI pipeline integration
+9. **triggerInductorTestsBot** - PyTorch Inductor test triggering
+10. **verifyDisableTestIssueBot** - Test disabling authorization
+
+### Security & Review Bots
+
+11. **stripApprovalBot** - Removes approvals on PR reopen
+12. **codevNoWritePermBot** - Notifies about permission requirements
+13. **drciBot** - Dr. CI dashboard integration
+
+### Infrastructure Bots
+
+14. **webhookToDynamo** - Event logging to DynamoDB
+15. **pytorchbotLogger** - Bot action logging
+
+## Detailed Bot Analysis
+
+### 1. autoLabelBot.ts
+
+**Primary Purpose:** Automatically assigns labels to pull requests and issues based on various criteria including file paths, titles, and patterns.
+
+**Key Features:**
+
+- **Title-based labeling**: Matches PR/issue titles against regex patterns to assign relevant labels
+- **File-based labeling**: Analyzes changed files to assign module-specific and release note labels
+- **Repository-specific rules**: Applies custom labeling rules based on the repository
+- **CIFlow integration**: Assigns ciflow/\* labels based on changed files (e.g., MPS, H100 symmetry memory tests)
+- **Release notes categorization**: Automatically categorizes PRs for release notes (PyTorch-specific)
+- **Permission filtering**: Only applies CI flow labels if the author has appropriate permissions
+
+**GitHub Webhooks:**
+
+- `issues.labeled`, `issues.opened`, `issues.edited`
+- `pull_request.opened`, `pull_request.edited`, `pull_request.synchronize`
+
+**Special Logic:** Filters CI flow labels based on user permissions and workflow approval status
+
+### 2.
autoCcBot.ts
+
+**Primary Purpose:** Automatically CC (carbon copy) relevant users when specific labels are applied to issues or PRs.
+
+**Key Features:**
+
+- **Subscription management**: Loads user subscriptions from a tracking issue
+- **Dynamic CC lists**: Updates CC lists in issue/PR descriptions based on applied labels
+- **Self-removal**: Prevents users from being CC'd on their own issues/PRs
+- **Incremental updates**: Only adds new CCs, preserving existing ones
+
+**GitHub Webhooks:**
+
+- `issues.labeled`
+- `pull_request.labeled`
+
+**Special Logic:** Parses subscription data from a configured tracking issue and maintains CC lists without duplicating existing mentions
+
+### 3. retryBot.ts
+
+**Primary Purpose:** Intelligently retries failed CI workflows and jobs based on failure patterns and flakiness analysis.
+
+**Key Features:**
+
+- **Smart retry logic**: Distinguishes between infrastructure failures and code-related failures
+- **Flaky job detection**: Queries ClickHouse for flaky job data from previous workflows
+- **Configurable workflows**: Only retries workflows specified in configuration
+- **Failure threshold**: Limits retries when too many jobs fail (>5 jobs)
+- **Branch-specific behavior**: Different retry logic for main branch vs. feature branches
+- **Always-retry jobs**: Specific jobs that are retried regardless of failure type
+
+**GitHub Webhooks:**
+
+- `workflow_run.completed`
+
+**Special Logic:** Uses ML/analytics data from ClickHouse to make intelligent retry decisions
+
+### 4. ciflowPushTrigger.ts
+
+**Primary Purpose:** Manages Git tags that trigger CI workflows based on CI flow labels applied to PRs.
+
+**Key Features:**
+
+- **Tag synchronization**: Creates/updates Git tags when CI flow labels are added
+- **Permission validation**: Ensures only authorized users can trigger CI flows
+- **Tag cleanup**: Removes tags when labels are removed or PRs are closed
+- **Configuration validation**: Validates labels against configured allowed CI flow tags
+- **Permission-based filtering**: Removes CI flow labels from unauthorized PRs
+
+**GitHub Webhooks:**
+
+- `pull_request.labeled`, `pull_request.unlabeled`
+- `pull_request.synchronize`, `pull_request.opened`, `pull_request.reopened`, `pull_request.closed`
+
+**Special Logic:** Creates tags in format `ciflow/label/PR_NUMBER` to trigger downstream CI systems
+
+### 5. triggerCircleCIWorkflows.ts
+
+**Primary Purpose:** Integrates with CircleCI by triggering workflows based on GitHub events and labels.
+
+**Key Features:**
+
+- **Label-to-parameter mapping**: Converts GitHub labels to CircleCI pipeline parameters
+- **Branch/tag filtering**: Different behavior for push events vs. pull requests
+- **Configuration-driven**: Uses YAML config to define label-to-parameter mappings
+- **Fork handling**: Special handling for PRs from forked repositories
+- **Default parameters**: Supports default parameter values for workflows
+
+**GitHub Webhooks:**
+
+- `pull_request.labeled`, `pull_request.synchronize`
+- `push`
+
+**Special Logic:** Translates GitHub repository state into CircleCI pipeline parameters using configurable mappings
+
+### 6. triggerInductorTestsBot.ts
+
+**Primary Purpose:** Allows authorized users to trigger PyTorch Inductor tests via comment commands.
+
+**Key Features:**
+
+- **Comment-based triggering**: Responds to `@pytorch run pytorch tests` comments
+- **User authorization**: Restricts access to pre-approved users and repositories
+- **Cross-repository workflow**: Triggers workflows in pytorch/pytorch-integration-testing
+- **Commit handling**: Uses appropriate commit SHAs for different repositories
+- **Error handling**: Provides feedback on success/failure of test triggering
+
+**GitHub Webhooks:**
+
+- `issue_comment.created`
+
+**Special Logic:** Security-focused with explicit allowlists for users and repositories
+
+### 7. cancelWorkflowsOnCloseBot.ts
+
+**Primary Purpose:** Cancels running GitHub Actions workflows when PRs are closed to save compute resources.
+
+**Key Features:**
+
+- **Automatic cancellation**: Cancels all running workflows associated with a PR's head SHA
+- **Bot exclusions**: Doesn't cancel workflows for bot users (pytorchbot, pytorchmergebot)
+- **Repository filtering**: Only operates on pytorch/pytorch repository
+- **Merge detection**: Skips cancellation for PRs that were actually merged
+- **Batch processing**: Cancels multiple workflows concurrently
+
+**GitHub Webhooks:**
+
+- `pull_request.closed`
+
+**Special Logic:** Prevents unnecessary resource usage by canceling workflows for closed/abandoned PRs
+
+### 8. verifyDisableTestIssueBot.ts
+
+**Primary Purpose:** Validates and processes issues that request disabling or marking tests as unstable.
+
+**Key Features:**
+
+- **Title parsing**: Recognizes DISABLED and UNSTABLE prefixes in issue titles
+- **Authorization validation**: Checks if users have permission to disable tests
+- **Validation comments**: Posts detailed validation information about the disable request
+- **Auto-closure**: Automatically closes unauthorized disable requests
+- **Multi-format support**: Handles single test disables and aggregate disable issues
+
+**GitHub Webhooks:**
+
+- `issues.opened`, `issues.edited`
+
+**Special Logic:** Critical security component that ensures only authorized users can disable CI tests
+
+### 9. stripApprovalBot.ts
+
+**Primary Purpose:** Removes PR approvals when PRs are reopened to ensure fresh review.
+
+**Key Features:**
+
+- **Approval dismissal**: Automatically dismisses all existing approvals on PR reopening
+- **Permission-based**: Only acts on PRs from users without write permissions
+- **Notification messages**: Provides clear explanation for why approvals were removed
+- **Security-focused**: Ensures that reopened PRs (potentially after reverts) get fresh review
+
+**GitHub Webhooks:**
+
+- `pull_request.reopened`
+
+**Special Logic:** Maintains code review integrity by requiring fresh approvals after PR reopening
+
+### 10. codevNoWritePermBot.ts
+
+**Primary Purpose:** Notifies Phabricator/Codev users when they need GitHub write permissions for CI.
+
+**Key Features:**
+
+- **Differential detection**: Recognizes PRs exported from Phabricator (Differential Revision markers)
+- **Permission checking**: Verifies if the author has write permissions
+- **Helpful messaging**: Provides links to internal documentation for getting permissions
+- **Repository filtering**: Only operates on pytorch/pytorch repository
+
+**GitHub Webhooks:**
+
+- `pull_request.opened`
+
+**Special Logic:** Bridges the gap between internal Facebook/Meta development workflow and external GitHub CI requirements
+
+### 11. drciBot.ts
+
+**Primary Purpose:** Manages Dr.
CI (Diagnostic CI) comments that provide comprehensive PR status information.
+
+**Key Features:**
+
+- **Status aggregation**: Creates/updates comprehensive status comments on PRs
+- **Integration with DrCI utilities**: Leverages external DrCI infrastructure
+- **PR state tracking**: Only operates on open PRs
+- **URL integration**: Links to external Dr. CI dashboard
+
+**GitHub Webhooks:**
+
+- `pull_request.opened`, `pull_request.synchronize`
+
+**Special Logic:** Serves as the interface between GitHub PRs and the comprehensive Dr. CI dashboard system
+
+### 12. webhookToDynamo.ts
+
+**Primary Purpose:** Logs GitHub webhook events to DynamoDB tables for analytics and auditing.
+
+**Key Features:**
+
+- **Comprehensive logging**: Captures workflow runs, jobs, issues, PRs, comments, and reviews
+- **Structured storage**: Organizes data into specific DynamoDB tables by event type
+- **Key prefixing**: Prevents conflicts by prefixing keys with repository information
+- **Label tracking**: Special handling for label events with timestamp tracking
+- **UUID generation**: Uses UUIDs for events that don't have natural unique identifiers
+
+**GitHub Webhooks:**
+
+- `workflow_job`, `workflow_run`, `issues`, `issue_comment`
+- `pull_request`, `pull_request_review`, `pull_request_review_comment`, `push`
+
+**Special Logic:** Forms the foundation of the analytics and monitoring infrastructure by persisting all relevant GitHub events
+
+## External Integrations
+
+### Data Storage
+
+- **DynamoDB**: Event logging, bot action tracking (`lib/bot/pytorchbotLogger.ts:4`)
+- **ClickHouse**: CI analytics, flaky test data queries (`lib/bot/pytorchBotHandler.ts:5`)
+
+### CI Systems
+
+- **GitHub Actions**: Workflow triggering via repository dispatch events
+- **CircleCI**: Parameter-based workflow triggering
+- **Dr.
CI**: Comprehensive status dashboard integration
+
+### Configuration Management
+
+- **Repository Configs**: `.github/pytorch-probot.yml` files (`lib/bot/utils.ts:64`)
+- **Cached Config Tracker**: Performance optimization for config loading (`lib/bot/utils.ts:46`)
+- **Label Subscriptions**: Issue-based user subscription management
+
+## Key Features
+
+### Intelligent Merge System
+
+- **Approval Validation**: Reviews from COLLABORATOR+ required for PyTorch repos
+- **Force Merge**: Admin-only with audit trail and reason requirement
+- **CI Flow Labels**: Automatic trunk/pull label management
+- **Branch Targeting**: Supports viable/strict and main branch merging
+
+### Smart Retry Logic (`retryBot.ts`)
+
+- **Flakiness Analysis**: Queries historical data to identify infrastructure failures
+- **Selective Retrying**: Only retries jobs likely to succeed on retry
+- **Branch-specific Rules**: Different behavior for main vs. feature branches
+
+### Permission-based Security
+
+- **Multi-tier Authorization**: Different permission levels for different actions
+- **First-time Contributor Handling**: Restricted permissions for new contributors
+- **Audit Logging**: All bot actions logged to DynamoDB
+
+### Auto-labeling Intelligence
+
+- **File Pattern Matching**: Assigns module labels based on changed files
+- **CI Flow Detection**: Automatic ciflow/\* label assignment
+- **Release Note Categorization**: Automated release note classification
+
+## Data Flow
+
+1. **GitHub Webhook** → **Probot App** → **Bot Module Router**
+2. **Command Parsing** → **Permission Validation** → **Action Execution**
+3. **External API Calls** (GitHub, CircleCI, ClickHouse)
+4.
**Event Logging** (DynamoDB) + **Response** (GitHub reactions/comments)
+
+## Integration Architecture
+
+These bots work together as a cohesive CI/CD and development workflow system:
+
+- **Permission System**: Multiple bots check `hasWritePermissions` and `hasApprovedPullRuns` for security
+- **Configuration Management**: Many bots use `CachedConfigTracker` for repository-specific settings
+- **Event Coordination**: Bots respond to related events (e.g., label changes trigger multiple bots)
+- **Data Analytics**: Several bots feed data to ClickHouse and DynamoDB for decision-making
+- **External Integrations**: Connect GitHub to CircleCI, Dr. CI dashboard, and internal Meta systems
+
+## Deployment Context
+
+- **Platform**: Vercel (Next.js)
+- **Framework**: Probot (GitHub Apps framework)
+- **Language**: TypeScript with modern ES modules
+- **Monitoring**: DynamoDB logging + external Dr. CI dashboard
+
+## Configuration Files
+
+- `Constants.ts:1` - Cherry-pick and revert classifications
+- `subscriptions.ts:1` - Label subscription parsing utilities
+- Repository-specific configs loaded via `CachedConfigTracker`
+
+This bot ecosystem provides comprehensive automation for the PyTorch development workflow, balancing developer productivity with security and code quality requirements through intelligent automation and robust permission systems.
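
The permission tiers listed under "Permission System" can be read as a small decision table. The sketch below is illustrative only and is not the code in `lib/bot/utils.ts`: the `AuthorFacts` interface, the `BotAction` names, and the `isAllowed` helper are hypothetical, and the flags are assumed to be precomputed rather than fetched from GitHub's collaborator permission API as the real helpers presumably do. It follows the "Permission System" bullets, which allow force-merge for any admin/write collaborator.

```typescript
// Hypothetical input: facts about the commenting user, assumed precomputed.
// The real bot derives these via GitHub API calls (not shown here).
interface AuthorFacts {
  hasWritePermissions: boolean; // admin or write collaborator
  isFirstTimeContributor: boolean;
  hasApprovedPullRuns: boolean;
}

// Hypothetical action names, chosen to mirror the README's three tiers.
type BotAction = "force-merge" | "rebase" | "trigger-workflows";

// One rule per tier, transcribed from the "Permission System" bullets.
const rules: Record<BotAction, (a: AuthorFacts) => boolean> = {
  // Write Permissions: only admin/write collaborators
  "force-merge": (a) => a.hasWritePermissions,
  // Rebase Permissions: write permissions OR non-first-time contributors
  rebase: (a) => a.hasWritePermissions || !a.isFirstTimeContributor,
  // Workflow Permissions: write permissions OR approved pull runs
  "trigger-workflows": (a) => a.hasWritePermissions || a.hasApprovedPullRuns,
};

function isAllowed(action: BotAction, author: AuthorFacts): boolean {
  return rules[action](author);
}

// Example: a returning contributor without write access.
const returning: AuthorFacts = {
  hasWritePermissions: false,
  isFirstTimeContributor: false,
  hasApprovedPullRuns: true,
};
console.log(isAllowed("rebase", returning)); // true
console.log(isAllowed("force-merge", returning)); // false
```

Keeping the rules in one table makes the "write permissions OR ..." pattern visible at a glance; an actual implementation would likely be asynchronous because each fact requires an API call.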