362 changes: 362 additions & 0 deletions torchci/lib/bot/README.md
@@ -0,0 +1,362 @@
# PyTorch Bot Architecture Analysis

## Overview

The PyTorch bot is a GitHub webhook automation system built with **Probot** that manages CI/CD workflows, code reviews, and development operations for the PyTorch ecosystem. It is deployed as a Next.js application on Vercel and integrates with DynamoDB, ClickHouse, CircleCI, and the Dr. CI dashboard.

## Core Architecture

### Entry Points

- **Main Entry**: `lib/bot/index.ts:17` - Registers all bot modules with Probot
- **Command Handler**: `lib/bot/pytorchBot.ts:6` - Handles `@pytorchbot` commands via comments and reviews
- **Command Parser**: `lib/bot/cliParser.ts:15` - Parses bot commands using argparse-style CLI interface

### Command System

The bot supports these primary commands:

- **`merge`** - Merges PRs with approval validation and force-merge capabilities
- **`revert`** - Reverts merged PRs with classification tracking
- **`rebase`** - Rebases PRs onto target branches
- **`label`** - Adds labels with permission validation
- **`cherry-pick`** - Cherry-picks PRs to release branches
- **`drci`** - Updates Dr. CI status comments
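
As a rough illustration of the command list above, the sketch below pulls a command and its arguments out of a comment. It is a hand-rolled simplification, not the argparse-style parser in `cliParser.ts`; `parseBotCommand`, `ParsedBotCommand`, and `KNOWN_COMMANDS` are hypothetical names, and quoted arguments are not handled.

```typescript
// Hypothetical sketch: NOT the parser used in cliParser.ts.
const KNOWN_COMMANDS = ["merge", "revert", "rebase", "label", "cherry-pick", "drci"];

interface ParsedBotCommand {
  command: string;
  args: string[];
}

// Looks at the first line that starts with `@pytorchbot` and returns its
// command and whitespace-separated arguments, or null if there is no such
// line or the command is not one of the six listed above.
function parseBotCommand(commentBody: string): ParsedBotCommand | null {
  const lines = commentBody.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const match = /^\s*@pytorchbot\s+(\S+)(.*)$/i.exec(lines[i]);
    if (match === null) continue;
    const command = match[1].toLowerCase();
    if (KNOWN_COMMANDS.indexOf(command) === -1) return null;
    const rest = match[2].trim();
    const args = rest.length > 0 ? rest.split(/\s+/) : [];
    return { command: command, args: args };
  }
  return null;
}
```

A comment with no `@pytorchbot` line, or with an unrecognized command, yields `null`, so the caller can ignore it without further checks.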

### Permission System (`lib/bot/utils.ts:248`)

- **Write Permissions**: Collaborators with admin or write access can use the force-merge and ignore-current flags
- **Rebase Permissions**: Write permissions OR non-first-time contributors
- **Workflow Permissions**: Write permissions OR users with approved pull runs
- **Authorization Tracking**: Uses GitHub's collaborator permission API
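
The three tiers above can be sketched as plain predicates. This is an assumption-laden illustration, not the code in `utils.ts`: `ActorInfo` and the three function names are hypothetical, and the `permission` values mirror the `admin`/`write`/`read`/`none` strings returned by GitHub's collaborator permission endpoint.

```typescript
// Hypothetical sketch of the permission tiers; not the implementation in utils.ts.
type RepoPermission = "admin" | "write" | "read" | "none";

interface ActorInfo {
  permission: RepoPermission; // from the collaborator permission API
  isFirstTimeContributor: boolean;
  hasApprovedPullRuns: boolean;
}

// Write tier: admin or write collaborators.
function canWrite(actor: ActorInfo): boolean {
  return actor.permission === "admin" || actor.permission === "write";
}

// Rebase tier: write permissions OR not a first-time contributor.
function canRebase(actor: ActorInfo): boolean {
  return canWrite(actor) || !actor.isFirstTimeContributor;
}

// Workflow tier: write permissions OR previously approved pull runs.
function canTriggerWorkflows(actor: ActorInfo): boolean {
  return canWrite(actor) || actor.hasApprovedPullRuns;
}
```

Keeping each tier as its own predicate makes the "write OR something else" pattern explicit and easy to test in isolation.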

## Bot Modules

### Core Command Bots

1. **pytorchBotHandler** (`lib/bot/pytorchBotHandler.ts:41`) - Central command processor
2. **cliParser** (`lib/bot/cliParser.ts:7`) - Command-line interface parser

### Automation Bots

3. **autoLabelBot** - Smart labeling based on file changes and patterns
4. **autoCcBot** - Auto-CC users based on label subscriptions
5. **retryBot** - Intelligent CI retry using flakiness analytics
6. **ciflowPushTrigger** - Git tag management for CI flow triggers
7. **cancelWorkflowsOnCloseBot** - Resource cleanup on PR closure

### CI Integration Bots

8. **triggerCircleCIWorkflows** - CircleCI pipeline integration
9. **triggerInductorTestsBot** - PyTorch Inductor test triggering
10. **verifyDisableTestIssueBot** - Test disabling authorization

### Security & Review Bots

11. **stripApprovalBot** - Removes approvals on PR reopen
12. **codevNoWritePermBot** - Notifies about permission requirements
13. **drciBot** - Dr. CI dashboard integration

### Infrastructure Bots

14. **webhookToDynamo** - Event logging to DynamoDB
15. **pytorchbotLogger** - Bot action logging

## Detailed Bot Analysis

### 1. autoLabelBot.ts

**Primary Purpose:** Automatically assigns labels to pull requests and issues based on various criteria including file paths, titles, and patterns.

**Key Features:**

- **Title-based labeling**: Matches PR/issue titles against regex patterns to assign relevant labels
- **File-based labeling**: Analyzes changed files to assign module-specific and release note labels
- **Repository-specific rules**: Applies custom labeling rules based on the repository
- **CIFlow integration**: Assigns ciflow/\* labels based on changed files (e.g., MPS, H100 symmetric memory tests)
- **Release notes categorization**: Automatically categorizes PRs for release notes (PyTorch-specific)
- **Permission filtering**: Only applies CI flow labels if the author has appropriate permissions

**GitHub Webhooks:**

- `issues.labeled`, `issues.opened`, `issues.edited`
- `pull_request.opened`, `pull_request.edited`, `pull_request.synchronize`

**Special Logic:** Filters CI flow labels based on user permissions and workflow approval status
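
A minimal sketch of the title-based, file-based, and permission-filtered labeling described above. The rule shapes, the `collectLabels` name, and any concrete patterns or label names passed to it are hypothetical; the real rules live in `autoLabelBot.ts` and repository configuration.

```typescript
// Hypothetical rule shapes for illustration only.
interface TitleLabelRule {
  pattern: RegExp; // avoid the `g` flag: RegExp.test keeps state across calls
  label: string;
}

interface PathLabelRule {
  pathPrefix: string;
  label: string;
}

function collectLabels(
  title: string,
  changedFiles: string[],
  titleRules: TitleLabelRule[],
  pathRules: PathLabelRule[],
  authorMayTriggerCi: boolean
): string[] {
  const labels: string[] = [];
  const addLabel = (label: string): void => {
    // Drop ciflow/* labels when the author may not trigger CI.
    if (label.indexOf("ciflow/") === 0 && !authorMayTriggerCi) return;
    if (labels.indexOf(label) === -1) labels.push(label);
  };
  for (const rule of titleRules) {
    if (rule.pattern.test(title)) addLabel(rule.label);
  }
  for (const rule of pathRules) {
    const hit = changedFiles.some((file) => file.indexOf(rule.pathPrefix) === 0);
    if (hit) addLabel(rule.label);
  }
  return labels;
}
```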

### 2. autoCcBot.ts

**Primary Purpose:** Automatically CC (carbon copy) relevant users when specific labels are applied to issues or PRs.

**Key Features:**

- **Subscription management**: Loads user subscriptions from a tracking issue
- **Dynamic CC lists**: Updates CC lists in issue/PR descriptions based on applied labels
- **Self-removal**: Prevents users from being CC'd on their own issues/PRs
- **Incremental updates**: Only adds new CCs, preserving existing ones

**GitHub Webhooks:**

- `issues.labeled`
- `pull_request.labeled`

**Special Logic:** Parses subscription data from a configured tracking issue and maintains CC lists without duplicating existing mentions
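
The self-removal and incremental-update rules above might look roughly like this. The subscription map shape and the names `LabelSubscriptions`, `isMentioned`, and `computeNewCcs` are assumptions for illustration; how the bot actually parses the tracking issue is not shown here.

```typescript
// Hypothetical shape: label name -> GitHub logins subscribed to it.
type LabelSubscriptions = { [label: string]: string[] };

// Assumes GitHub-style logins (letters, digits, hyphens), so the login can be
// placed in a regular expression without escaping.
function isMentioned(body: string, user: string): boolean {
  return new RegExp("@" + user + "(?![A-Za-z0-9-])", "i").test(body);
}

// Returns logins that should be newly CC'd: subscribed to an applied label,
// not the author, and not already mentioned in the description.
function computeNewCcs(
  subscriptions: LabelSubscriptions,
  appliedLabels: string[],
  author: string,
  body: string
): string[] {
  const toAdd: string[] = [];
  for (const label of appliedLabels) {
    const users = Object.prototype.hasOwnProperty.call(subscriptions, label)
      ? subscriptions[label]
      : [];
    for (const user of users) {
      if (user.toLowerCase() === author.toLowerCase()) continue; // self-removal
      if (isMentioned(body, user)) continue; // preserve existing CCs
      if (toAdd.indexOf(user) === -1) toAdd.push(user); // no duplicates
    }
  }
  return toAdd;
}
```

The negative lookahead in `isMentioned` keeps `@alice` from matching inside `@alice2`, which a plain substring search would get wrong.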

### 3. retryBot.ts

**Primary Purpose:** Intelligently retries failed CI workflows and jobs based on failure patterns and flakiness analysis.

**Key Features:**

- **Smart retry logic**: Distinguishes between infrastructure failures and code-related failures
- **Flaky job detection**: Queries ClickHouse for flaky job data from previous workflows
- **Configurable workflows**: Only retries workflows specified in configuration
- **Failure threshold**: Limits retries when too many jobs fail (>5 jobs)
- **Branch-specific behavior**: Different retry logic for main branch vs. feature branches
- **Always-retry jobs**: Specific jobs that are retried regardless of failure type

**GitHub Webhooks:**

- `workflow_run.completed`

**Special Logic:** Uses flaky-job data queried from ClickHouse to decide which failed jobs are worth retrying
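
The retry rules listed above can be sketched as a single filter. The names `FailedJobInfo` and `selectJobsToRetry` are hypothetical, and treating the more-than-five-failures case as "retry nothing" is an assumption; the description says only that retries are limited in that case. Branch-specific behavior is left out.

```typescript
// Hypothetical sketch of the retry decision; not the logic in retryBot.ts.
interface FailedJobInfo {
  jobName: string;
  looksLikeInfraFailure: boolean;
}

const MAX_FAILED_JOBS = 5;

function selectJobsToRetry(
  failedJobs: FailedJobInfo[],
  knownFlakyJobs: string[], // e.g. job names returned by a ClickHouse query
  alwaysRetryJobs: string[]
): string[] {
  // Assumption: when too many jobs fail, the failure is probably real.
  if (failedJobs.length > MAX_FAILED_JOBS) return [];
  return failedJobs
    .filter(
      (job) =>
        alwaysRetryJobs.indexOf(job.jobName) !== -1 ||
        job.looksLikeInfraFailure ||
        knownFlakyJobs.indexOf(job.jobName) !== -1
    )
    .map((job) => job.jobName);
}
```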

### 4. ciflowPushTrigger.ts

**Primary Purpose:** Manages Git tags that trigger CI workflows based on CI flow labels applied to PRs.

**Key Features:**

- **Tag synchronization**: Creates/updates Git tags when CI flow labels are added
- **Permission validation**: Ensures only authorized users can trigger CI flows
- **Tag cleanup**: Removes tags when labels are removed or PRs are closed
- **Configuration validation**: Validates labels against configured allowed CI flow tags
- **Permission-based filtering**: Removes CI flow labels from unauthorized PRs

**GitHub Webhooks:**

- `pull_request.labeled`, `pull_request.unlabeled`
- `pull_request.synchronize`, `pull_request.opened`, `pull_request.reopened`, `pull_request.closed`

**Special Logic:** Creates tags in format `ciflow/label/PR_NUMBER` to trigger downstream CI systems
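
A small sketch of the tag naming and validation described above. It assumes that the `ciflow/label` part of the tag format is the full label name (for example, a label `ciflow/trunk` on PR 123 gives `ciflow/trunk/123`); `ciflowTagName` and `ciflowTagRef` are hypothetical helper names.

```typescript
// Hypothetical helpers; the actual logic lives in ciflowPushTrigger.ts.

// Returns the tag name for a ciflow label on a PR, or null when the label is
// not a ciflow label or is not in the configured allowlist.
function ciflowTagName(
  label: string,
  prNumber: number,
  allowedLabels: string[]
): string | null {
  if (label.indexOf("ciflow/") !== 0) return null;
  if (allowedLabels.indexOf(label) === -1) return null;
  return label + "/" + prNumber;
}

// Full Git ref for a tag name; tags live under refs/tags/.
function ciflowTagRef(tagName: string): string {
  return "refs/tags/" + tagName;
}
```

Because the PR number is part of the tag, the same label on two different PRs produces two distinct tags, which is what lets tags be cleaned up per PR.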

### 5. triggerCircleCIWorkflows.ts

**Primary Purpose:** Integrates with CircleCI by triggering workflows based on GitHub events and labels.

**Key Features:**

- **Label-to-parameter mapping**: Converts GitHub labels to CircleCI pipeline parameters
- **Branch/tag filtering**: Different behavior for push events vs. pull requests
- **Configuration-driven**: Uses YAML config to define label-to-parameter mappings
- **Fork handling**: Special handling for PRs from forked repositories
- **Default parameters**: Supports default parameter values for workflows

**GitHub Webhooks:**

- `pull_request.labeled`, `pull_request.synchronize`
- `push`

**Special Logic:** Translates GitHub repository state into CircleCI pipeline parameters using configurable mappings
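
The label-to-parameter translation can be illustrated as follows. The config shape (`LabelParamConfig`) and the function name are assumptions, and the sketch sets mapped parameters to `true`; the real YAML schema and parameter types may differ.

```typescript
// Hypothetical config shape, loosely modeled on a label -> parameter mapping.
type PipelineParameters = { [param: string]: boolean | string };

interface LabelParamConfig {
  defaults: PipelineParameters;
  labelToParam: { [label: string]: string };
}

// Starts from the default parameters, then switches on the parameter mapped
// to each label present on the PR. Unmapped labels are ignored.
function buildPipelineParameters(
  labels: string[],
  config: LabelParamConfig
): PipelineParameters {
  const params: PipelineParameters = {};
  for (const key of Object.keys(config.defaults)) {
    params[key] = config.defaults[key];
  }
  for (const label of labels) {
    if (Object.prototype.hasOwnProperty.call(config.labelToParam, label)) {
      params[config.labelToParam[label]] = true;
    }
  }
  return params;
}
```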

### 6. triggerInductorTestsBot.ts

**Primary Purpose:** Allows authorized users to trigger PyTorch Inductor tests via comment commands.

**Key Features:**

- **Comment-based triggering**: Responds to `@pytorch run pytorch tests` comments
- **User authorization**: Restricts access to pre-approved users and repositories
- **Cross-repository workflow**: Triggers workflows in pytorch/pytorch-integration-testing
- **Commit handling**: Uses appropriate commit SHAs for different repositories
- **Error handling**: Provides feedback on success/failure of test triggering

**GitHub Webhooks:**

- `issue_comment.created`

**Special Logic:** Security-focused with explicit allowlists for users and repositories
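
The allowlist gate described above reduces to three checks. In this sketch the trigger phrase is taken from the feature list, while `shouldTriggerInductorTests` and its parameters are hypothetical; the real allowlists are defined in the bot itself.

```typescript
// Hypothetical sketch of the authorization gate.
const INDUCTOR_TRIGGER_PHRASE = "@pytorch run pytorch tests";

function shouldTriggerInductorTests(
  commentBody: string,
  commenterLogin: string,
  repoFullName: string,
  allowedUsers: string[],
  allowedRepos: string[]
): boolean {
  // All three conditions must hold: phrase present, user and repo allowlisted.
  return (
    commentBody.indexOf(INDUCTOR_TRIGGER_PHRASE) !== -1 &&
    allowedUsers.indexOf(commenterLogin) !== -1 &&
    allowedRepos.indexOf(repoFullName) !== -1
  );
}
```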

### 7. cancelWorkflowsOnCloseBot.ts

**Primary Purpose:** Cancels running GitHub Actions workflows when PRs are closed to save compute resources.

**Key Features:**

- **Automatic cancellation**: Cancels all running workflows associated with a PR's head SHA
- **Bot exclusions**: Doesn't cancel workflows for bot users (pytorchbot, pytorchmergebot)
- **Repository filtering**: Only operates on pytorch/pytorch repository
- **Merge detection**: Skips cancellation for PRs that were actually merged
- **Batch processing**: Cancels multiple workflows concurrently

**GitHub Webhooks:**

- `pull_request.closed`

**Special Logic:** Prevents unnecessary resource usage by canceling workflows for closed/abandoned PRs
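
The filters above can be combined into one predicate. The description does not say which login the bot exclusion checks (the PR author or the account that closed the PR), so this sketch takes a single `actorLogin` field as an assumption; `ClosedPullRequestInfo` and `shouldCancelWorkflows` are hypothetical names.

```typescript
// Hypothetical sketch of the cancellation decision.
const EXCLUDED_BOT_LOGINS = ["pytorchbot", "pytorchmergebot"];

interface ClosedPullRequestInfo {
  repoFullName: string;
  merged: boolean;
  actorLogin: string; // assumption: the login the bot exclusion applies to
}

function shouldCancelWorkflows(pr: ClosedPullRequestInfo): boolean {
  if (pr.repoFullName !== "pytorch/pytorch") return false; // repository filter
  if (pr.merged) return false; // merged PRs keep their runs
  return EXCLUDED_BOT_LOGINS.indexOf(pr.actorLogin) === -1; // bot exclusion
}
```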

### 8. verifyDisableTestIssueBot.ts

**Primary Purpose:** Validates and processes issues that request disabling or marking tests as unstable.

**Key Features:**

- **Title parsing**: Recognizes DISABLED and UNSTABLE prefixes in issue titles
- **Authorization validation**: Checks if users have permission to disable tests
- **Validation comments**: Posts detailed validation information about the disable request
- **Auto-closure**: Automatically closes unauthorized disable requests
- **Multi-format support**: Handles single test disables and aggregate disable issues

**GitHub Webhooks:**

- `issues.opened`, `issues.edited`

**Special Logic:** Critical security component that ensures only authorized users can disable CI tests
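
A minimal sketch of the title parsing and auto-closure rule above. It assumes the prefix is followed by whitespace and then the test identifier; the exact title grammar, the aggregate-issue format, and the names `parseDisableTitle` and `shouldAutoClose` are not taken from the source.

```typescript
// Hypothetical sketch; the real validation lives in verifyDisableTestIssueBot.ts.
type DisableKind = "DISABLED" | "UNSTABLE";

interface DisableRequest {
  kind: DisableKind;
  target: string; // whatever follows the prefix in the title
}

// Returns the parsed request, or null when the title has no recognized prefix.
function parseDisableTitle(title: string): DisableRequest | null {
  const match = /^(DISABLED|UNSTABLE)\s+(\S.*)$/.exec(title.trim());
  if (match === null) return null;
  return { kind: match[1] as DisableKind, target: match[2] };
}

// A recognized disable request from an unauthorized author is closed.
function shouldAutoClose(
  request: DisableRequest | null,
  authorIsAuthorized: boolean
): boolean {
  return request !== null && !authorIsAuthorized;
}
```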

### 9. stripApprovalBot.ts

**Primary Purpose:** Removes PR approvals when PRs are reopened to ensure fresh review.

**Key Features:**

- **Approval dismissal**: Automatically dismisses all existing approvals on PR reopening
- **Permission-based**: Only acts on PRs from users without write permissions
- **Notification messages**: Provides clear explanation for why approvals were removed
- **Security-focused**: Ensures that reopened PRs (potentially after reverts) get fresh review

**GitHub Webhooks:**

- `pull_request.reopened`

**Special Logic:** Maintains code review integrity by requiring fresh approvals after PR reopening
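
The dismissal rule above can be sketched as a pure function over a PR's reviews. `ReviewSummary` and `approvalsToDismiss` are hypothetical names; `"APPROVED"` is the review state string GitHub's API uses for approvals.

```typescript
// Hypothetical sketch of which reviews to dismiss when a PR is reopened.
interface ReviewSummary {
  id: number;
  state: string; // e.g. "APPROVED", "CHANGES_REQUESTED", "COMMENTED"
}

// Authors with write permissions keep their approvals; for everyone else,
// every approving review is returned for dismissal.
function approvalsToDismiss(
  reviews: ReviewSummary[],
  authorHasWritePermission: boolean
): number[] {
  if (authorHasWritePermission) return [];
  return reviews
    .filter((review) => review.state === "APPROVED")
    .map((review) => review.id);
}
```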

### 10. codevNoWritePermBot.ts

**Primary Purpose:** Notifies Phabricator/Codev users when they need GitHub write permissions for CI.

**Key Features:**

- **Differential detection**: Recognizes PRs exported from Phabricator (Differential Revision markers)
- **Permission checking**: Verifies if the author has write permissions
- **Helpful messaging**: Provides links to internal documentation for getting permissions
- **Repository filtering**: Only operates on pytorch/pytorch repository

**GitHub Webhooks:**

- `pull_request.opened`

**Special Logic:** Bridges the gap between internal Facebook/Meta development workflow and external GitHub CI requirements
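
As an illustration of the detection and filtering steps above, the sketch below looks for a `Differential Revision` marker in the PR body. The exact marker format the bot matches is an assumption (here, `Differential Revision:` followed by an optional `[` and a `D` number), as are the function names.

```typescript
// Hypothetical sketch; the real check lives in codevNoWritePermBot.ts.

// Assumed marker format: "Differential Revision: D123" or
// "Differential Revision: [D123](...)".
function isExportedFromPhabricator(prBody: string): boolean {
  return /Differential Revision:\s*\[?D\d+/.test(prBody);
}

// Notify only for pytorch/pytorch PRs that carry the marker and whose author
// lacks write permissions.
function shouldNotifyCodevAuthor(
  repoFullName: string,
  prBody: string,
  authorHasWritePermission: boolean
): boolean {
  return (
    repoFullName === "pytorch/pytorch" &&
    isExportedFromPhabricator(prBody) &&
    !authorHasWritePermission
  );
}
```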

### 11. drciBot.ts

**Primary Purpose:** Manages Dr. CI (Diagnostic CI) comments that provide comprehensive PR status information.

**Key Features:**

- **Status aggregation**: Creates/updates comprehensive status comments on PRs
- **Integration with DrCI utilities**: Leverages external DrCI infrastructure
- **PR state tracking**: Only operates on open PRs
- **URL integration**: Links to external Dr. CI dashboard

**GitHub Webhooks:**

- `pull_request.opened`, `pull_request.synchronize`

**Special Logic:** Serves as the interface between GitHub PRs and the comprehensive Dr. CI dashboard system

### 12. webhookToDynamo.ts

**Primary Purpose:** Logs GitHub webhook events to DynamoDB tables for analytics and auditing.

**Key Features:**

- **Comprehensive logging**: Captures workflow runs, jobs, issues, PRs, comments, and reviews
- **Structured storage**: Organizes data into specific DynamoDB tables by event type
- **Key prefixing**: Prevents conflicts by prefixing keys with repository information
- **Label tracking**: Special handling for label events with timestamp tracking
- **UUID generation**: Uses UUIDs for events that don't have natural unique identifiers

**GitHub Webhooks:**

- `workflow_job`, `workflow_run`, `issues`, `issue_comment`
- `pull_request`, `pull_request_review`, `pull_request_review_comment`, `push`

**Special Logic:** Forms the foundation of the analytics and monitoring infrastructure by persisting all relevant GitHub events
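
Key prefixing, as described above, keeps identical IDs from different repositories from colliding in one table. The `repo/id` key format and the `prefixedDynamoKey` name below are assumptions for illustration; the actual key layout is defined in `webhookToDynamo.ts`.

```typescript
// Hypothetical key format: "<owner>/<repo>/<id>".
function prefixedDynamoKey(repoFullName: string, eventId: string | number): string {
  return repoFullName + "/" + eventId;
}
```

With this scheme, issue `42` in two repositories maps to two different keys, so one record cannot overwrite the other.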

## External Integrations

### Data Storage

- **DynamoDB**: Event logging, bot action tracking (`lib/bot/pytorchbotLogger.ts:4`)
- **ClickHouse**: CI analytics, flaky test data queries (`lib/bot/pytorchBotHandler.ts:5`)

### CI Systems

- **GitHub Actions**: Workflow triggering via repository dispatch events
- **CircleCI**: Parameter-based workflow triggering
- **Dr. CI**: Comprehensive status dashboard integration

### Configuration Management

- **Repository Configs**: `.github/pytorch-probot.yml` files (`lib/bot/utils.ts:64`)
- **Cached Config Tracker**: Performance optimization for config loading (`lib/bot/utils.ts:46`)
- **Label Subscriptions**: Issue-based user subscription management

## Key Features

### Intelligent Merge System

- **Approval Validation**: Reviews from COLLABORATOR+ required for PyTorch repos
- **Force Merge**: Admin-only with audit trail and reason requirement
- **CI Flow Labels**: Automatic trunk/pull label management
- **Branch Targeting**: Supports viable/strict and main branch merging

### Smart Retry Logic (`retryBot.ts`)

- **Flakiness Analysis**: Queries historical data to identify infrastructure failures
- **Selective Retrying**: Only retries jobs likely to succeed on retry
- **Branch-specific Rules**: Different behavior for main vs. feature branches

### Permission-based Security

- **Multi-tier Authorization**: Different permission levels for different actions
- **First-time Contributor Handling**: Restricted permissions for new contributors
- **Audit Logging**: All bot actions logged to DynamoDB

### Auto-labeling Intelligence

- **File Pattern Matching**: Assigns module labels based on changed files
- **CI Flow Detection**: Automatic ciflow/\* label assignment
- **Release Note Categorization**: Automated release note classification

## Data Flow

1. **GitHub Webhook** → **Probot App** → **Bot Module Router**
2. **Command Parsing** → **Permission Validation** → **Action Execution**
3. **External API Calls** (GitHub, CircleCI, ClickHouse)
4. **Event Logging** (DynamoDB) + **Response** (GitHub reactions/comments)

## Integration Architecture

These bots work together as a cohesive CI/CD and development workflow system:

- **Permission System**: Multiple bots check `hasWritePermissions` and `hasApprovedPullRuns` for security
- **Configuration Management**: Many bots use `CachedConfigTracker` for repository-specific settings
- **Event Coordination**: Bots respond to related events (e.g., label changes trigger multiple bots)
- **Data Analytics**: Bots log events and actions to DynamoDB, and bots such as retryBot query ClickHouse for CI data when making decisions
- **External Integrations**: Connect GitHub to CircleCI, Dr. CI dashboard, and internal Meta systems

## Deployment Context

- **Platform**: Vercel (Next.js)
- **Framework**: Probot (GitHub Apps framework)
- **Language**: TypeScript with modern ES modules
- **Monitoring**: DynamoDB logging + external Dr. CI dashboard

## Configuration Files

- `Constants.ts:1` - Cherry-pick and revert classifications
- `subscriptions.ts:1` - Label subscription parsing utilities
- Repository-specific configs loaded via `CachedConfigTracker`

Together, these bots automate the PyTorch development workflow, from labeling and CI triggering to merges and reverts, while gating sensitive actions behind permission checks.