An AI-powered computer control system that uses computer vision and natural language processing to understand and interact with your computer's interface automatically. The system can identify UI elements, process natural language commands, and perform automated actions.
- Screen Element Detection: Computer vision using the SOM model for accurate UI element identification
- OCR Integration: Uses PaddleOCR for robust text detection and recognition
- Natural Language Processing: GPT-4-mini powered element extraction for understanding user commands
- Automated Control: Intelligent cursor movement and click actions based on visual understanding
- Real-time Processing: Continuous screenshot analysis and interaction
- Interactive CLI: User-friendly command-line interface with rich output formatting
- Python 3.11 or higher
- CUDA-compatible GPU (recommended) or CPU
- PyAutoGUI for computer control
- Required AI models:
  - SOM model for element detection
  - Caption model processor
  - GPT-4-mini for command processing
  - PaddleOCR for text recognition
- Clone the repository:
git clone https://github.com/yourusername/computer-control.git
cd computer-control
- Create a virtual environment and activate it:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package with development dependencies:
pip install -e ".[dev]"
Key dependencies include:
- pyautogui: Computer control operations
- Pillow: Image processing
- rich: Terminal output formatting
- langchain: LLM integration
- langchain_openai: OpenAI model integration
- paddleocr: Text detection and recognition
- Start the computer control system:
python -m computer_control
The system will:
- Take periodic screenshots of your screen
- Process and analyze UI elements
- Wait for your natural language commands
- Execute the requested actions automatically
Example commands:
- "Click on the start menu"
- "Open the app folder"
- "Go to settings"
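The usage flow above (capture a screenshot, analyze elements, wait for a command, execute the action) can be sketched as a simple loop. This is a minimal illustration, not the project's actual controller: `Element`, `run_loop`, and the four injected callables are hypothetical names, and the stubs at the bottom stand in for real screenshot capture, detection, and clicking.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Element:
    """A detected UI element (hypothetical structure for this sketch)."""
    label: str
    x: int
    y: int


def run_loop(
    commands: Iterable[str],
    capture: Callable[[], object],
    detect: Callable[[object], List[Element]],
    resolve: Callable[[str, List[Element]], Optional[Element]],
    click: Callable[[int, int], None],
) -> List[str]:
    """Handle each command against a fresh screenshot and log the outcome."""
    log: List[str] = []
    for command in commands:
        screenshot = capture()               # 1. take a screenshot
        elements = detect(screenshot)        # 2. analyze UI elements
        target = resolve(command, elements)  # 3. interpret the command
        if target is None:
            log.append(f"no match: {command}")
            continue
        click(target.x, target.y)            # 4. execute the action
        log.append(f"clicked: {target.label}")
    return log


# Stubbed demo: no real screen access, clicks are only recorded.
clicks = []
demo_elements = [Element("Start", 20, 1060), Element("Settings", 400, 300)]
log = run_loop(
    ["Click on the start menu", "Open the trash"],
    capture=lambda: None,
    detect=lambda _img: demo_elements,
    resolve=lambda cmd, els: next(
        (e for e in els if e.label.lower() in cmd.lower()), None
    ),
    click=lambda x, y: clicks.append((x, y)),
)
```

Passing the capture, detection, and click steps in as callables keeps the loop testable without a display; a real setup would presumably plug in PyAutoGUI and the model calls instead.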
Screenshot Processing
- Automated screenshot capture
- Image preprocessing and conversion
- Dynamic scaling based on screen resolution
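As an illustration of dynamic scaling, the sketch below shrinks a screenshot so its longest side fits a limit, then maps points found in the scaled image back to screen pixels. The function names and the `max_side=1280` default are assumptions made for this example, not values taken from the project.

```python
from typing import Tuple


def scale_factor(screen_size: Tuple[int, int], max_side: int = 1280) -> float:
    """Factor that shrinks the longest side to max_side (never upscales)."""
    return min(1.0, max_side / max(screen_size))


def scaled_size(screen_size: Tuple[int, int], factor: float) -> Tuple[int, int]:
    """Size of the screenshot after applying the scale factor."""
    width, height = screen_size
    return round(width * factor), round(height * factor)


def to_screen(point: Tuple[float, float], factor: float) -> Tuple[int, int]:
    """Map a point in the scaled image back to screen pixels."""
    x, y = point
    return round(x / factor), round(y / factor)


factor = scale_factor((2560, 1440))        # 0.5 for a 2560x1440 screen
small = scaled_size((2560, 1440), factor)  # (1280, 720)
screen_pt = to_screen((100, 50), factor)   # (200, 100)
```

With Pillow, the resize itself would be something like `image.resize(small)`; the inverse mapping matters because clicks must land in screen coordinates, not scaled-image coordinates.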
Element Detection
- SOM model-based element identification
- OCR text detection with PaddleOCR
- Coordinate mapping for precise interaction
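Coordinate mapping can be sketched as turning a detected bounding box into a click point. The example assumes the detector returns boxes as normalized `(x1, y1, x2, y2)` fractions of the screen; that format, and the name `bbox_center`, are assumptions for illustration rather than the project's confirmed output.

```python
from typing import Tuple


def bbox_center(
    bbox: Tuple[float, float, float, float], screen_w: int, screen_h: int
) -> Tuple[int, int]:
    """Pixel center of a normalized (x1, y1, x2, y2) box, clamped to the screen."""
    x1, y1, x2, y2 = bbox
    cx = (x1 + x2) / 2 * screen_w
    cy = (y1 + y2) / 2 * screen_h
    # Clamp so a box touching or exceeding the edge still yields a valid pixel.
    px = min(max(int(cx), 0), screen_w - 1)
    py = min(max(int(cy), 0), screen_h - 1)
    return px, py


center = bbox_center((0.25, 0.5, 0.75, 1.0), 1920, 1080)  # (960, 810)
```

Clicking the center rather than a corner gives the most tolerance for small detection errors.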
Command Processing
- Natural language understanding with GPT-4-mini
- Element matching and validation
- Coordinate calculation for cursor movement
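The element matching and validation step can be illustrated with a simple stand-in. The project uses GPT-4-mini to extract the target from a command; the sketch below deliberately swaps that for standard-library fuzzy matching (`difflib.SequenceMatcher`) so it runs offline. The name `match_element` and the `0.6` threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def match_element(
    target: str, labels: List[str], threshold: float = 0.6
) -> Optional[int]:
    """Index of the label most similar to target, or None if below threshold."""
    best_idx, best_score = None, 0.0
    for i, label in enumerate(labels):
        score = SequenceMatcher(None, target.lower(), label.lower()).ratio()
        if score > best_score:
            best_idx, best_score = i, score
    # Validation: reject weak matches instead of clicking the wrong element.
    return best_idx if best_score >= threshold else None


labels = ["Start", "Settings", "App Folder"]
exact = match_element("settings", labels)   # index of "Settings"
typo = match_element("setings", labels)     # still resolves to "Settings"
missing = match_element("xyzzy", labels)    # no label is close enough
```

Returning `None` for weak matches lets the caller ask the user to rephrase rather than act on a guess.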
Action Execution
- Smooth cursor movement
- Click action verification
- Error handling and recovery
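The action execution ideas above can be sketched with two small helpers: an eased cursor path and a click that is retried until a verification check passes. Both `ease_path` and `click_with_retry` are hypothetical names, and the `click` and `verify` callables are injected so the example runs without touching the real cursor.

```python
import time
from typing import Callable, List, Tuple

Point = Tuple[int, int]


def ease_path(start: Point, end: Point, steps: int = 20) -> List[Point]:
    """Points from start to end using smoothstep easing (slow, fast, slow)."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        s = t * t * (3 - 2 * t)  # smoothstep easing curve
        points.append((round(x0 + (x1 - x0) * s), round(y0 + (y1 - y0) * s)))
    return points


def click_with_retry(
    click: Callable[[], None],
    verify: Callable[[], bool],
    attempts: int = 3,
    delay: float = 0.0,
) -> int:
    """Click until verify() passes; return the attempt number that succeeded."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            click()
            if verify():
                return attempt
        except Exception as exc:  # sketch only: treat any error as a failed try
            last_error = exc
        time.sleep(delay)
    raise RuntimeError(f"click not verified after {attempts} attempts") from last_error


path = ease_path((0, 0), (100, 50), steps=4)

state = {"clicks": 0}
def fake_click() -> None:
    state["clicks"] += 1

# The stubbed verification only passes after the second click.
succeeded_on = click_with_retry(fake_click, lambda: state["clicks"] >= 2)
```

PyAutoGUI's own `pyautogui.moveTo(x, y, duration=...)` already animates the cursor, so a hand-rolled path like this is mainly useful when custom easing or per-step checks are needed.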
- Run tests:
pytest
- Format code:
black . && isort .
- Type checking:
mypy src tests
- Lint code:
ruff check .
computer-control/
├── src/
│   └── computer_control/
│       ├── core/                    # Core control and processing logic
│       │   └── controller.py        # Main controller implementation
│       ├── models/                  # AI model implementations
│       │   └── element_extractor.py # Element extraction logic
│       ├── utils/                   # Utility functions
│       │   └── vision.py            # Computer vision utilities
│       └── __main__.py              # Entry point
├── tests/                           # Test suite
├── pyproject.toml                   # Project configuration
└── README.md                        # This file
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests and linting
- Submit a pull request
MIT License - See LICENSE file for details.
- SOM model for element detection
- Florence-2 caption model
- OpenAI for GPT-4-mini
- PaddlePaddle team for PaddleOCR