Commit

Release ProactiveAgent

luyaxi committed Sep 19, 2024
1 parent fc30015 commit 50940be
Showing 112 changed files with 44,772 additions and 0 deletions.
27 changes: 27 additions & 0 deletions .gitignore
@@ -0,0 +1,27 @@
gym/history/*
**/config.toml
**/cache
**/__pycache__/**
config_private.yaml
annotation/*
**/.DS_Store
dataset/config.toml
export_simulation/*
agent/appid.txt
eval/config.yaml
rm_result.json
eval/code_1_rewarded.json
**/log/**
**/judged*/**
ablations/**
**/private.toml
.vscode
test.jsonl
dataset/annotation/result
dataset/analyse_scenes.ipynb
dataset/agent_data
events.log
gym/cfg_test.yaml
dataset/analyse_scenes.ipynb
dataset/agent_data/**
agent_data.zip
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "envs/aw-watcher-web"]
path = envs/aw-watcher-web
url = https://github.com/luyaxi/aw-watcher-web.git
180 changes: 180 additions & 0 deletions README.md
@@ -0,0 +1,180 @@
<div align="center">
<h1> Proactive Agent </h1>
</div>

<div align="center">

![Dialogues](https://img.shields.io/badge/Current\_Event\_Size-6790-red?style=flat-square)

</div>

<p align="center">
<a href="#overview">Model</a> •
<a href="#data">Data Release</a> •
<a href="#usage">Usage</a> •
<a href="#citation">Citation</a>

</p>

</div>

This project (Proactive Agent) aims to build a fully proactive agent that anticipates the user's needs and takes the initiative, offering assistance and suggesting actions without explicit requests from the user. We achieve this by developing a data collection and generation pipeline, building an automatic evaluator, and training the agent on the generated data. For now, we provide the full collection and generation pipeline, the datasets, the corresponding evaluation scripts, and the prompts used to fine-tune an LLM into a proactive agent.

*Read this in [中文](README_zh.md).*

## Overview

✨Here is an overview of the whole process of Proactive Agent.

<br>
<div align="center">
<img src="assets/overall_pipeline.png" width="800px">
</div>
<br>


✨✨Features:
- **Environment Sensing**: We provide scripts that collect environment scenes and user activities through Activity Watcher, and automatically recommend tasks based on the model.
- **Assistance Annotation**: We provide a platform for annotating the responses generated by the proactive agent, which helps align its results with human annotators.
- **Dynamic Generation**: We provide a dynamic pipeline for generating new data, in which user feedback affects subsequent events.
- **Construction Pipeline**: We provide a generation pipeline consisting of the **Environment Gym**, the **Proactive Agent**, and the **Reward Model**, where our reward model reaches a `0.918` F1 score on the test set.
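
To make the pipeline concrete, here is a minimal, illustrative sketch of how such a generation loop could be wired together. The object interfaces used here (`gym.step`, `agent.propose`, `reward_model.judge`, `gym.feedback`) are hypothetical placeholders, not the actual interfaces of this repository.

```python
# Illustrative sketch only: the Gym -> Agent -> Reward Model generation loop.
# All method names below are hypothetical, not the repository's real API.
from dataclasses import dataclass


@dataclass
class Event:
    description: str             # an observed environment/user event
    proposal: str | None = None  # the agent's proposed assistance, if any
    accepted: bool = False       # whether the reward model accepted the proposal


def run_generation_loop(gym, agent, reward_model, num_steps: int = 10) -> list[Event]:
    """Generate events, let the agent propose help, and score proposals with the reward model."""
    history: list[Event] = []
    for _ in range(num_steps):
        event = Event(description=gym.step(history))              # the gym emits the next event
        event.proposal = agent.propose(event, history)            # the agent may propose a task
        if event.proposal is not None:
            event.accepted = reward_model.judge(event, history)   # accept or reject the proposal
        gym.feedback(event)  # feed the outcome back so it can shape later events
        history.append(event)
    return history
```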

A demo is also provided to show the performance of our agent.

https://github.com/user-attachments/assets/4a9152e8-15ee-4bdf-a917-52a9e6b4f375

In the future, we will continually improve the data quality and increase the coverage of real-world scenarios.

## Data

👐 At present, Proactive Agent is intended only for coding, writing, and daily-life scenarios and should not be construed as reflecting the opinions or views of the creators, owners, or contributors of this dataset. It is distributed under the Apache License 2.0. Below are the statistics of the data:

| Settings | Coding | Writing | Daily Life | Total |
|:-------------:|:------:|:-----:|:-----:|:-----:|
| Instance Num | 23 | 23 | 22 | 68 |
| Event Num | 2275 | 2354 | 2161 | 6790 |

All the training instances for the Proactive Agent were generated from our [GYM](gym/README.md).
We use [Activity Watcher](https://activitywatch.net/) to collect human traces across all the scenes and annotate a test set to validate the effectiveness of the Proactive Agent.
More details about the data collection and annotation can be found [here](dataset/README.md).

## 📦 Installation

Clone this repository and navigate to the ProactiveAgent folder:
```bash
git clone git@github.com:OpenBMB/ProactiveAgent
cd ProactiveAgent
```

Install the required packages (Python >= 3.10):
```bash
pip install -r requirements.txt
```

### Install Activity Watcher

- You can go to the [Official Website](https://activitywatch.net/downloads/) to download the main app based on your operating system.
- A browser extension for Chrome is provided at `./agent/resource/aw-watcher-web.zip`. To install it, unzip the file first.
- For Edge users, go to `edge://extensions/`, enable developer mode, and load the extension by clicking `Load unpacked`.
- For Google Chrome users, go to `chrome://extensions/`, enable developer mode, and select `Load unpacked` to load the unzipped extension.
- This extension has not been tested on `Safari`.
- There is an official extension for VS Code users; you can download it from the [marketplace](https://marketplace.visualstudio.com/items?itemName=activitywatch.aw-watcher-vscode) or search for `aw-watcher-vscode` in the VS Code extensions panel and install it.

To check whether the installation is complete, open your browser and go to `http://localhost:5600/#/timeline` to verify that four traces are displayed in the window (`afk`, `vscode`, `window`, `web`).
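
If you prefer a scripted check, the snippet below queries ActivityWatch's local REST API for the registered buckets; it assumes the default server address `http://localhost:5600` and the `requests` package, and the exact bucket names depend on your watchers and hostname.

```python
# Sanity check, assuming the default ActivityWatch server at http://localhost:5600.
# The /api/0/buckets/ endpoint lists the registered watcher buckets as JSON.
import requests

resp = requests.get("http://localhost:5600/api/0/buckets/", timeout=5)
resp.raise_for_status()
buckets = resp.json()

print(f"Found {len(buckets)} buckets:")
for bucket_id in buckets:
    print(" -", bucket_id)
# Expect buckets for the afk, window, web and vscode watchers,
# e.g. "aw-watcher-afk_<hostname>" and "aw-watcher-window_<hostname>".
```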


## 🚀 Usage

### Configuration
You should first configure the `private.toml` file. The example is given in `example_config.toml`:

```bash
cp example_config.toml private.toml
```

You should change the `default_completions_model`, `api_key` and `base_url` to your own settings.
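
As an illustration of how these settings might be consumed, the sketch below reads `private.toml` and builds an OpenAI-compatible client from it. The flat key layout is an assumption made for this example; follow the structure of `example_config.toml` for the real file.

```python
# Minimal sketch: read private.toml and create an OpenAI-compatible client.
# The flat key layout is assumed for illustration; mirror example_config.toml in practice.
try:
    import tomllib  # Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # Python 3.10: pip install tomli

from openai import OpenAI

with open("private.toml", "rb") as f:
    config = tomllib.load(f)

client = OpenAI(api_key=config["api_key"], base_url=config["base_url"])
reply = client.chat.completions.create(
    model=config["default_completions_model"],
    messages=[{"role": "user", "content": "Say hi if the configuration works."}],
)
print(reply.choices[0].message.content)
```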


### Running the Proactive Agent
To experience our proactive agent, first enter the `./agent` folder and then follow the instructions [here](agent/README.md).


### Connect the Reward Model
To improve the experience with the Proactive Agent, you can use our trained reward model to filter the messages from the Proactive Agent.
Here are the steps to connect the reward model with the Proactive Agent:
__TO BE UPDATED__


### Interact with the Proactive Agent
Our agent makes proposals by creating a toast notification in a window. To interact with the proactive agent, you may choose to:
- Accept the proposal: click on the toast body (Windows) or click the button (macOS) to let the agent know you accept its idea; the agent will take the relevant actions in return.
- Reject the proposal: click the dismiss button (the x at the top right of the toast) to let the agent know you **reject** the proposal; the agent will try to propose in a different way in the next turn.
- Ignore the proposal: do nothing. The agent will remove the toast after an interval; doing nothing lets the agent know you are busy and ignored the proposal, and it will make fewer proposals in the following turns.
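
For illustration, the sketch below shows one way these three reactions could be mapped to feedback for the agent; the names and the adjustment policy are hypothetical and do not reflect the repository's actual implementation.

```python
# Illustration only: mapping the three user reactions to feedback signals.
# The enum and the adjustment policy are hypothetical, not the repository's implementation.
from enum import Enum


class UserResponse(Enum):
    ACCEPTED = "accepted"  # toast body (Windows) or button (macOS) clicked
    REJECTED = "rejected"  # dismiss button clicked
    IGNORED = "ignored"    # toast expired without any interaction


def adjust_proposal_rate(response: UserResponse, rate: float) -> float:
    """Return an updated proposal rate after one turn of user feedback."""
    if response is UserResponse.IGNORED:
        return max(0.1, rate * 0.5)  # the user seems busy, so propose less often
    return rate  # accepted: act on the task; rejected: keep the rate but propose differently
```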


## 📊 Model Results
To automatically evaluate the performance of the Proactive Agent, we build a reward model on our annotated data to act as the judge.
Our reward model reaches a `0.918` F1 score on the test set, which indicates it is a reliable judge of the Proactive Agent's performance.

### Reward Model Experiment Results
We test the agreement between the reward model and human annotators on the test set:
- **Missed-Need (MN)**: The scenario where the user needs help but the agent does not provide it.
- **Non-Response (NR)**: The scenario where the user does not need help and the agent does not offer any.
- **Correct-Detection (CD)**: The scenario where the user needs help and the agent provides it.
- **False-Alarm (FA)**: The scenario where the user does not need help but the agent offers it.
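
As a reference for how these categories relate to the summary metrics below, here is a rough scoring sketch over paired human/reward-model judgements, treating "help needed" as the positive class; the actual evaluation script may aggregate differently.

```python
# Rough sketch: bucket paired judgements into MN/NR/CD/FA and derive summary metrics.
# "True" means the annotator/model judges that help is needed. The real evaluation
# script may aggregate differently; this only illustrates the definitions above.
from collections import Counter


def score_judgements(human: list[bool], model: list[bool]) -> dict[str, float]:
    counts = Counter()
    for h, m in zip(human, model, strict=True):
        if h and m:
            counts["CD"] += 1    # Correct-Detection: both say help is needed
        elif h and not m:
            counts["MN"] += 1    # Missed-Need: human says needed, model says not
        elif not h and m:
            counts["FA"] += 1    # False-Alarm: human says not needed, model says needed
        else:
            counts["NR"] += 1    # Non-Response: both say no help is needed
    tp, fp, fn, tn = counts["CD"], counts["FA"], counts["MN"], counts["NR"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(human) if human else 0.0
    return dict(counts) | {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```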

We compare the judgements of the reward model with those of human annotators, evaluating different LLMs and our model on the [test set](eval/README.md#reward-model-evaluation).
The results are as follows:

| | GPT-4o | GPT-4o-mini | LLaMa 3.1 8b | LLaMa 3.1 70b | ours |
|-------------------------|---------|-------------|--------------|----------------|---------|
| Missed-Need (MN) | 0.0333 | 0.5667 | 0.8000 | 0.3333 | 0.8000 |
| Non-Response (NR) | 1.0000 | 0.5667 | 0.3000 | 0.8333 | 0.8667 |
| Correct-Detection (CD) | 1.0000 | 0.8667 | 0.9667 | 1.0000 | 1.0000 |
| False-Alarm (FA) | 0.0000 | 0.3333 | 0.1333 | 0.0667 | 1.0000 |
| Accuracy | 0.5083 | 0.5833 | 0.5500 | 0.5583 | **0.9167** |
| Precision | 0.5042 | 0.5658 | 0.5429 | 0.5340 | **0.9032** |
| Recall | 1.0000 | 0.7167 | 0.6333 | 0.9167 | **0.9333** |
| F1 | 0.6704 | 0.6324 | 0.5846 | 0.6748 | **0.9180** |


### Proactive Agent Experiment Results
In the current experiments, we evaluate the performance of the Proactive Agent with our reward model.
We define the following outcomes:
- **True Positive (TP)**: Instances where the proactive agent correctly predicts a task that the reward model subsequently accepts.
- **False Positive (FP)**: Instances where the proactive agent predicts a task that the reward model does not accept.
- **True Negative (TN)**: Instances where the proactive agent correctly refrains from predicting a task, and the reward model also does not accept any task.
- **False Negative (FN)**: Instances where the proactive agent fails to predict a task that the reward model would have accepted if proposed.
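
To make the relation between these outcomes and the table columns explicit, here is a small sketch of the metric computation. Treating False-Alarm as FP / (TP + FP), i.e. 1 minus Precision, is an assumption on my part, though it is consistent with the reported numbers.

```python
# Sketch: derive the reported columns from the outcome counts.
# Assumption: False-Alarm = FP / (TP + FP), i.e. 1 - Precision, consistent with the table.
def proactive_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    false_alarm = fp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "Recall": round(recall, 4),
        "Precision": round(precision, 4),
        "Accuracy": round(accuracy, 4),
        "False-Alarm": round(false_alarm, 4),
        "F1-Score": round(f1, 4),
    }
```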

We report the performance of the Proactive Agent on the test set of the [ProactiveBench](eval/README.md).

| Model | Recall | Precision | Accuracy | False-Alarm | F1-Score |
|-----------------------------|---------|-----------|----------|-------------|-----------|
| Claude-3-Sonnet | 0.6321 | 0.5000 | 0.5330 | 0.5000 | 0.5583 |
| Claude-3.5-Sonnet | 0.9663 | 0.4195 | 0.4626 | 0.5805 | 0.5850 |
| GPT-4o-mini | 1.0000 | 0.3467 | 0.3524 | 0.6533 | 0.5149 |
| GPT-4o | 1.0000 | 0.4956 | 0.4978 | 0.5044 | __0.6627__ |
| LLaMA 3.1 8B | 0.9877 | 0.3571 | 0.3612 | 0.6429 | 0.5246 |
| LLaMA 3.1 8B Proactive | 0.9600 | 0.4550 | 0.4758 | 0.5450 | 0.6174 |
| Qwen2 7B Instruct | 1.0000 | 0.4361 | 0.4361 | 0.5639 | 0.6074 |
| Qwen2 7B Instruct Proactive | 1.0000 | 0.4978 | 0.5066 | 0.5022 | **0.6647** |



## Citation
If you find this project useful in your research, please consider citing it:
```
@misc{2024,
author = {OpenBMB},
title = {ProactiveAgent},
year = {2024},
publisher = {GitHub},
journal = {GitHub Repository},
howpublished = {\url{https://github.com/openBMB/ProactiveAgent}}
}
```

## Friendly Links
- [ChatDev](https://github.com/OpenBMB/ChatDev)
- [Activity Watcher](https://activitywatch.net/)
172 changes: 172 additions & 0 deletions README_zh.md
@@ -0,0 +1,172 @@
<div align="center">
<h1> Proactive Agent </h1>
</div>

<div align="center">

![Dialogues](https://img.shields.io/badge/Current\_Event\_Size-11876-red?style=flat-square)

</div>

<p align="center">
<a href="#概述">模型</a> •
<a href="#数据">数据公布</a> •
<a href="#使用">使用</a> •
<a href="#引用">引用</a>

</p>

</div>

This project (Proactive Agent) aims to build a fully proactive agent that anticipates the user's needs and takes the initiative, offering assistance and actions without explicit requests from the user. We achieve this through data collection and a generation pipeline, an automatic evaluator, and training on the generated data. For now, we provide the full collection and generation pipeline, the datasets, the corresponding evaluation scripts, and the prompts used to fine-tune an LLM.

*Read this document in [English](README.md).*

## Overview

✨ The figure below gives an overview of the overall pipeline of the Proactive Agent.

<br>
<div align="center">
<img src="assets/overall_pipeline.png" width="800px">
</div>
<br>


✨✨ Features:
- **Environment Sensing**: We collect environment information and user activity data through Activity Watcher, and automatically recommend tasks based on the model.
- **Assistance Annotation**: We provide a platform for annotating the responses generated by the proactive agent, which helps align its results with human annotators.
- **Dynamic Generation**: We provide a dynamic pipeline for generating new data, in which user feedback affects subsequent events.
- **Construction Pipeline**: We provide a generation pipeline consisting of the **Environment Gym**, the **Proactive Agent**, and the **Reward Model**, where our reward model reaches an agreement rate of $91\%$ with human annotators.

We also provide a simple demo showing the performance of our agent.

https://github.com/user-attachments/assets/4a9152e8-15ee-4bdf-a917-52a9e6b4f375

In the future, we will continue to improve the data quality and increase the coverage of real-world scenarios.


## Data

👐 At present, Proactive Agent is designed only for coding, writing, and daily-life scenarios and should not be taken as reflecting the opinions or views of its creators, owners, or contributors. It is distributed under the Apache License 2.0. Below are the statistics of the data:


| Settings | Coding | Writing | Daily Life | Total |
|:-------------:|:------:|:-----:|:-----:|:-----:|
| Instance Num | 23 | 23 | 22 | 68 |
| Event Num | 2275 | 2354 | 2161 | 6790 |

All the training instances for the Proactive Agent were generated from our [GYM](gym/README_zh.md).
We use [Activity Watcher](https://activitywatch.net/) to collect human traces across all the scenes and annotate a test set to validate the effectiveness of the Proactive Agent.
More details about the data collection and annotation can be found [here](dataset/readme_zh.md).


## 📦 Installation
Clone this repository and navigate to the ProactiveAgent folder:
```bash
git clone git@github.com:OpenBMB/ProactiveAgent
cd ProactiveAgent
```

Install the required packages (Python >= 3.10):
```bash
pip install -r requirements.txt
```

### Install Activity Watcher
- You can go to the [Official Website](https://activitywatch.net/downloads/) to download the app for your operating system.
- A browser extension is provided at `./agent/resource/aw-watcher-web.zip`. To install it, unzip the file first.
- For Edge users, go to `edge://extensions/`, enable developer mode, and load the extension by clicking `Load unpacked`.
- For Google Chrome users, go to `chrome://extensions/`, enable developer mode, and select `Load unpacked` to load the unzipped extension.
- This extension has not been tested on `Safari`.
- For VS Code users, there is an official extension; you can download it from the [marketplace](https://marketplace.visualstudio.com/items?itemName=activitywatch.aw-watcher-vscode) or search for `aw-watcher-vscode` in the VS Code extensions panel and install it.

To check whether the installation is complete, open your browser and go to `http://localhost:5600/#/timeline` to verify that four traces are displayed in the window (`afk`, `vscode`, `window`, `web`).

## 🚀 Usage

### Configuration
To use the Proactive Agent, you first need to configure the `private.toml` file. You can refer to `example_config.toml`:

```bash
cp example_config.toml private.toml
```

You need to change `default_completions_model`, `api_key`, and `base_url` to your own settings.


### Running the Proactive Agent
To experience our proactive agent, enter the `./agent` folder and follow the instructions [here](agent/README_zh.md).

### Connect the Reward Model
To improve the experience with the Proactive Agent, you can use our trained reward model to filter the messages from the Proactive Agent.
The following are the steps to connect the reward model with the Proactive Agent.
__Coming soon__

### Interact with the Proactive Agent
Our agent tries to offer help by creating a toast notification in a window. To interact with the proactive agent, you may choose to:
- Accept the proposal: click on the toast body (Windows) or click the button (macOS) to let the agent know you want its help; the agent will take the corresponding actions.
- Reject the proposal: click the dismiss button (the x at the top right of the toast) to let the agent know you **reject** its help; the agent will try to offer something different in the next turn.
- Ignore the proposal: do nothing. The agent will automatically remove the toast after the configured interval; doing nothing lets the agent know you are busy and ignored its help, and it will offer help less often afterwards.

## 📊 Model Results

To automatically evaluate the performance of the Proactive Agent, we build a reward model from our annotated data to act as the judge.
Our reward model reaches a `0.918` F1 score on the test set, a good indicator of the Proactive Agent's performance.

### Reward Model Experiment Results
We measure the agreement between the reward model and human annotators with the following four criteria:
- **Missed-Need (MN)**: the proportion of turns, out of all turns, where the human annotation says help is needed but the reward model says it is not.
- **Non-Response (NR)**: the proportion of turns where both the human annotation and the reward model say no help is needed.
- **Correct-Detection (CD)**: the proportion of turns where both the human annotation and the reward model say help is needed.
- **False-Alarm (FA)**: the proportion of turns where the human annotation says no help is needed but the reward model says it is.

We compare the judgements of the reward model with those of human annotators, and evaluate different LLMs and our model on the [test set](eval/README_zh.md#奖励模型评估).
The results are as follows:
| | GPT-4o | GPT-4o-mini | LLaMa 3.1 8b | LLaMa 3.1 70b | ours |
|-------------------------|---------|-------------|--------------|----------------|---------|
| Missed-Need (MN) | 0.0333 | 0.5667 | 0.8000 | 0.3333 | 0.8000 |
| Non-Response (NR) | 1.0000 | 0.5667 | 0.3000 | 0.8333 | 0.8667 |
| Correct-Detection (CD) | 1.0000 | 0.8667 | 0.9667 | 1.0000 | 1.0000 |
| False-Alarm (FA) | 0.0000 | 0.3333 | 0.1333 | 0.0667 | 1.0000 |
| Accuracy | 0.5083 | 0.5833 | 0.5500 | 0.5583 | **0.9167** |
| Precision | 0.5042 | 0.5658 | 0.5429 | 0.5340 | **0.9032** |
| Recall | 1.0000 | 0.7167 | 0.6333 | 0.9167 | **0.9333** |
| F1 | 0.6704 | 0.6324 | 0.5846 | 0.6748 | **0.9180** |

### Proactive Agent Experiment Results
In the current experiments, all LLMs are scored in the three scenarios and evaluated by the following criteria:
- **True Positive (TP)**: instances where the proactive agent predicts a task and the reward model accepts it.
- **False Positive (FP)**: instances where the proactive agent predicts a task but the reward model does not accept it.
- **True Negative (TN)**: instances where the proactive agent does not predict a task and the reward model does not expect one.
- **False Negative (FN)**: instances where the reward model expects a task but the proactive agent fails to predict it.

We report the performance of the Proactive Agent on the test set of [ProactiveBench](eval/README_zh.md).

| Model | Recall | Precision | Accuracy | False-Alarm | F1-Score |
|-----------------------------|---------|-----------|----------|-------------|-----------|
| Claude-3-Sonnet | 0.6321 | 0.5000 | 0.5330 | 0.5000 | 0.5583 |
| Claude-3.5-Sonnet | 0.9663 | 0.4195 | 0.4626 | 0.5805 | 0.5850 |
| GPT-4o-mini | 1.0000 | 0.3467 | 0.3524 | 0.6533 | 0.5149 |
| GPT-4o | 1.0000 | 0.4956 | 0.4978 | 0.5044 | __0.6627__|
| LLaMA 3.1 8B | 0.9877 | 0.3571 | 0.3612 | 0.6429 | 0.5246 |
| LLaMA 3.1 8B Proactive | 0.9600 | 0.4550 | 0.4758 | 0.5450 | 0.6174 |
| Qwen2 7B Instruct | 1.0000 | 0.4361 | 0.4361 | 0.5639 | 0.6074 |
| Qwen2 7B Instruct Proactive | 1.0000 | 0.4978 | 0.5066 | 0.5022 | **0.6647**|


## Citation
If you find this project useful in your research, please consider citing it:
```
@misc{2024,
author = {OpenBMB},
title = {ProactiveAgent},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/openBMB/ProactiveAgent}}
}
```

## Friendly Links
- [ChatDev](https://github.com/OpenBMB/ChatDev)
- [Activity Watcher](https://activitywatch.net/)