# Training Financial Deep Research Agent with RL using AgentScope-Tuner

## Overview

DeepFinance is a reinforcement learning training framework for financial deep research agents. Instead of relying on human-annotated "gold answers", it drives the model to autonomously explore optimal research strategies through a **multi-dimensional reward system** (evidence traceability × analytical sufficiency × readability).

## Task Setting

### Agent Goal

Given a financial research question (stock analysis / industry research / event interpretation / macro analysis / stock screening), the agent must:
- Call financial tools to collect real-world data
- Generate a Markdown research report with academic-style citations
- End the report with the `[TASK_COMPLETED]` marker

### Agent Type

The agent is implemented as a **ReActAgent**, following a two-phase deep research methodology (defined in `prompt/finance_analyst_prompt.md`):

**Phase 1: Outline First, Then Investigate**
1. Identify the query type
2. **Output a research outline first** (section headings + key questions per section) — no tool calls at this stage
3. Investigate section by section, summarizing after each round of tool calls

**Phase 2: Deep Analysis and Report Generation**
1. Generate a Markdown-format research report based on real data
2. If evidence gaps are found during writing, allow 1–2 additional rounds of tool calls
3. Append `[TASK_COMPLETED]` at the end of the report
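
The completion marker can be checked mechanically. A minimal sketch (the function name is illustrative, not from the repo):

```python
def is_report_complete(report: str) -> bool:
    """Return True if the report ends with the [TASK_COMPLETED] marker."""
    return report.rstrip().endswith("[TASK_COMPLETED]")
```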

> Why "plan first, then execute"? Letting the model freely explore in a complex tool environment typically leads not to "failing to call tools", but to "failing to form a complete research process" — the model grabs one piece of data and immediately starts local analysis, resulting in a loosely structured report. Requiring an outline first helps develop a stable research workflow and reduces ineffective exploration.

### Tool Environment

The agent communicates with the [Finance MCP](https://github.com/flowllm-ai/finance-mcp) service via MCP (Model Context Protocol), using **19 financial tools** (defined in `prompt/tool_prompt_builder.py`):
- **Entity & Computation**: entity extraction, A-share historical price calculation
- **General Capabilities**: DashScope search, Python/Shell code execution
- **THS Specialized Data**: company fundamentals, shareholders, financials, earnings forecasts, news & announcements, institutional holdings, and 13 other specialized queries

**Tool Call Conventions:**
- Call at most **3 tools** per round, investigating progressively over multiple rounds
- Summarize after each round of tool calls before deciding the next investigation direction

### Reward Design

The reward is split into **1 core objective + 3 constraints**:

| Role | Dimension | Code Module | Core Question |
| :--- | :--- | :--- | :--- |
| **Core** | Analytical Sufficiency (RM) | `judge/finance/` | Is the analysis thorough? Is the logic sound? |
| Constraint | Presentation Quality | `judge/presentation_quality/` | Is information easy to access? Good reader experience? |
| Constraint | Citation Grounding | `judge/grounding/` | Are key facts cited? Are citations real? |
| Constraint | Citation Audit | `judge/audit/` | Do citations truly support the claims? |

**Scoring (Extract First, Then Score)**: The LLM first extracts structured information from the report (citations, evidence relationships, etc.), then Python rules compute the scores. For example, the Audit grader only requires the LLM to classify each citation as Supported / Overstated / Contradicted / Hallucinated / Irrelevant, and the final score is computed by rule-based code.
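
To make "extract first, then score" concrete, here is a minimal sketch of how rule-based code could turn the five audit verdicts into a score. The verdict labels come from the description above, but the numeric weights are illustrative assumptions, not the repo's actual values:

```python
# Illustrative verdict weights -- the real values live in judge/audit/.
VERDICT_SCORES = {
    "Supported": 1.0,
    "Overstated": 0.5,
    "Irrelevant": 0.0,
    "Contradicted": 0.0,
    "Hallucinated": 0.0,
}

def audit_score(verdicts: list[str]) -> float:
    """Average the rule-based verdict scores over all citations."""
    if not verdicts:
        return 0.0  # No citations at all: nothing to credit.
    return sum(VERDICT_SCORES.get(v, 0.0) for v in verdicts) / len(verdicts)
```

The LLM's only job is producing the verdict list; the arithmetic is deterministic, which keeps the reward reproducible across judge runs.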

**Tool Call Penalty** (defined in `deep_finance_judge.py`):

| Tool Calls | Penalty |
| :--- | :--- |
| 0 calls | -1.0 |
| 1–2 calls | -0.5 |
| ≥ 3 calls | 0.0 (no penalty) |
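
The table maps directly to a small rule. A sketch (the actual implementation is in `deep_finance_judge.py`; the function name here is illustrative):

```python
def tool_call_penalty(num_tool_calls: int) -> float:
    """Penalize reports produced with too few tool calls (per the table above)."""
    if num_tool_calls == 0:
        return -1.0   # No real data collected at all
    if num_tool_calls <= 2:
        return -0.5   # Investigation too shallow
    return 0.0        # Sufficient tool usage: no penalty
```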

**Default Weights** (configurable in `deepfinance_tuner.sh`):
```bash
RM_WEIGHT=0.5                    # Analytical sufficiency (core objective)
PRESENTATION_QUALITY_WEIGHT=0.2  # Presentation quality
GROUNDING_WEIGHT=0.1             # Citation grounding
AUDIT_WEIGHT=0.2                 # Citation audit
```
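
Putting the weights and the tool-call penalty together, the final reward is plausibly a weighted sum of per-dimension scores plus the penalty. A hedged sketch (the function and dictionary names are illustrative, not from the repo):

```python
# Default weights, mirroring deepfinance_tuner.sh.
WEIGHTS = {
    "rm": 0.5,                    # Analytical sufficiency (core objective)
    "presentation_quality": 0.2,  # Presentation quality
    "grounding": 0.1,             # Citation grounding
    "audit": 0.2,                 # Citation audit
}

def fuse_reward(scores: dict[str, float], tool_penalty: float) -> float:
    """Weighted sum of per-dimension scores in [0, 1], plus the penalty."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) + tool_penalty
```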

## Code Implementation

### High-Level Overview

The implementation consists of three main components:
1. **Workflow** (`run_deep_finance`): ReActAgent + Finance MCP tool interaction loop
2. **Judge** (`deep_finance_judge`): Multi-dimensional evaluation engine, combining OpenJudge + rule-based scoring
3. **Entry** (`main.py`): Calls `tune()` to launch training

### Agent Workflow

`run_deep_finance` implements the agent–tool interaction loop:

```python
async def run_deep_finance(
    task: Dict[str, Any],
    model: OpenAIChatModel,
    auxiliary_models: Dict[str, OpenAIChatModel] | None = None,
) -> WorkflowOutput:
    # 1. Extract system prompt and user query
    sys_prompt, user_query = _extract_sys_and_user(task)

    # 2. Get Finance MCP toolkit (process-local singleton, lazily loaded)
    toolkit = await get_finance_mcp_toolkit()

    # 3. Create ReActAgent
    agent = ReActAgent(
        name="deep_finance_react",
        sys_prompt=sys_prompt,
        model=model,
        enable_meta_tool=False,
        formatter=OpenAIChatFormatter(),
        toolkit=toolkit,
    )

    # 4. Execute research task, timing the full run
    start_time = time.time()
    response = await agent.reply(msg=Msg("user", user_query, role="user"))
    total_time = time.time() - start_time

    # 5. Extract tool call statistics
    tool_stats = await extract_tool_stats_from_agent(agent, total_time)
    metrics = compute_single_tool_metrics(tool_stats)

    # 6. Convert the reply Msg into a plain dict and return it with the metrics
    response_dict = {"name": response.name, "role": response.role, "content": response.content}
    return WorkflowOutput(response=response_dict, metrics=metrics)
```

**Key Features:**
- MCP Toolkit is lazily loaded as a singleton per worker process, with built-in jitter to prevent thundering herd
- System prompt is dynamically generated from `prompt/finance_analyst_prompt.md` (injecting current date and tool list)
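
The per-process lazy singleton with jitter can be sketched as follows. This is an assumption about the mechanism, not the repo's code: `connect_toolkit()` stands in for the real MCP connection logic, and the jitter bound is an illustrative value.

```python
import asyncio
import random

_toolkit = None
_toolkit_lock = asyncio.Lock()

async def connect_toolkit():
    """Stand-in for the real Finance MCP connection; returns a dummy object."""
    return object()

async def get_finance_mcp_toolkit():
    """Lazily create one toolkit per worker process.

    The random sleep staggers connection attempts so that many workers
    starting at once do not stampede the MCP server (thundering herd).
    """
    global _toolkit
    async with _toolkit_lock:
        if _toolkit is None:
            await asyncio.sleep(random.uniform(0.0, 0.5))
            _toolkit = await connect_toolkit()
        return _toolkit
```

The lock ensures only one coroutine connects even if several workflow runs start concurrently in the same process; subsequent calls return the cached instance immediately.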

### Judge Function

`deep_finance_judge` uses `DeepFinanceJudgeEngine` for multi-dimensional evaluation:

```python
async def deep_finance_judge(
    task: Dict[str, Any],
    response: Any,
    auxiliary_models: Dict[str, ChatModelBase] | None = None,
) -> JudgeOutput:
    engine = _get_judge_engine()
    reward, metrics = await engine.evaluate_one(task=task, response=response)
    return JudgeOutput(reward=reward, metrics=metrics)
```

Evaluation flow:
1. Build conversation history from the response and convert it to OpenJudge format
2. Run multiple graders in parallel (Presentation Quality / Citation Grounding / Citation Audit)
3. Run the Finance RM (pairwise evaluation using a dedicated stronger model)
4. Fuse scores + tool call penalty → final reward
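
Step 2, running the constraint graders concurrently, can be sketched with `asyncio.gather`. The `grade` function below is a dummy stand-in for the real OpenJudge graders; only the concurrency pattern is the point:

```python
import asyncio

async def grade(dimension: str, report: str) -> float:
    """Dummy grader: a real one calls an LLM, then applies scoring rules."""
    return 1.0 if report.strip() else 0.0

async def run_constraint_graders(report: str) -> dict[str, float]:
    """Evaluate all constraint dimensions concurrently."""
    dims = ["presentation_quality", "grounding", "audit"]
    scores = await asyncio.gather(*(grade(d, report) for d in dims))
    return dict(zip(dims, scores))
```

Because each grader is dominated by LLM latency, running them concurrently keeps the judge's wall-clock cost close to that of the slowest single grader.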

### Launch Training with `tune()`

```python
from agentscope.tuner import tune

tune(
    workflow_func=run_deep_finance,
    judge_func=deep_finance_judge,
    config_path="config_template.yaml",
)
```

For training configuration, refer to [config_template.yaml](./config_template.yaml). For full configuration details, see the [Trinity-RFT Configuration Guide](https://agentscope-ai.github.io/Trinity-RFT/en/main/tutorial/trinity_configs.html).

## How to Run

### Dependencies

```bash
# Recommended: use conda or uv to manage virtual environments
conda create -n tune_example python=3.11
conda activate tune_example

# Install core dependencies
pip install agentscope vllm ray wandb

# Install OpenJudge
git clone https://github.com/agentscope-ai/OpenJudge.git
cd OpenJudge
pip install -e .
```

### Step 1: Install and Start Finance MCP Service

Finance MCP provides the financial tool suite (search, web crawling, THS data, etc.).

**Install:**
```bash
pip install finance-mcp
```

**Start the service (SSE mode):**
```bash
finance-mcp \
    config=default,ths,crawl \
    disabled_flows='["tavily_search","mock_search","react_agent"]' \
    mcp.transport=sse \
    mcp.port=8040
```

The service will be available at `http://<server_IP>:8040/sse` (use `127.0.0.1` for local, replace with actual IP for remote access).

**Required API Keys (configure as needed in `.env`):**

| Variable | Purpose |
|----------|---------|
| `DASHSCOPE_API_KEY` | DashScope search |
| `TUSHARE_API_TOKEN` | China A-share historical data |
| `TAVILY_API_KEY` | Tavily search (optional) |

### Step 2: Configure Environment Variables

Copy `tuner/deep_finance/.env.example`, rename it to `.env`, and place it in the project root:

```bash
# ==================== .env ====================
# API keys (for Judge scoring and external tools)
OPENJUDGE_API_KEY="sk-xxx"
OPENJUDGE_BASE_URL="https://dashscope.aliyuncs.com/compatible-mode/v1"

# Base model and environment paths
MODEL_PATH="/path/to/base_model"
CONDA_PATH="/path/to/conda/conda.sh"
CONDA_ENV="tune_example"

# Data and reference answer paths
DATA_PATH="/path/to/train_data_dir"
TRAIN_REF_ANS_PATH="/path/to/train_reference_answer.json"
VAL_REF_ANS_PATH="/path/to/val_reference_answer.json"

# Cluster config (set WORLD_SIZE to 1 for single-machine)
WORLD_SIZE=1
MASTER_ADDR="127.0.0.1"

# Finance MCP service URL
FINANCE_MCP_URL="http://127.0.0.1:8040/sse"
```
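
The `.env` file is plain `KEY=VALUE` lines. For illustration, a minimal stdlib parser of this format (a sketch only; real setups typically let the launch script or a library like python-dotenv load it):

```python
def load_env_file(path: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and '#' comments (sketch)."""
    values: dict[str, str] = {}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, raw = line.partition("=")
            # Strip surrounding single or double quotes from the value.
            values[key.strip()] = raw.strip().strip('"').strip("'")
    return values
```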

### Step 3: Launch Training

No need to manually edit Python or YAML files. The launch script `deepfinance_tuner.sh` dynamically generates `config_template.yaml` and automatically starts the Ray cluster.

```bash
bash deepfinance_tuner.sh
```

**Key training parameters (configurable in `deepfinance_tuner.sh`):**

| Shell Parameter | Tuner Parameter | Default | Description |
| :--- | :--- | :--- | :--- |
| `GROUP_SIZE` | `repeat_times` | 4 | Parallel rollout samples per query |
| `MAX_ENV_STEPS` | `max_env_steps` | 10 | Max agent-environment interaction rounds |
| `BATCH_SIZE` | `batch_size` | 64 | Global batch size |
| `OPENJUDGE_LLM` | `openjudge_llm` | qwen-flash | General model for OpenJudge scoring |
| `FINANCE_JUDGE_LLM` | `finance_judge_llm` | qwen-max | Stronger model for financial analysis depth evaluation |
| `ENGINE_NUM` | `engine_num` | Node // 2 | Number of vLLM async inference engines |
| `GPU_PER_NODE` | `gpu_per_node` | 8 | GPUs per node |

## Code Structure

```
deep_finance/
├── main.py                       # Entry: defines workflow and judge functions
├── deep_finance_judge.py         # Judge engine: multi-grader fusion + reward computation
├── config_template.yaml          # Tuner config template (dynamically generated by shell script)
├── deepfinance_tuner.sh          # Multi-node distributed launch script
├── deepfinance_tuner_single.sh   # Single-machine launch script
├── .env.example                  # Environment variable template
├── judge/
│   ├── finance/                  # RM: domain-routed pairwise evaluation
│   ├── presentation_quality/     # Presentation: 8-dimension rule-based scoring
│   ├── grounding/                # Grounding: coverage + authenticity
│   ├── audit/                    # Audit: 5-level verdict classification
│   └── traj_adapter.py           # Trajectory format normalization
├── metric_helper/
│   ├── reward_metric_helper.py   # Reward metrics aggregation
│   └── tool_metric_helper.py     # Tool call statistics
└── prompt/
    ├── finance_analyst_prompt.md # Agent system prompt (two-phase research flow)
    └── tool_prompt_builder.py    # Tool documentation generator (19 financial tools)
```