
Commit 8dfd443

xubinruipzp5700
andauthored
add new webagent page (#162)
Co-authored-by: asd765973346 <pzp0057@sjtu.edu.cn>
1 parent bc09cd3 commit 8dfd443

4 files changed

Lines changed: 748 additions & 2 deletions

File tree

docs/.vuepress/notes/en/guide.ts

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -121,7 +121,8 @@ export const Guide: ThemeNote = defineNoteConfig({
121121
"operator_qa",
122122
"operator_write",
123123
"pipeline_prompt",
124-
"pipeline_rec&refine"
124+
"pipeline_rec&refine",
125+
"web_collection"
125126
]
126127
},
127128
],

docs/.vuepress/notes/zh/guide.ts

Lines changed: 2 additions & 1 deletion
@@ -120,7 +120,8 @@ export const Guide: ThemeNote = defineNoteConfig({
120120
"operator_qa",
121121
"operator_write",
122122
"pipeline_prompt",
123-
"pipeline_rec&refine"
123+
"pipeline_rec&refine",
124+
"web_collection"
124125
]
125126
},
126127
// {
Lines changed: 372 additions & 0 deletions
@@ -0,0 +1,372 @@
1+
---
2+
title: Web Data Collection
3+
createTime: 2026/02/14 00:00:00
4+
permalink: /en/guide/agent/web_collection/
5+
---
6+
7+
## 1. Overview
8+
9+
**Web Collection Agent** is the intelligent data collection module in DataFlow-Agent, designed to automatically collect, process, and format training datasets from the internet. The system supports two data types:
10+
11+
- **PT (Pre-Training)**: Large-scale unlabeled corpora for model pre-training.
12+
- **SFT (Supervised Fine-Tuning)**: Structured instruction-response pairs for model fine-tuning.
13+
14+
The workflow is capable of:
15+
16+
1. **Web Search & Exploration**: Multi-layer BFS forest exploration strategy with LLM-driven URL filtering to automatically discover and locate target datasets.
17+
2. **Multi-Platform Download**: Supports HuggingFace, Kaggle, and direct web download, with an LLM deciding the order in which the download methods are tried.
18+
3. **Dual-Channel Parallel Collection**: WebSearch and WebCrawler pipelines run in parallel, providing richer data sources.
19+
4. **Adaptive Data Mapping**: An LLM generates Python mapping functions, checked by a triple-verification mechanism, to automatically convert heterogeneous data into the standard Alpaca format.
20+
21+
## 2. System Architecture
22+
23+
This workflow is orchestrated by `dataflow_agent/workflow/wf_web_collection.py`, which forms a directed graph with parallel branches and conditional loops. The overall process is divided into four phases: task analysis, data collection (parallel), data download, and data processing & mapping.
24+
25+
### 2.1 Task Analysis Phase
26+
27+
1. **Start Node**
28+
1. **Responsibility**: Initializes the workflow configuration, creates the download directory, and prepares the execution environment.
29+
2. **Input**: `state.request.target` (user's original requirement).
30+
3. **Output**: Initialized `user_query` and download directory.
31+
32+
2. **Task Decomposer**
33+
1. **Responsibility**: Uses an LLM to decompose complex user requirements into executable subtasks, capped at a maximum number of tasks (default 5).
34+
2. **Input**: User's original query.
35+
3. **LLM Thinking**: Analyzes the semantic meaning of the requirement and splits it into independent data collection subtasks.
36+
4. **Output**: `state.task_list`, for example:
37+
- Subtask 1: Collect NLP Q&A datasets
38+
- Subtask 2: Collect text classification datasets
39+
- Subtask 3: Collect image classification datasets
40+
41+
3. **Category Classifier**
42+
1. **Responsibility**: Determines whether the current task belongs to the PT or SFT type.
43+
2. **Input**: Current subtask name.
44+
3. **LLM Thinking**: Determines the data category based on the task description and generates a dataset background description.
45+
4. **Output**: `state.category` (`"PT"` or `"SFT"`) and `dataset_background`.
46+
5. **Fallback Mechanism**: When the LLM cannot determine the category, keyword matching is used. SFT keywords include: `["sft", "fine-tuning", "qa", "instruction", "chat", "dialogue"]`.
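
The keyword fallback can be illustrated with a minimal sketch. The function name `classify_category` and the rule "no keyword hit means PT" are assumptions for illustration; only the keyword list comes from the description above.

```python
from typing import Optional

# Keyword list quoted from the fallback description above.
SFT_KEYWORDS = ["sft", "fine-tuning", "qa", "instruction", "chat", "dialogue"]


def classify_category(task_name: str, llm_category: Optional[str] = None) -> str:
    """Return "PT" or "SFT", preferring a valid LLM answer over keywords."""
    if llm_category in ("PT", "SFT"):
        return llm_category
    lowered = task_name.lower()
    # Assumption: any keyword hit means SFT; everything else defaults to PT.
    if any(keyword in lowered for keyword in SFT_KEYWORDS):
        return "SFT"
    return "PT"


print(classify_category("Collect Chinese Q&A datasets for fine-tuning"))
```

Plain substring matching is deliberately simple; a real classifier may need word-boundary checks so that short keywords such as `qa` do not match unrelated words.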
47+
48+
### 2.2 Data Collection Phase (Parallel Execution)
49+
50+
After task analysis is complete, the system enters the `parallel_collection` parallel branch, simultaneously launching two collection pipelines: WebSearch and WebCrawler.
51+
52+
#### 2.2.1 WebSearch Node
53+
54+
WebSearch Node is the core data collection node of the system, implementing a complete web exploration and information extraction pipeline with the following core components:
55+
56+
1. **QueryGenerator**
57+
- **Responsibility**: Generates 3-5 diversified search queries based on the user's original requirement.
58+
- **Example**: Input `"Collect Python code generation datasets"`, output:
59+
- `"Python code generation dataset download"`
60+
- `"Python programming instruction dataset HuggingFace"`
61+
- `"code completion training data GitHub"`
62+
63+
2. **WebTools**
64+
- **search_web()**: Calls search engines (Tavily / DuckDuckGo / Jina) to obtain the initial URL list.
65+
- **read_with_jina_reader()**: Uses Jina Reader to crawl web page content and return structured Markdown-formatted text.
66+
67+
3. **Multi-Layer BFS Forest Exploration**
68+
- **Algorithm**: Adopts a Breadth-First Search (BFS) strategy to explore web links layer by layer. In each layer, Jina Reader is used to crawl page content, extract candidate URLs, and then URLSelector filters the most relevant links for the next layer.
69+
- **Key Parameters**:
70+
- `max_depth`: Maximum exploration depth (default 2)
71+
- `concurrent_limit`: Number of concurrent requests (default 10)
72+
- `topk_urls`: Number of URLs filtered per layer (default 5)
73+
- `url_timeout`: Request timeout (default 60 seconds)
74+
75+
4. **URLSelector**
76+
- **Responsibility**: Uses LLM to select the most relevant URLs from the candidate URL list based on the research objective.
77+
- **Filtering Strategy**: Analyzes URL relevance to the research objective, domain credibility, avoids duplicate content, and filters blocked domains.
78+
79+
5. **RAGManager**
80+
- **Responsibility**: Stores crawled web content into a vector database, supporting subsequent semantic retrieval and providing context for the SummaryAgent.
81+
82+
6. **SummaryAgent**
83+
- **Responsibility**: Generates specific download subtasks based on RAG-retrieved content.
84+
- **Output**: A structured subtask list, for example:
85+
```json
86+
{
87+
"type": "download",
88+
"objective": "Download Spider Text2SQL dataset",
89+
"search_keywords": ["spider dataset", "text2sql"],
90+
"platform_hint": "huggingface",
91+
"priority": 1
92+
}
93+
```
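
The layer-by-layer exploration described in the component list can be sketched as follows. This is a simplified illustration, not the project's implementation: `fetch_links` stands in for the Jina Reader crawl plus link extraction, `select_urls` stands in for the LLM-driven URLSelector, and treating `max_depth` as the number of layers visited is an assumption.

```python
from typing import Callable, List, Set


def bfs_explore(
    seed_urls: List[str],
    fetch_links: Callable[[str], List[str]],
    select_urls: Callable[[List[str], int], List[str]],
    max_depth: int = 2,
    topk_urls: int = 5,
) -> List[str]:
    """Visit pages layer by layer and return the URLs in visit order."""
    visited: Set[str] = set()
    order: List[str] = []
    frontier = list(seed_urls)
    for _ in range(max_depth):
        candidates: List[str] = []
        for url in frontier:
            if url in visited:
                continue
            visited.add(url)
            order.append(url)
            candidates.extend(fetch_links(url))
        # Drop duplicates and already-visited URLs while preserving order.
        fresh = [u for u in dict.fromkeys(candidates) if u not in visited]
        # Keep only the top-k candidates for the next layer.
        frontier = select_urls(fresh, topk_urls)
        if not frontier:
            break
    return order


# Toy link graph used instead of real web pages.
GRAPH = {"a": ["b", "c", "d"], "b": ["e"], "c": ["a"], "d": []}
visited_order = bfs_explore(
    ["a"],
    fetch_links=lambda url: GRAPH.get(url, []),
    select_urls=lambda urls, k: urls[:k],
    max_depth=2,
    topk_urls=2,
)
print(visited_order)
```

The sketch is sequential for clarity; the documented `concurrent_limit` and `url_timeout` parameters suggest the real crawl issues requests concurrently.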
94+
95+
#### 2.2.2 WebCrawler Node
96+
97+
WebCrawler Node specializes in extracting code blocks and technical content from web pages. It runs in parallel with WebSearch Node, providing richer data sources.
98+
99+
1. **Generate Search Queries**: Creates specialized search queries targeting code/technical content.
100+
2. **Search & Crawl**: Searches the web for URL lists and uses Jina Reader for concurrent page crawling.
101+
3. **Code Block Extraction**: Calls `extract_code_blocks_from_markdown` to extract code blocks from Markdown content.
102+
4. **Save Results**: Stores crawled results as `webcrawler_crawled.jsonl`.
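
The code block extraction step can be approximated with a regular expression over fenced Markdown blocks. The helper below, `extract_fenced_blocks`, is a hypothetical stand-in written for illustration; the behavior of the project's `extract_code_blocks_from_markdown` may differ.

```python
import re
from typing import Dict, List

# Build the fence marker programmatically to keep this example readable.
FENCE = "`" * 3
_PATTERN = re.compile(FENCE + r"([^\n`]*)\n(.*?)" + FENCE, re.DOTALL)


def extract_fenced_blocks(markdown: str) -> List[Dict[str, str]]:
    """Return each fenced block with its language tag (may be empty)."""
    blocks = []
    for match in _PATTERN.finditer(markdown):
        blocks.append(
            {
                "language": match.group(1).strip(),
                "code": match.group(2).rstrip("\n"),
            }
        )
    return blocks


sample = (
    "Intro\n"
    + FENCE + "python\nprint('hi')\n" + FENCE
    + "\nText\n"
    + FENCE + "\nls -l\n" + FENCE + "\n"
)
blocks = extract_fenced_blocks(sample)
print(blocks)
```

This handles only backtick fences of exactly three characters; nested or tilde fences would need a proper Markdown parser.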
103+
104+
### 2.3 Data Download Phase
105+
106+
**Download Node** performs the actual dataset downloads. It supports three download methods, and an LLM decides the order in which they are tried.
107+
108+
1. **DownloadMethodDecisionAgent (LLM Decision)**
109+
- **Responsibility**: Analyzes the best download method based on the task objective and outputs a priority list, e.g., `["huggingface", "kaggle", "web"]`.
110+
111+
2. **Try Each Download Method Sequentially**:
112+
- **HuggingFace**: Searches the HuggingFace Hub, has the LLM select the best-matching dataset, and downloads it via the API.
113+
- **Kaggle**: Searches Kaggle datasets, has the LLM select the best match, and downloads it through the Kaggle API.
114+
- **Web**: Uses WebAgent for intelligent web exploration and direct file download.
115+
116+
3. **Record Download Results**: Updates `state.download_results` with the download status and path for each dataset.
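
The "try each method in priority order" behavior can be sketched as a simple fallback loop. The function name, the result dictionary shape, and the stub downloaders are all assumptions for illustration; real downloaders would call the HuggingFace or Kaggle APIs.

```python
from typing import Callable, Dict, List, Optional

# A downloader takes the task objective and returns a local path, or None on failure.
Downloader = Callable[[str], Optional[str]]


def download_with_fallback(
    objective: str,
    priority: List[str],
    downloaders: Dict[str, Downloader],
) -> Dict[str, object]:
    """Try each method in priority order and stop at the first success."""
    attempts: List[str] = []
    for method in priority:
        downloader = downloaders.get(method)
        if downloader is None:
            continue  # Unknown method name in the priority list.
        attempts.append(method)
        try:
            path = downloader(objective)
        except Exception:
            path = None  # Treat any error as a failed attempt and move on.
        if path:
            return {"status": "success", "method": method, "path": path, "attempts": attempts}
    return {"status": "failed", "method": None, "path": None, "attempts": attempts}


def failing_downloader(objective: str) -> Optional[str]:
    raise RuntimeError("dataset not found")


# Stub downloaders standing in for the real platform clients.
stubs: Dict[str, Downloader] = {
    "huggingface": failing_downloader,
    "kaggle": lambda objective: "./kaggle_datasets/spider",
    "web": lambda objective: "./web_downloads/spider.zip",
}
result = download_with_fallback(
    "Download Spider Text2SQL dataset", ["huggingface", "kaggle", "web"], stubs
)
print(result)
```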
117+
118+
### 2.4 Data Processing & Mapping Phase
119+
120+
#### Postprocess Node
121+
122+
- **Responsibility**: Checks whether there are remaining incomplete subtasks (`check_more_tasks`). If so, loops back to the collection phase; otherwise, proceeds to the mapping phase.
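
The conditional loop can be pictured as a routing function that returns the name of the next phase. This is a hypothetical sketch: the function name, the returned labels, and the assumption that `current_task_index` points at the task that just finished are not taken from the project code.

```python
from typing import Dict, List


def route_after_postprocess(task_list: List[Dict], current_task_index: int) -> str:
    """Return the next phase for the conditional edge after post-processing."""
    # Assumption: current_task_index refers to the task that just finished.
    has_more = current_task_index + 1 < len(task_list)
    return "collection" if has_more else "mapping"


tasks = [{"name": "Collect NLP Q&A datasets"}, {"name": "Collect text classification datasets"}]
print(route_after_postprocess(tasks, 0))
print(route_after_postprocess(tasks, 1))
```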
123+
124+
#### Mapping Node
125+
126+
Mapping Node converts the collected intermediate-format data into the standard Alpaca format, using an LLM to generate adaptive Python mapping functions.
127+
128+
1. **Read Intermediate Data**: Loads raw records from `intermediate.jsonl`.
129+
2. **LLM Generates Mapping Function (Triple Verification)**:
130+
1. Generates the mapping function 3 times.
131+
2. Validates consistency on sample data.
132+
3. Uses the function only after it passes verification.
133+
3. **Batch Processing**: Executes mapping transformation on all records.
134+
4. **Quality Filtering**: Applies quality filters to remove low-quality data.
135+
5. **Save Results**: Outputs in both `.jsonl` and `.json` formats.
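
One plausible reading of the triple-verification step is sketched below: three independently generated mapping functions are run on the same sample records and accepted only if their outputs agree. The exact consistency criterion is not specified in this document, so "all candidates must produce identical output" is an assumption, as are all names in the sketch.

```python
import json
from typing import Callable, Dict, List, Optional

MappingFn = Callable[[Dict], Dict]


def pick_consistent_mapping(
    candidates: List[MappingFn], samples: List[Dict]
) -> Optional[MappingFn]:
    """Return the first candidate if every candidate agrees on all samples."""

    def run(fn: MappingFn) -> Optional[str]:
        try:
            # Serialize with sorted keys so key order does not affect equality.
            return json.dumps([fn(s) for s in samples], sort_keys=True)
        except Exception:
            return None  # A crashing candidate fails verification.

    outputs = [run(fn) for fn in candidates]
    if outputs and outputs[0] is not None and all(o == outputs[0] for o in outputs):
        return candidates[0]
    return None


def good(record: Dict) -> Dict:
    return {"instruction": record["question"], "input": "", "output": record["answer"]}


def also_good(record: Dict) -> Dict:
    return {"output": record["answer"], "input": "", "instruction": record["question"]}


def swapped(record: Dict) -> Dict:
    return {"instruction": record["answer"], "input": "", "output": record["question"]}


samples = [{"question": "How many farms are there?", "answer": "SELECT COUNT(*) FROM farm"}]
accepted = pick_consistent_mapping([good, also_good, good], samples)
rejected = pick_consistent_mapping([good, swapped, good], samples)
print(accepted is good, rejected is None)
```

A majority vote (two of three agreeing) would be a more lenient alternative; which rule the workflow applies is not stated here.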
136+
137+
**Alpaca Format Definition**:
138+
139+
```json
140+
{
141+
"instruction": "Task instruction or question",
142+
"input": "Optional input context (e.g., system prompt, SQL Schema)",
143+
"output": "Expected answer or output"
144+
}
145+
```
146+
147+
**SFT Data Mapping Rules**:
148+
- `system` role → `input` field
149+
- `user` role → `instruction` field
150+
- `assistant` role → `output` field
151+
152+
**Mapping Example (Text2SQL)**:
153+
154+
```json
155+
// Input format
156+
{
157+
"messages": [
158+
{"role": "system", "content": "CREATE TABLE farm (Id VARCHAR)"},
159+
{"role": "user", "content": "How many farms are there?"},
160+
{"role": "assistant", "content": "SELECT COUNT(*) FROM farm"}
161+
]
162+
}
163+
164+
// Output Alpaca format
165+
{
166+
"instruction": "How many farms are there?",
167+
"input": "CREATE TABLE farm (Id VARCHAR)",
168+
"output": "SELECT COUNT(*) FROM farm"
169+
}
170+
```
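
The SFT mapping rules above can be written as a small hand-written function. This is only an illustration of the rules; in the workflow the mapping function is generated by the LLM. Using the first message of each role for multi-turn conversations is an assumption.

```python
from typing import Dict, List

# Role-to-field rules from the SFT data mapping rules above.
ROLE_TO_FIELD = {"system": "input", "user": "instruction", "assistant": "output"}


def messages_to_alpaca(record: Dict[str, List[Dict[str, str]]]) -> Dict[str, str]:
    """Convert a chat-style record into an Alpaca-format record."""
    result = {"instruction": "", "input": "", "output": ""}
    for message in record.get("messages", []):
        field = ROLE_TO_FIELD.get(message.get("role", ""))
        # Assumption: keep only the first message seen for each role.
        if field and not result[field]:
            result[field] = message.get("content", "")
    return result


example = {
    "messages": [
        {"role": "system", "content": "CREATE TABLE farm (Id VARCHAR)"},
        {"role": "user", "content": "How many farms are there?"},
        {"role": "assistant", "content": "SELECT COUNT(*) FROM farm"},
    ]
}
alpaca = messages_to_alpaca(example)
print(alpaca)
```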
171+
172+
## 3. State Management & Output
173+
174+
### 3.1 WebCollectionState Core Fields
175+
176+
```python
177+
@dataclass
178+
class WebCollectionState(MainState):
179+
# Task related
180+
user_query: str # User's original requirement
181+
task_list: List[Dict] # Decomposed task list
182+
current_task_index: int # Current task index
183+
184+
# Search related
185+
research_summary: str # Research summary
186+
urls_visited: List[str] # Visited URLs
187+
subtasks: List[Dict] # Download subtasks
188+
189+
# Download related
190+
download_results: Dict # Download result statistics
191+
192+
# WebCrawler related
193+
webcrawler_crawled_pages: List # Crawled pages
194+
webcrawler_sft_records: List # SFT records
195+
webcrawler_pt_records: List # PT records
196+
197+
# Mapping related
198+
mapping_results: Dict # Mapping results
199+
intermediate_data_path: str # Intermediate data path
200+
```
201+
202+
### 3.2 WebCollectionRequest Configuration
203+
204+
```python
205+
@dataclass
206+
class WebCollectionRequest(MainRequest):
207+
# Task configuration
208+
category: str = "PT" # PT or SFT
209+
output_format: str = "alpaca"
210+
211+
# Search configuration
212+
search_engine: str = "tavily"
213+
max_depth: int = 2
214+
max_urls: int = 10
215+
concurrent_limit: int = 5
216+
topk_urls: int = 5
217+
218+
# WebCrawler configuration
219+
enable_webcrawler: bool = True
220+
webcrawler_num_queries: int = 5
221+
webcrawler_crawl_depth: int = 3
222+
webcrawler_concurrent_pages: int = 3
223+
```
224+
225+
### 3.3 Output File Structure
226+
227+
```
228+
web_collection_output/
229+
├── rag_db/ # RAG vector database
230+
├── hf_datasets/ # HuggingFace downloaded data
231+
│ └── dataset_name/
232+
├── kaggle_datasets/ # Kaggle downloaded data
233+
├── web_downloads/ # Direct web downloads
234+
├── webcrawler_output/ # WebCrawler crawled results
235+
│ └── webcrawler_crawled.jsonl
236+
├── processed_output/ # Post-processing results
237+
│ └── intermediate.jsonl
238+
└── mapped_output/ # Final mapping results
239+
├── final_alpaca_sft.jsonl # Alpaca format (JSONL)
240+
└── final_alpaca_sft.json # Alpaca format (JSON)
241+
```
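
To inspect the final output, the JSONL file can be loaded with a few lines of Python. This is a minimal sketch: the helper name `load_alpaca_jsonl` is hypothetical, and the check for the three Alpaca keys is an added convenience rather than part of the workflow.

```python
import json
from pathlib import Path
from typing import Dict, List

ALPACA_KEYS = {"instruction", "input", "output"}


def load_alpaca_jsonl(path: str) -> List[Dict[str, str]]:
    """Load one JSON object per line and check the Alpaca keys are present."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # Skip blank lines.
            record = json.loads(line)
            missing = ALPACA_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
            records.append(record)
    return records


if __name__ == "__main__":
    output_file = "web_collection_output/mapped_output/final_alpaca_sft.jsonl"
    if Path(output_file).exists():
        print(len(load_alpaca_jsonl(output_file)), "records loaded")
```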
242+
243+
## 4. User Guide
244+
245+
This feature provides two modes of usage: **Graphical Interface (Gradio UI)** and **Command Line Script**.
246+
247+
### 4.1 Graphical Interface
248+
249+
The front-end page code is located in `gradio_app/pages/web_collection.py`, providing a visual interactive experience. To launch the web interface:
250+
251+
```bash
252+
python gradio_app/app.py
253+
```
254+
255+
Visit `http://127.0.0.1:7860` to start using the interface.
256+
257+
![web_agent](/web_agent.png)
258+
259+
1. `step1:` Describe the type of data you want to collect in the "Target Description" field
260+
2. `step2:` Select the data category (PT or SFT)
261+
3. `step3:` Configure dataset quantity and size limits
262+
4. `step4:` Configure LLM API information (URL, Key, Model)
263+
5. `step5:` (Optional) Configure Kaggle, Tavily, and other service keys
264+
6. `step6:` Click the **"Start Web Collection & Conversion"** button
265+
7. `step7:` Monitor the execution logs in real time
266+
8. `step8:` Review the result summary after completion
267+
9. `step9:` Check the collected data in the download directory
268+
269+
**Advanced Usage**: Expand the "Advanced Configuration" section to adjust search engine selection, parallelism, caching strategy, data conversion parameters, etc.
270+
271+
### 4.2 Script Invocation
272+
273+
For automated tasks or batch collection, it is recommended to use the command line script `script/run_web_collection.py` directly.
274+
275+
#### 1. Environment Variable Configuration
276+
277+
```bash
278+
export DF_API_URL="https://api.openai.com/v1"
279+
export DF_API_KEY="your_api_key"
280+
export TAVILY_API_KEY="your_tavily_key"
281+
export KAGGLE_USERNAME=""
282+
export KAGGLE_KEY=""
283+
export RAG_API_URL=""
284+
export RAG_API_KEY=""
285+
```
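
Before launching a long run, it can help to confirm that the variables above are set. The sketch below is an assumption-laden convenience check: treating `DF_API_URL` and `DF_API_KEY` as required and the rest as optional is inferred from the UI steps in this guide, not from the script itself.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Assumption: the LLM endpoint and key are required; the others are optional.
REQUIRED_VARS = ["DF_API_URL", "DF_API_KEY"]
OPTIONAL_VARS = ["TAVILY_API_KEY", "KAGGLE_USERNAME", "KAGGLE_KEY", "RAG_API_URL", "RAG_API_KEY"]


def missing_vars(names: Iterable[str], environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names that are unset or empty in the environment."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    for name in missing_vars(REQUIRED_VARS):
        print(f"missing required variable: {name}")
    for name in missing_vars(OPTIONAL_VARS):
        print(f"optional variable not set: {name}")
```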
286+
287+
#### 2. Run the Script
288+
289+
```bash
290+
# Basic usage
291+
python script/run_web_collection.py --target "Collect machine learning Q&A datasets"
292+
293+
# Full parameters
294+
python script/run_web_collection.py \
295+
--target "Collect code generation datasets" \
296+
--category SFT \
297+
--max-urls 10 \
298+
--max-depth 2 \
299+
--download-dir ./my_output
300+
```
301+
302+
**Main Parameter Description**:
303+
304+
- **`--target`**: Data collection target description (required)
305+
- **`--category`**: Data category, `PT` or `SFT` (default `SFT`)
306+
- **`--max-urls`**: Maximum number of URLs (default 10)
307+
- **`--max-depth`**: Maximum crawl depth (default 2)
308+
- **`--output-format`**: Output format (default `alpaca`)
309+
310+
#### 3. Python API Call
311+
312+
```python
import asyncio

from dataflow_agent.workflow.wf_web_collection import run_web_collection


async def main():
    # `await` is only valid inside an async function, so wrap the call.
    result = await run_web_collection(
        target="Collect machine learning code examples",
        category="SFT",
        output_format="alpaca",
        download_dir="./my_output",
        model="gpt-4o",
    )
    print(result)


asyncio.run(main())
```
323+
324+
### 4.3 Practical Case: Collecting a Chinese Q&A Dataset
325+
326+
Suppose we need to build a Chinese Q&A training dataset for a chatbot. Here is the complete workflow.
327+
328+
**Scenario Configuration:**
329+
330+
```bash
331+
export DF_API_URL="https://api.openai.com/v1"
332+
export DF_API_KEY="your_api_key"
333+
export TAVILY_API_KEY="your_tavily_key"
334+
335+
python script/run_web_collection.py \
336+
--target "Collect Chinese Q&A datasets for fine-tuning" \
337+
--category SFT \
338+
--max-urls 20
339+
```
340+
341+
**Run:**
342+
Once the script is started, the workflow executes the following steps:
343+
344+
1. **Task Decomposition**: The LLM decomposes "Collect Chinese Q&A datasets for fine-tuning" into multiple subtasks (e.g., Chinese common-knowledge Q&A, Chinese reading comprehension).
345+
2. **Category Classification**: Based on the "fine-tuning" keyword, the task is automatically classified as SFT.
346+
3. **Parallel Collection**: WebSearch explores Chinese QA datasets on platforms such as HuggingFace and GitHub; WebCrawler simultaneously crawls Q&A content from technical blogs.
347+
4. **Intelligent Download**: The LLM prioritizes downloading matching datasets from HuggingFace and falls back to Kaggle and direct web download on failure.
348+
5. **Format Mapping**: Converts the downloaded heterogeneous data into unified Alpaca format, outputting to the `mapped_output/` directory.
349+
350+
The final `final_alpaca_sft.jsonl` file is written to the `mapped_output/` folder inside the download directory and is ready for direct use in model fine-tuning.
351+
352+
### 4.4 Notes
353+
354+
1. **API Keys**
355+
- Ensure that necessary API keys are configured
356+
- Tavily is used for search; Kaggle is used for downloading Kaggle datasets
357+
358+
2. **Network Environment**
359+
- If located in China, it is recommended to use a HuggingFace mirror (set `HF_ENDPOINT`)
360+
- Adjust the parallelism to match your network bandwidth
361+
362+
3. **Storage Space**
363+
- Ensure sufficient disk space is available
364+
- Large datasets may require several GB of storage
365+
366+
4. **Execution Time**
367+
- The collection process may take a considerable amount of time (minutes to hours)
368+
- You can control the duration by limiting the number of download tasks
369+
370+
5. **Data Quality**
371+
- Enabling RAG enhancement can improve data quality
372+
- Adjust sampling parameters to balance quality and speed
