Commit 27ab285

merge from main
2 parents e783736 + b31c1fb commit 27ab285

36 files changed, with 1456 additions and 133 deletions

.env.example

Lines changed: 24 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,7 +1,29 @@
1+
# Tokenizer
12
TOKENIZER_MODEL=
2-
SYNTHESIZER_MODEL=
3+
4+
# LLM
5+
# Support different backends: http_api, openai_api, ollama_api, ollama, huggingface, tgi, sglang, tensorrt
6+
7+
# http_api / openai_api
8+
SYNTHESIZER_BACKEND=openai_api
9+
SYNTHESIZER_MODEL=gpt-4o-mini
310
SYNTHESIZER_BASE_URL=
411
SYNTHESIZER_API_KEY=
5-
TRAINEE_MODEL=
12+
TRAINEE_BACKEND=openai_api
13+
TRAINEE_MODEL=gpt-4o-mini
614
TRAINEE_BASE_URL=
715
TRAINEE_API_KEY=
16+
17+
# # ollama_api
18+
# SYNTHESIZER_BACKEND=ollama_api
19+
# SYNTHESIZER_MODEL=gemma3
20+
# SYNTHESIZER_BASE_URL=http://localhost:11434
21+
#
22+
# Note: TRAINEE with ollama_api backend is not supported yet as ollama_api does not support logprobs.
23+
24+
# # huggingface
25+
# SYNTHESIZER_BACKEND=huggingface
26+
# SYNTHESIZER_MODEL=Qwen/Qwen2.5-0.5B-Instruct
27+
#
28+
# TRAINEE_BACKEND=huggingface
29+
# TRAINEE_MODEL=Qwen/Qwen2.5-0.5B-Instruct
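
The variables above follow a `<ROLE>_<FIELD>` naming pattern, with one block per role (`SYNTHESIZER`, `TRAINEE`). A minimal sketch of how a loader might read and validate them is shown here. `load_llm_settings` and `LLMSettings` are hypothetical names used only for illustration; this is not GraphGen's `init_llm`, whose implementation does not appear in this diff.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Backend names copied from the comment in .env.example.
SUPPORTED_BACKENDS = {
    "http_api", "openai_api", "ollama_api", "ollama",
    "huggingface", "tgi", "sglang", "tensorrt",
}


@dataclass(frozen=True)
class LLMSettings:  # hypothetical container, for illustration only
    role: str
    backend: str
    model: str
    base_url: Optional[str] = None
    api_key: Optional[str] = None


def load_llm_settings(role: str, env: Mapping[str, str]) -> LLMSettings:
    """Read <ROLE>_BACKEND, <ROLE>_MODEL, ... from an environment mapping."""
    prefix = role.upper()
    backend = env.get(f"{prefix}_BACKEND", "").strip()
    model = env.get(f"{prefix}_MODEL", "").strip()
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"{prefix}_BACKEND={backend!r} is not a supported backend")
    if not model:
        raise ValueError(f"{prefix}_MODEL must be set")
    # Per the note in .env.example: the trainee needs logprobs, which
    # the ollama_api backend does not provide.
    if role == "trainee" and backend == "ollama_api":
        raise ValueError("TRAINEE with ollama_api is not supported (no logprobs)")
    return LLMSettings(
        role=role,
        backend=backend,
        model=model,
        # Empty strings in the .env file are treated as "not set".
        base_url=env.get(f"{prefix}_BASE_URL") or None,
        api_key=env.get(f"{prefix}_API_KEY") or None,
    )
```

Passing the mapping explicitly (for example `load_llm_settings("synthesizer", os.environ)`) keeps the sketch easy to test without touching the real process environment.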

README.md

Lines changed: 39 additions & 10 deletions
@@ -21,13 +21,14 @@
2121

2222
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
2323

24-
[English](README.md) | [中文](README_zh)
24+
[English](README.md) | [中文](README_zh.md)
2525

2626
<details close>
2727
<summary><b>📚 Table of Contents</b></summary>
2828

2929
- 📝 [What is GraphGen?](#-what-is-graphgen)
3030
- 📌 [Latest Updates](#-latest-updates)
31+
- ⚙️ [Support List](#-support-list)
3132
- 🚀 [Quick Start](#-quick-start)
3233
- 🏗️ [System Architecture](#-system-architecture)
3334
- 🍀 [Acknowledgements](#-acknowledgements)
@@ -47,13 +48,13 @@ GraphGen is a framework for synthetic data generation guided by knowledge graphs
4748

4849
Here is post-training result which **over 50% SFT data** comes from GraphGen and our data clean pipeline.
4950

50-
| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
51-
| :-: | :-: | :-: | :-: |
52-
| Plant| [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
53-
| Common | CMMLU | 73.6 | **75.8** |
54-
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
55-
| Math | AIME24 | **20.6** | 16.7 |
56-
| | AIME25 | **22.7** | 7.2 |
51+
| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
52+
|:---------:|:---------------------------------------------------------:|:--------:|:------------------------------:|
53+
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
54+
| Common | CMMLU | 73.6 | **75.8** |
55+
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
56+
| Math | AIME24 | **20.6** | 16.7 |
57+
| | AIME25 | **22.7** | 7.2 |
5758

5859
It begins by constructing a fine-grained knowledge graph from the source text,then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
5960
Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
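
The expected calibration error (ECE) named above compares a model's stated confidence with its observed accuracy. A minimal sketch of the standard equal-width binned estimate follows. The function name, the bin count, and the input representation are illustrative assumptions; GraphGen's own computation is not shown in this diff and may differ.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sum over bins of (bin size / n) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Per-bin running totals: [count, sum of confidences, number correct].
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 lands in the last bin and
        # out-of-range values land in the nearest edge bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(ok)

    ece = 0.0
    for count, conf_sum, n_correct in bins:
        if count == 0:
            continue
        accuracy = n_correct / count
        mean_conf = conf_sum / count
        ece += (count / n) * abs(accuracy - mean_conf)
    return ece
```

A model that is right every time it reports confidence 1.0 scores 0; a model that reports 0.9 but is right only a quarter of the time scores about 0.65, which marks that knowledge as poorly calibrated.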
@@ -62,20 +63,48 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
6263

6364
## 📌 Latest Updates
6465

66+
- **2025.10.30**: We support several new LLM clients and inference backends including [Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py) and [SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
6567
- **2025.10.23**: We support VQA(Visual Question Answering) data generation now. Run script: `bash scripts/generate/generate_vqa.sh`.
6668
- **2025.10.21**: We support PDF as input format for data generation now via [MinerU](https://github.com/opendatalab/MinerU).
67-
- **2025.09.29**: We auto-update gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
6869

6970
<details>
7071
<summary>History</summary>
7172

73+
- **2025.09.29**: We auto-update gradio demo on [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen) and [ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen).
7274
- **2025.08.14**: We have added support for community detection in knowledge graphs using the Leiden algorithm, enabling the synthesis of Chain-of-Thought (CoT) data.
7375
- **2025.07.31**: We have added Google, Bing, Wikipedia, and UniProt as search back-ends.
7476
- **2025.04.21**: We have released the initial version of GraphGen.
7577

7678
</details>
7779

7880

81+
## ⚙️ Support List
82+
83+
We support various LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
84+
Users can flexibly configure according to the needs of synthetic data.
85+
86+
| Inference Server | Api Server | Inference Client | Input File Format | Data Modal | Data Format | Data Type |
87+
|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|---------------|------------------------------|-------------------------------------------------|
88+
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>IMAGE | Alpaca<br>ChatML<br>Sharegpt | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
89+
90+
<!-- links -->
91+
[hf]: https://huggingface.co/docs/transformers/index
92+
[sg]: https://docs.sglang.ai
93+
[sif]: https://siliconflow.cn
94+
[oai]: https://openai.com
95+
[az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
96+
[ol]: https://ollama.com
97+
98+
<!-- icons -->
99+
[hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
100+
[sg-icon]: https://www.google.com/s2/favicons?domain=https://docs.sglang.ai
101+
[sif-icon]: https://www.google.com/s2/favicons?domain=siliconflow.com
102+
[oai-icon]: https://www.google.com/s2/favicons?domain=https://openai.com
103+
[az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
104+
[ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
105+
106+
107+
79108
## 🚀 Quick Start
80109

81110
Experience GraphGen through [Web](https://g-app-center-120612-6433-jpdvmvp.openxlab.space) or [Backup Web Entrance](https://openxlab.org.cn/apps/detail/chenzihonga/GraphGen)
@@ -176,7 +205,7 @@ For any questions, please check [FAQ](https://github.com/open-sciencelab/GraphGe
176205
Pick the desired format and run the matching script:
177206

178207
| Format | Script to run | Notes |
179-
| ------------ | ---------------------------------------------- |-------------------------------------------------------------------|
208+
|--------------|------------------------------------------------|-------------------------------------------------------------------|
180209
| `cot` | `bash scripts/generate/generate_cot.sh` | Chain-of-Thought Q\&A pairs |
181210
| `atomic` | `bash scripts/generate/generate_atomic.sh` | Atomic Q\&A pairs covering basic knowledge |
182211
| `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated Q\&A pairs incorporating complex, integrated knowledge |

README_zh.md

Lines changed: 43 additions & 17 deletions
@@ -20,19 +20,20 @@
2020

2121
GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
2222

23-
[English](README.md) | [中文](README_zh)
23+
[English](README.md) | [中文](README_zh.md)
2424

2525
<details close>
2626
<summary><b>📚 目录</b></summary>
2727

2828
- 📝 [什么是 GraphGen?](#-什么是-graphgen)
29-
- 📌 [最新更新](#最新更新)
30-
- 🚀 [快速开始](#快速开始)
31-
- 🏗️ [系统架构](#系统架构)
32-
- 🍀 [致谢](#致谢)
33-
- 📚 [引用](#引用)
34-
- 📜 [许可证](#许可证)
35-
- 📅 [星标历史](#星标历史)
29+
- 📌 [最新更新](#-最新更新)
30+
- ⚙️ [支持列表](#-支持列表)
31+
- 🚀 [快速开始](#-快速开始)
32+
- 🏗️ [系统架构](#-系统架构)
33+
- 🍀 [致谢](#-致谢)
34+
- 📚 [引用](#-引用)
35+
- 📜 [许可证](#-许可证)
36+
- 📅 [星标历史](#-星标历史)
3637

3738

3839
[//]: # (- 🌟 [主要特性](#主要特性))
@@ -48,34 +49,59 @@ GraphGen 是一个基于知识图谱的数据合成框架。请查看[**论文**
4849

4950
以下是在超过 50 % 的 SFT 数据来自 GraphGen 及我们的数据清洗流程时的训练后结果:
5051

51-
| 领域 | 数据集 | 我们的方案 | Qwen2.5-7B-Instruct(基线) |
52-
| :-: | :-: | :-: | :-: |
53-
| 植物 | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
54-
| 常识 | CMMLU | 73.6 | **75.8** |
55-
| 知识 | GPQA-Diamond | **40.0** | 33.3 |
56-
| 数学 | AIME24 | **20.6** | 16.7 |
57-
| | AIME25 | **22.7** | 7.2 |
52+
| 领域 | 数据集 | 我们的方案 | Qwen2.5-7B-Instruct(基线) |
53+
|:--:|:---------------------------------------------------------:|:--------:|:-----------------------:|
54+
| 植物 | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
55+
| 常识 | CMMLU | 73.6 | **75.8** |
56+
| 知识 | GPQA-Diamond | **40.0** | 33.3 |
57+
| 数学 | AIME24 | **20.6** | 16.7 |
58+
| | AIME25 | **22.7** | 7.2 |
5859

5960
GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期望校准误差指标识别大语言模型中的知识缺口,优先生成针对高价值长尾知识的问答对。
6061
此外,GraphGen 采用多跳邻域采样捕获复杂关系信息,并使用风格控制生成来丰富问答数据的多样性。
6162

6263
在数据生成后,您可以使用[LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)[xtuner](https://github.com/InternLM/xtuner)对大语言模型进行微调。
6364

6465
## 📌 最新更新
65-
66+
- **2025.10.30** 我们支持多种新的 LLM 客户端和推理后端,包括 [Ollama_client]([Ollama_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/ollama_client.py), [http_client](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/api/http_client.py), [HuggingFace Transformers](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/hf_wrapper.py)[SGLang](https://github.com/open-sciencelab/GraphGen/blob/main/graphgen/models/llm/local/sglang_wrapper.py).
6667
- **2025.10.23**:我们现在支持视觉问答(VQA)数据生成。运行脚本:`bash scripts/generate/generate_vqa.sh`
6768
- **2025.10.21**:我们现在通过 [MinerU](https://github.com/opendatalab/MinerU) 支持 PDF 作为数据生成的输入格式。
68-
- **2025.09.29**:我们在 [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen)[ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen) 上自动更新 Gradio 应用。
6969

7070
<details>
7171
<summary>历史更新</summary>
7272

73+
- **2025.09.29**:我们在 [Hugging Face](https://huggingface.co/spaces/chenzihong/GraphGen)[ModelScope](https://modelscope.cn/studios/chenzihong/GraphGen) 上自动更新 Gradio 应用。
7374
- **2025.08.14**:支持利用 Leiden 社区发现算法对知识图谱进行社区划分,合成 CoT 数据。
7475
- **2025.07.31**:新增 Google、Bing、Wikipedia 和 UniProt 作为搜索后端,帮助填补数据缺口。
7576
- **2025.04.21**:发布 GraphGen 初始版本。
7677

7778
</details>
7879

80+
## ⚙️ 支持列表
81+
82+
我们支持多种 LLM 推理服务器、API 服务器、推理客户端、输入文件格式、数据模态、输出数据格式和输出数据类型。
83+
可以根据合成数据的需求进行灵活配置。
84+
85+
| 推理服务器 | API 服务器 | 推理客户端 | 输入文件格式 | 数据模态 | 输出数据格式 | 输出数据类型 |
86+
|----------------------------------------------|--------------------------------------------------------------------------------|------------------------------------------------------------|------------------------------------|--------------|------------------------------|-------------------------------------------------|
87+
| [![hf-icon]HF][hf]<br>[![sg-icon]SGLang][sg] | [![sif-icon]Silicon][sif]<br>[![oai-icon]OpenAI][oai]<br>[![az-icon]Azure][az] | HTTP<br>[![ol-icon]Ollama][ol]<br>[![oai-icon]OpenAI][oai] | CSV<br>JSON<br>JSONL<br>PDF<br>TXT | TEXT<br>TEXT | Alpaca<br>ChatML<br>Sharegpt | Aggregated<br>Atomic<br>CoT<br>Multi-hop<br>VQA |
88+
89+
<!-- links -->
90+
[hf]: https://huggingface.co/docs/transformers/index
91+
[sg]: https://docs.sglang.ai
92+
[sif]: https://siliconflow.cn
93+
[oai]: https://openai.com
94+
[az]: https://azure.microsoft.com/en-us/services/cognitive-services/openai-service/
95+
[ol]: https://ollama.com
96+
97+
<!-- icons -->
98+
[hf-icon]: https://www.google.com/s2/favicons?domain=https://huggingface.co
99+
[sg-icon]: https://www.google.com/s2/favicons?domain=https://docs.sglang.ai
100+
[sif-icon]: https://www.google.com/s2/favicons?domain=siliconflow.com
101+
[oai-icon]: https://www.google.com/s2/favicons?domain=https://openai.com
102+
[az-icon]: https://www.google.com/s2/favicons?domain=https://azure.microsoft.com
103+
[ol-icon]: https://www.google.com/s2/favicons?domain=https://ollama.com
104+
79105

80106
## 🚀 快速开始
81107

graphgen/bases/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
11
from .base_generator import BaseGenerator
22
from .base_kg_builder import BaseKGBuilder
3-
from .base_llm_client import BaseLLMClient
3+
from .base_llm_wrapper import BaseLLMWrapper
44
from .base_partitioner import BasePartitioner
55
from .base_reader import BaseReader
66
from .base_splitter import BaseSplitter

graphgen/bases/base_generator.py

Lines changed: 2 additions & 2 deletions
@@ -1,15 +1,15 @@
11
from abc import ABC, abstractmethod
22
from typing import Any
33

4-
from graphgen.bases.base_llm_client import BaseLLMClient
4+
from graphgen.bases.base_llm_wrapper import BaseLLMWrapper
55

66

77
class BaseGenerator(ABC):
88
"""
99
Generate QAs based on given prompts.
1010
"""
1111

12-
def __init__(self, llm_client: BaseLLMClient):
12+
def __init__(self, llm_client: BaseLLMWrapper):
1313
self.llm_client = llm_client
1414

1515
@staticmethod

graphgen/bases/base_kg_builder.py

Lines changed: 2 additions & 2 deletions
@@ -2,13 +2,13 @@
22
from collections import defaultdict
33
from typing import Dict, List, Tuple
44

5-
from graphgen.bases.base_llm_client import BaseLLMClient
5+
from graphgen.bases.base_llm_wrapper import BaseLLMWrapper
66
from graphgen.bases.base_storage import BaseGraphStorage
77
from graphgen.bases.datatypes import Chunk
88

99

1010
class BaseKGBuilder(ABC):
11-
def __init__(self, llm_client: BaseLLMClient):
11+
def __init__(self, llm_client: BaseLLMWrapper):
1212
self.llm_client = llm_client
1313
self._nodes: Dict[str, List[dict]] = defaultdict(list)
1414
self._edges: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
Lines changed: 7 additions & 1 deletion
@@ -8,7 +8,7 @@
88
from graphgen.bases.datatypes import Token
99

1010

11-
class BaseLLMClient(abc.ABC):
11+
class BaseLLMWrapper(abc.ABC):
1212
"""
1313
LLM client base class, agnostic to specific backends (OpenAI / Ollama / ...).
1414
"""
@@ -66,3 +66,9 @@ def filter_think_tags(text: str, think_tag: str = "think") -> str:
6666
think_pattern = re.compile(rf"<{think_tag}>.*?</{think_tag}>", re.DOTALL)
6767
filtered_text = think_pattern.sub("", text).strip()
6868
return filtered_text if filtered_text else text.strip()
69+
70+
def shutdown(self) -> None:
71+
"""Shutdown the LLM engine if applicable."""
72+
73+
def restart(self) -> None:
74+
"""Reinitialize the LLM engine if applicable."""

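The two new hooks, `shutdown` and `restart`, have empty bodies on the base class, so API-backed clients can ignore them while local engines can override them to release and reacquire resources. A minimal sketch of that pattern follows. `SketchLLMWrapper` is a simplified stand-in for `BaseLLMWrapper`, and `FakeLocalEngine` and its `generate` method are hypothetical; only `filter_think_tags` and the two hook signatures are taken from the hunk above.

```python
import abc
import re


class SketchLLMWrapper(abc.ABC):
    """Simplified stand-in for BaseLLMWrapper (illustration only)."""

    @abc.abstractmethod
    def generate(self, prompt: str) -> str:
        """Return the model's answer for a prompt (hypothetical method)."""

    @staticmethod
    def filter_think_tags(text: str, think_tag: str = "think") -> str:
        # Drop <think>...</think> spans; fall back to the raw text if
        # nothing else remains.
        think_pattern = re.compile(rf"<{think_tag}>.*?</{think_tag}>", re.DOTALL)
        filtered_text = think_pattern.sub("", text).strip()
        return filtered_text if filtered_text else text.strip()

    def shutdown(self) -> None:
        """Shutdown the LLM engine if applicable (no-op by default)."""

    def restart(self) -> None:
        """Reinitialize the LLM engine if applicable (no-op by default)."""


class FakeLocalEngine(SketchLLMWrapper):
    """Hypothetical local backend that holds a resource between calls."""

    def __init__(self) -> None:
        self.loaded = True

    def generate(self, prompt: str) -> str:
        if not self.loaded:
            raise RuntimeError("engine is shut down; call restart() first")
        return self.filter_think_tags(f"<think>draft</think>echo: {prompt}")

    def shutdown(self) -> None:
        self.loaded = False  # a real engine would free GPU memory here

    def restart(self) -> None:
        self.loaded = True  # a real engine would reload the weights here
```

Keeping the hooks as concrete no-ops rather than abstract methods means existing subclasses keep working without changes, at the cost of silently doing nothing when a backend forgets to override them.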
graphgen/graphgen.py

Lines changed: 20 additions & 17 deletions
@@ -5,6 +5,7 @@
55

66
import gradio as gr
77

8+
from graphgen.bases import BaseLLMWrapper
89
from graphgen.bases.base_storage import StorageNameSpace
910
from graphgen.bases.datatypes import Chunk
1011
from graphgen.models import (
@@ -18,6 +19,7 @@
1819
build_kg,
1920
chunk_documents,
2021
generate_qas,
22+
init_llm,
2123
judge_statement,
2224
partition_kg,
2325
quiz,
@@ -39,30 +41,18 @@ def __init__(
3941
trainee_llm_client: OpenAIClient = None,
4042
progress_bar: gr.Progress = None,
4143
):
42-
self.unique_id = unique_id
43-
self.working_dir = working_dir
44+
self.unique_id: int = unique_id
45+
self.working_dir: str = working_dir
4446

4547
# llm
4648
self.tokenizer_instance: Tokenizer = tokenizer_instance or Tokenizer(
4749
model_name=os.getenv("TOKENIZER_MODEL")
4850
)
4951

50-
self.synthesizer_llm_client: OpenAIClient = (
51-
synthesizer_llm_client
52-
or OpenAIClient(
53-
model_name=os.getenv("SYNTHESIZER_MODEL"),
54-
api_key=os.getenv("SYNTHESIZER_API_KEY"),
55-
base_url=os.getenv("SYNTHESIZER_BASE_URL"),
56-
tokenizer=self.tokenizer_instance,
57-
)
58-
)
59-
60-
self.trainee_llm_client: OpenAIClient = trainee_llm_client or OpenAIClient(
61-
model_name=os.getenv("TRAINEE_MODEL"),
62-
api_key=os.getenv("TRAINEE_API_KEY"),
63-
base_url=os.getenv("TRAINEE_BASE_URL"),
64-
tokenizer=self.tokenizer_instance,
52+
self.synthesizer_llm_client: BaseLLMWrapper = (
53+
synthesizer_llm_client or init_llm("synthesizer")
6554
)
55+
self.trainee_llm_client: BaseLLMWrapper = trainee_llm_client
6656

6757
self.full_docs_storage: JsonKVStorage = JsonKVStorage(
6858
self.working_dir, namespace="full_docs"
@@ -210,16 +200,29 @@ async def quiz_and_judge(self, quiz_and_judge_config: Dict):
210200
)
211201

212202
# TODO: assert trainee_llm_client is valid before judge
203+
if not self.trainee_llm_client:
204+
# TODO: shutdown existing synthesizer_llm_client properly
205+
logger.info("No trainee LLM client provided, initializing a new one.")
206+
self.synthesizer_llm_client.shutdown()
207+
self.trainee_llm_client = init_llm("trainee")
208+
213209
re_judge = quiz_and_judge_config["re_judge"]
214210
_update_relations = await judge_statement(
215211
self.trainee_llm_client,
216212
self.graph_storage,
217213
self.rephrase_storage,
218214
re_judge,
219215
)
216+
220217
await self.rephrase_storage.index_done_callback()
221218
await _update_relations.index_done_callback()
222219

220+
logger.info("Shutting down trainee LLM client.")
221+
self.trainee_llm_client.shutdown()
222+
self.trainee_llm_client = None
223+
logger.info("Restarting synthesizer LLM client.")
224+
self.synthesizer_llm_client.restart()
225+
223226
@async_to_sync_method
224227
async def generate(self, partition_config: Dict, generate_config: Dict):
225228
# Step 1: partition the graph
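
The `quiz_and_judge` change above swaps engines in sequence: shut down the synthesizer, initialize the trainee, judge, shut down the trainee, then restart the synthesizer. A minimal synchronous sketch of that hand-off follows, with a `try`/`finally` guard so the synthesizer is restarted even if judging raises. The guard is a possible hardening and is not part of this commit; `RecordingEngine` and `run_with_trainee` are hypothetical names, and the real method is `async`.

```python
from typing import Callable, List, Optional


class RecordingEngine:
    """Hypothetical engine that logs lifecycle calls instead of loading a model."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self.log = log

    def shutdown(self) -> None:
        self.log.append(f"{self.name}.shutdown")

    def restart(self) -> None:
        self.log.append(f"{self.name}.restart")


def run_with_trainee(
    synthesizer: RecordingEngine,
    make_trainee: Callable[[], RecordingEngine],
    judge: Callable[[RecordingEngine], None],
) -> None:
    """Free the synthesizer, run `judge` on a fresh trainee, then swap back."""
    synthesizer.shutdown()
    trainee: Optional[RecordingEngine] = None
    try:
        trainee = make_trainee()
        judge(trainee)
    finally:
        # Runs on success and on failure, so the synthesizer always returns.
        if trainee is not None:
            trainee.shutdown()
        synthesizer.restart()
```

Only one engine is live at a time in this arrangement, which matters when both roles use a local backend that competes for the same accelerator memory.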

graphgen/models/__init__.py

Lines changed: 1 addition & 2 deletions
@@ -7,8 +7,7 @@
77
VQAGenerator,
88
)
99
from .kg_builder import LightRAGKGBuilder, MMKGBuilder
10-
from .llm.openai_client import OpenAIClient
11-
from .llm.topk_token_model import TopkTokenModel
10+
from .llm import HTTPClient, OllamaClient, OpenAIClient
1211
from .partitioner import (
1312
AnchorBFSPartitioner,
1413
BFSPartitioner,
