GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
GraphGen is a framework for synthetic data generation guided by knowledge graphs. See our paper and best practices for details.
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
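To make the calibration idea above concrete, here is a minimal sketch of the standard binned expected calibration error (ECE). It is an illustration only: how GraphGen computes its own calibration signal is not described in this README, and the function name `expected_calibration_error`, its parameters, and the choice of 10 equal-width bins are assumptions made for the example.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper for illustration; not taken from the GraphGen codebase.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins  # sum of confidences per bin
    hit_sum = [0] * n_bins     # number of correct predictions per bin
    count = [0] * n_bins       # number of predictions per bin

    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence values must lie in [0, 1]")
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        avg_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        ece += (count[b] / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four answers given with 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A larger value means the model's stated confidence diverges more from its actual accuracy, which is the kind of signal that could be used to decide which knowledge to target with new QA pairs.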
Try GraphGen through the Web entrance or the Backup Web Entrance.
For any questions, please check the FAQ, open a new issue, or join our WeChat group and ask.
python webui/app.py
- Install GraphGen
pip install graphg
- Run in CLI
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
- Install dependencies
pip install -r requirements.txt
- Configure the environment
- Create an .env file in the root directory
cp .env.example .env
- Set the following environment variables:
# Synthesizer is the model used to construct KG and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model used to train with the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
- (Optional) To modify the default configuration, edit the configs/graphgen_config.yaml file.
# configs/graphgen_config.yaml
# Example configuration
data_type: "raw"
input_file: "resources/examples/raw_demo.jsonl"
# more configurations...
- Run the generation script
bash scripts/generate.sh
- Get the generated data
ls cache/data/graphgen
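As an optional pre-flight check for the configuration step above, a short script can confirm that the six environment variables are set before the generation script is run. This is a sketch under stated assumptions: the helper `missing_vars` is hypothetical and not part of GraphGen, and it only checks the process environment, so values stored in the .env file must already be exported or loaded for it to see them.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the configuration step in this README.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_vars(env: Optional[Mapping[str, str]] = None) -> list:
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of the GraphGen package.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in isolation.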
- Build the Docker image
docker build -t graphgen .
- Run the Docker container
docker run -p 7860:7860 graphgen
- 2025.04.21: We have released the initial version of GraphGen.
See analysis by deepwiki for a technical overview of the GraphGen system, its architecture, and core functionalities.
- SiliconCloud: Abundant LLM API, some models are free
- LightRAG: Simple and efficient graph retrieval solution
- ROGRAG: A Robustly Optimized GraphRAG Framework
If you find this repository useful, please consider citing our work:
@software{Chen_GraphGen_2025,
author = {Chen, Zihong and Jiang, Wanli and Li, Jingzhe and Yuan, Zhonghang and Wang, Chenyang and Kong, Huanjun and Dong, Nanqing},
month = apr,
title = {{GraphGen}},
url = {https://github.com/open-sciencelab/GraphGen},
year = {2025}
}
This project is licensed under the Apache License 2.0.