refactor: refactor baselines #138

ChenZiHong-Gavin · 2025-12-23T16:33:39Z

This pull request refactors multiple baseline methods in the GraphGen project to simplify their interfaces, unify data handling logic, and improve code maintainability.

gemini-code-assist · 2025-12-23T16:34:12Z

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring effort across multiple baseline methods within the GraphGen project. The core objective is to enhance code maintainability and simplify interfaces by centralizing LLM client management, standardizing data loading mechanisms, and consolidating related functionalities into more cohesive structures. This refactoring reduces boilerplate, improves error handling, and sets the stage for more consistent development of future baseline implementations.

Highlights

Centralized LLM Client: The OpenAIClient and Tokenizer from graphgen.models are now consistently used across all baseline implementations, replacing disparate LLM interaction methods.
EntiGraph Refactoring: The entigraph.py module has been completely restructured into a dataclass-based EntiGraph class, consolidating logic for entity, relation, and QA generation and removing several old helper files.
Standardized Data Input: The --data_type command-line argument has been removed from all baseline scripts, and a unified load_and_dedup_data function has been introduced in entigraph.py, simplifying how input data is handled.
Enhanced Robustness: Improved error handling and JSON parsing logic for LLM responses have been added, particularly in entigraph.py, to make the generation process more resilient to varied outputs.
Simplified Example Scripts: All generate_*.sh example scripts have been updated to reflect the streamlined command-line interface, removing the now-obsolete --data_type flag.
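
To make the "Enhanced Robustness" highlight concrete, here is a minimal sketch of tolerant JSON parsing for LLM replies. It is not the code from entigraph.py; the function name parse_llm_json and the two fallback strategies (stripping a Markdown code fence, then extracting the outermost braces) are assumptions chosen for illustration.

```python
import json
import re
from typing import Any, Optional


def parse_llm_json(response: str) -> Optional[Any]:
    """Best-effort extraction of a JSON value from a raw LLM reply (hypothetical helper)."""
    text = response.strip()
    # Drop a surrounding Markdown code fence (three backticks, optional "json" tag).
    fenced = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, ignoring any chatter around it.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start : end + 1])
        except json.JSONDecodeError:
            return None
    # Nothing parseable: let the caller decide whether to retry or skip.
    return None
```

Returning None instead of raising lets a generation loop skip or retry a single bad reply without aborting the whole run.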

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request is a significant refactoring of the baseline methods, which greatly improves code structure and maintainability, particularly in the EntiGraph baseline. The introduction of a class-based approach and the simplification of interfaces are excellent changes. However, the goal of unifying data handling logic seems incomplete, as EntiGraph uses a robust data loader while other baselines use a simpler, inconsistent method. There is also a critical bug in wrap.py where it fails to create the output directory before writing results. I've left specific comments on these issues and other areas for improvement.

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/Wrap/wrap.py

+    with open(args.input_file, "r", encoding="utf-8") as f:
+        data = json.load(f)

    results = wrap.generate(data)


This script is missing the logic to create the output directory before saving the results. Unlike the other refactored baseline scripts, this one will raise a FileNotFoundError if the output directory (e.g., cache/data/) does not exist. This is a critical bug. Please add the directory creation logic before writing to the file, like so:

    # Save results
    output_dir = os.path.dirname(args.output_file)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    with open(args.output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=4, ensure_ascii=False)
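
The same fix can be wrapped in a small helper so that every baseline script saves its output the same way. This is only a sketch: save_results is a hypothetical name, not a function from the repository, and it assumes the results are JSON-serializable.

```python
import json
import os


def save_results(results, output_file: str) -> None:
    """Write results as JSON, creating the parent directory first (hypothetical helper)."""
    output_dir = os.path.dirname(output_file)
    # dirname() is empty when output_file has no directory component,
    # and os.makedirs("") would raise, so guard against that case.
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=4, ensure_ascii=False)
```

Because exist_ok=True makes the directory creation idempotent, the helper can be called repeatedly with the same path without a separate os.path.exists check.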

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/Genie/genie.py

+    with open(args.input_file, "r", encoding="utf-8") as f:
+        data = json.load(f)


The data loading logic here is a simple json.load(f), which is inconsistent with the more robust load_and_dedup_data function in baselines/EntiGraph/entigraph.py. To fulfill the PR's goal of unifying data handling, this logic should be extracted into a shared utility and used across all baseline scripts.
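
One possible shape for such a shared utility is sketched below. It is not the load_and_dedup_data implementation from entigraph.py, whose details are not shown in this thread; the name load_unique_records, the supported .json/.jsonl formats, and whole-record deduplication are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def load_unique_records(path: str) -> List[Dict[str, Any]]:
    """Load records from a .json or .jsonl file and drop exact duplicates (hypothetical)."""
    with open(path, "r", encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # One JSON object per line; blank lines are skipped.
            records = [json.loads(line) for line in f if line.strip()]
        else:
            data = json.load(f)
            # Accept either a list of records or a single record.
            records = data if isinstance(data, list) else [data]

    seen = set()
    unique = []
    for record in records:
        # Serialize with sorted keys so key order does not affect equality.
        marker = json.dumps(record, sort_keys=True, ensure_ascii=False)
        if marker not in seen:
            seen.add(marker)
            unique.append(record)
    return unique
```

Each baseline script could then replace its inline json.load(f) with a single call such as data = load_unique_records(args.input_file).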

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/LongForm/longform.py

+    with open(args.input_file, "r", encoding="utf-8") as f:
+        data = json.load(f)


The data loading logic here is a simple json.load(f), which is inconsistent with the more robust load_and_dedup_data function in baselines/EntiGraph/entigraph.py. For consistency and to meet the PR's goal of unifying data handling, a shared, robust data loading utility should be used across all baselines.

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/SELF-QA/self-qa.py

+    with open(args.input_file, "r", encoding="utf-8") as f:
+        data = json.load(f)


This script uses a simple json.load(f) for data loading, which is inconsistent with the more robust load_and_dedup_data function implemented in baselines/EntiGraph/entigraph.py. To truly unify data handling as stated in the PR description, this logic should be centralized in a shared utility and used here as well.

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/Wrap/wrap.py

+    with open(args.input_file, "r", encoding="utf-8") as f:
+        data = json.load(f)


This script uses a simple json.load(f) for data loading. This is inconsistent with the robust load_and_dedup_data function in baselines/EntiGraph/entigraph.py. To unify data handling across baselines, a shared data loading utility should be implemented and used here.

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/EntiGraph/entigraph.py

+            for i, entity_i in enumerate(entities):
+                for j in range(i + 1, len(entities)):
+                    temp_pairs.append((doc_text, entity_i, entities[j]))


This nested loop for creating entity pairs can be made more concise and Pythonic by using itertools.combinations. Please also add from itertools import combinations to the top of the file.

    for entity_i, entities_j in combinations(entities, 2):
        temp_pairs.append((doc_text, entity_i, entities_j))
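
A quick way to confirm that the two forms are interchangeable is to run them side by side. The sketch below wraps each variant in a function (the names pairs_nested and pairs_combinations and the sample entities are made up for the comparison) and checks that they yield the same pairs in the same order.

```python
from itertools import combinations


def pairs_nested(doc_text, entities):
    """Original index-based nested loop over unordered entity pairs."""
    temp_pairs = []
    for i, entity_i in enumerate(entities):
        for j in range(i + 1, len(entities)):
            temp_pairs.append((doc_text, entity_i, entities[j]))
    return temp_pairs


def pairs_combinations(doc_text, entities):
    """Same pairs via itertools.combinations, which preserves input order."""
    return [(doc_text, a, b) for a, b in combinations(entities, 2)]


entities = ["GraphGen", "EntiGraph", "Tokenizer"]
assert pairs_nested("doc", entities) == pairs_combinations("doc", entities)
print(len(pairs_combinations("doc", entities)))  # 3 pairs for 3 entities
```

Both variants produce n * (n - 1) / 2 pairs for n entities, so the change affects readability only, not the number of LLM calls made downstream.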

gemini-code-assist · 2025-12-23T16:37:50Z

baselines/EntiGraph/entigraph.py

+    if output_dir and not os.path.exists(output_dir):
+        os.makedirs(output_dir)


This check can be simplified by using the exist_ok=True argument in os.makedirs, which avoids raising an error if the directory already exists. This makes the code more concise.

    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

baselines/Genie/genie.py

baselines/LongForm/longform.py

baselines/SELF-QA/self-qa.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

refactor: refactor baselines

f2f9d7f

gemini-code-assist bot reviewed Dec 23, 2025

ChenZiHong-Gavin and others added 7 commits December 24, 2025 00:38

fix: fix lint problems

f79895d

Update baselines/SELF-QA/self-qa.py

7d4328c

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update baselines/LongForm/longform.py

91db558

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update baselines/Genie/genie.py

da51c35

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update entigraph.py

5dadbdb

Update wrap.py

8f03e14

fix: fix lint problem

f39644b

ChenZiHong-Gavin merged commit 67e6c4a into main Dec 23, 2025
4 checks passed

ChenZiHong-Gavin deleted the fix/update-baseline branch December 23, 2025 16:51
