@ChenZiHong-Gavin
Collaborator

This pull request refactors multiple baseline methods in the GraphGen project to simplify their interfaces, unify data handling logic, and improve code maintainability.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ChenZiHong-Gavin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant refactoring effort across multiple baseline methods within the GraphGen project. The core objective is to enhance code maintainability and simplify interfaces by centralizing LLM client management, standardizing data loading mechanisms, and consolidating related functionalities into more cohesive structures. This refactoring reduces boilerplate, improves error handling, and sets the stage for more consistent development of future baseline implementations.

Highlights

  • Centralized LLM Client: The OpenAIClient and Tokenizer from graphgen.models are now consistently used across all baseline implementations, replacing disparate LLM interaction methods.
  • EntiGraph Refactoring: The entigraph.py module has been completely restructured into a dataclass-based EntiGraph class, consolidating logic for entity, relation, and QA generation and removing several old helper files.
  • Standardized Data Input: The --data_type command-line argument has been removed from all baseline scripts, and a unified load_and_dedup_data function has been introduced in entigraph.py, simplifying how input data is handled.
  • Enhanced Robustness: Improved error handling and JSON parsing logic for LLM responses have been added, particularly in entigraph.py, to make the generation process more resilient to varied outputs.
  • Simplified Example Scripts: All generate_*.sh example scripts have been updated to reflect the streamlined command-line interface, removing the now-obsolete --data_type flag.
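
The "Enhanced Robustness" highlight mentions more tolerant JSON parsing of LLM responses without showing it. A minimal sketch of one common approach follows; parse_llm_json is a hypothetical helper and is not taken from entigraph.py. It tries the raw reply first, then any Markdown-fenced block, then the outermost bracketed span, and returns None instead of raising.

```python
import json
import re
from typing import Any, Optional

# Hypothetical helper for illustration; not the code in entigraph.py.
FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid a literal fence
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_llm_json(response: str) -> Optional[Any]:
    """Return the first JSON value recoverable from an LLM reply, or None."""
    # 1. The reply as-is.
    candidates = [response.strip()]
    # 2. Anything wrapped in a Markdown code fence.
    candidates += [block.strip() for block in _FENCED.findall(response)]
    # 3. The outermost {...} or [...] span embedded in surrounding prose.
    for open_ch, close_ch in (("{", "}"), ("[", "]")):
        start, end = response.find(open_ch), response.rfind(close_ch)
        if 0 <= start < end:
            candidates.append(response[start : end + 1])
    for text in candidates:
        try:
            return json.loads(text)
        except json.JSONDecodeError:
            continue  # try the next, looser candidate
    return None
```

Returning None lets the caller decide whether to retry the request or skip the record, rather than aborting a long generation run on one malformed reply.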

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot left a comment

Code Review

This pull request is a significant refactoring of the baseline methods, which greatly improves code structure and maintainability, particularly in the EntiGraph baseline. The introduction of a class-based approach and the simplification of interfaces are excellent changes. However, the goal of unifying data handling logic seems incomplete, as EntiGraph uses a robust data loader while other baselines use a simpler, inconsistent method. There is also a critical bug in wrap.py where it fails to create the output directory before writing results. I've left specific comments on these issues and other areas for improvement.

with open(args.input_file, "r", encoding="utf-8") as f:
    data = json.load(f)

results = wrap.generate(data)

critical

This script is missing the logic to create the output directory before saving the results. Unlike the other refactored baseline scripts, this one will raise a FileNotFoundError if the output directory (e.g., cache/data/) does not exist. This is a critical bug. Please add the directory creation logic before writing to the file, like so:

    # Save results
    output_dir = os.path.dirname(args.output_file)
    if output_dir:
        os.makedirs(output_dir, exist_ok=True)

    with open(args.output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=4, ensure_ascii=False)
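
The failure mode and the fix can be reproduced in isolation. The sketch below assumes nothing about wrap.py beyond what the comment states; save_results and the wrap.json filename are hypothetical names used only for the demonstration.

```python
import json
import os
import tempfile


def save_results(results, output_file):
    """Write results as JSON, creating the parent directory first if needed."""
    output_dir = os.path.dirname(output_file)
    if output_dir:  # empty string when output_file is a bare filename
        os.makedirs(output_dir, exist_ok=True)
    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=4, ensure_ascii=False)


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path whose parent directories do not exist yet.
    target = os.path.join(tmp, "cache", "data", "wrap.json")

    # Opening the file directly fails because cache/data/ is missing.
    try:
        open(target, "w", encoding="utf-8").close()
        raised = False
    except FileNotFoundError:
        raised = True

    # With the directory created up front, the write succeeds.
    save_results([{"question": "q", "answer": "a"}], target)
    with open(target, encoding="utf-8") as f:
        round_trip = json.load(f)
```

The `if output_dir:` guard matters because os.path.dirname returns an empty string for a bare filename, and os.makedirs("") raises an error.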

Comment on lines +130 to +131
with open(args.input_file, "r", encoding="utf-8") as f:
    data = json.load(f)

high

The data loading logic here is a simple json.load(f), which is inconsistent with the more robust load_and_dedup_data function in baselines/EntiGraph/entigraph.py. To fulfill the PR's goal of unifying data handling, this logic should be extracted into a shared utility and used across all baseline scripts.
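
One possible shape for such a shared utility is sketched below. The behavior of the real load_and_dedup_data is not shown in this thread, so everything here is an assumption: the function name load_unique_records, the support for both .json arrays and .jsonl files, and deduplication on a "content" field are all hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict, List


def load_unique_records(path: str, key: str = "content") -> List[Dict[str, Any]]:
    """Load a .json array or .jsonl file and drop records whose `key` repeats.

    Hypothetical sketch; the field name and file formats are assumptions.
    """
    with open(path, "r", encoding="utf-8") as f:
        if path.endswith(".jsonl"):
            # One JSON object per line; skip blank lines.
            records = [json.loads(line) for line in f if line.strip()]
        else:
            records = json.load(f)

    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is None:
            unique.append(record)  # nothing to compare on, keep it
            continue
        if value in seen:
            continue  # duplicate of an earlier record
        seen.add(value)
        unique.append(record)
    return unique


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "docs.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        f.write('{"content": "a"}\n{"content": "b"}\n\n{"content": "a"}\n')
    docs = load_unique_records(path)
```

Each baseline script could then replace its json.load(f) call with a single call to the shared function, which keeps first occurrences in their original order.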

Comment on lines +97 to +98
with open(args.input_file, "r", encoding="utf-8") as f:
    data = json.load(f)

high

The data loading logic here is a simple json.load(f), which is inconsistent with the more robust load_and_dedup_data function in baselines/EntiGraph/entigraph.py. For consistency and to meet the PR's goal of unifying data handling, a shared, robust data loading utility should be used across all baselines.

Comment on lines +167 to +168
with open(args.input_file, "r", encoding="utf-8") as f:
    data = json.load(f)

high

This script uses a simple json.load(f) for data loading, which is inconsistent with the more robust load_and_dedup_data function implemented in baselines/EntiGraph/entigraph.py. To truly unify data handling as stated in the PR description, this logic should be centralized in a shared utility and used here as well.

Comment on lines +114 to +115
with open(args.input_file, "r", encoding="utf-8") as f:
    data = json.load(f)

high

This script uses a simple json.load(f) for data loading. This is inconsistent with the robust load_and_dedup_data function in baselines/EntiGraph/entigraph.py. To unify data handling across baselines, a shared data loading utility should be implemented and used here.

Comment on lines +185 to +187
for i, entity_i in enumerate(entities):
    for j in range(i + 1, len(entities)):
        temp_pairs.append((doc_text, entity_i, entities[j]))

medium

This nested loop for creating entity pairs can be made more concise and Pythonic by using itertools.combinations. Please also add from itertools import combinations to the top of the file.

            for entity_i, entity_j in combinations(entities, 2):
                temp_pairs.append((doc_text, entity_i, entity_j))
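
A quick check that the two forms produce the same pairs in the same order; the entity names and document text are made-up sample values.

```python
from itertools import combinations

# Made-up sample values for illustration.
entities = ["Alice", "Bob", "Carol"]
doc_text = "example document"

# Original nested-loop form.
nested = []
for i, entity_i in enumerate(entities):
    for j in range(i + 1, len(entities)):
        nested.append((doc_text, entity_i, entities[j]))

# Suggested form: combinations yields position-based pairs in the same order.
combined = [(doc_text, a, b) for a, b in combinations(entities, 2)]
```

Because combinations works on positions rather than values, a repeated entity in the input produces repeated pairs in both versions, so the refactor does not change behavior in that case either.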

Comment on lines 325 to 326
if output_dir and not os.path.exists(output_dir):
    os.makedirs(output_dir)

medium

This check can be simplified by using the exist_ok=True argument in os.makedirs, which avoids raising an error if the directory already exists. This makes the code more concise.

    if output_dir:
        os.makedirs(output_dir, exist_ok=True)
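
A small demonstration of what exist_ok=True does and does not tolerate, using a temporary directory; the path names are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    output_dir = os.path.join(tmp, "cache", "data")

    os.makedirs(output_dir, exist_ok=True)  # creates both missing levels
    os.makedirs(output_dir, exist_ok=True)  # already exists: no error
    created = os.path.isdir(output_dir)

    # exist_ok only tolerates an existing directory, not an existing file.
    file_path = os.path.join(tmp, "not_a_dir")
    open(file_path, "w").close()
    try:
        os.makedirs(file_path, exist_ok=True)
        file_conflict = False
    except FileExistsError:
        file_conflict = True
```

Besides being shorter, the single call avoids the gap between the os.path.exists check and the os.makedirs call, during which another process could create the directory.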

ChenZiHong-Gavin and others added 7 commits December 24, 2025 00:38
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@ChenZiHong-Gavin merged commit 67e6c4a into main on Dec 23, 2025
4 checks passed
@ChenZiHong-Gavin deleted the fix/update-baseline branch on December 23, 2025 16:51