Merged
34 commits
f9dc25f
feat: add DFSPartitioner & BFSPartitioner
ChenZiHong-Gavin Oct 11, 2025
c0fa9f9
tests: add BFSPartitioner tests
ChenZiHong-Gavin Oct 11, 2025
4762477
Merge branch 'main' into partitioner
ChenZiHong-Gavin Oct 11, 2025
69e0d6a
feat: add community2batch method
ChenZiHong-Gavin Oct 13, 2025
b4431a2
feat: add AtomicGenerator
ChenZiHong-Gavin Oct 13, 2025
e462ce0
Merge branch 'partitioner' of https://github.com/open-sciencelab/Grap…
ChenZiHong-Gavin Oct 13, 2025
7d4f1e5
fix: fix typing
ChenZiHong-Gavin Oct 13, 2025
ed27cd1
refactor: adjust templates
ChenZiHong-Gavin Oct 13, 2025
a0de4a5
Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…
ChenZiHong-Gavin Oct 13, 2025
138d55b
Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…
ChenZiHong-Gavin Oct 13, 2025
2cd2e68
Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…
ChenZiHong-Gavin Oct 13, 2025
43c8196
Update tests/integration_tests/models/partitioner/test_dfs_partitione…
ChenZiHong-Gavin Oct 13, 2025
982ece5
Update graphgen/models/generator/atomic_generator.py
ChenZiHong-Gavin Oct 13, 2025
4c683b8
Update graphgen/models/generator/atomic_generator.py
ChenZiHong-Gavin Oct 13, 2025
dcae889
Merge branch 'partitioner' of https://github.com/open-sciencelab/Grap…
ChenZiHong-Gavin Oct 13, 2025
9ed56a6
feat: add ECEPartitioner
ChenZiHong-Gavin Oct 14, 2025
f072c2e
tests: add tests for ECEPartitioner
ChenZiHong-Gavin Oct 14, 2025
55667d7
feat: add AggregatedGenerator
ChenZiHong-Gavin Oct 14, 2025
c5958f8
fix: add param min_units_per_community
ChenZiHong-Gavin Oct 14, 2025
28a3584
feat: add MultiHopGenerator
ChenZiHong-Gavin Oct 14, 2025
a132627
Update tests/integration_tests/models/partitioner/test_ece_partitione…
ChenZiHong-Gavin Oct 14, 2025
8d71fdc
feat: add CoTGenerator
ChenZiHong-Gavin Oct 14, 2025
119718a
fix: fix TPM & RPM
ChenZiHong-Gavin Oct 14, 2025
c356c4f
fix: fix chinese tranlation
ChenZiHong-Gavin Oct 14, 2025
b4eed5b
Update graphgen/models/llm/limitter.py
ChenZiHong-Gavin Oct 14, 2025
1d4ef67
fix: use constants to distinguish nodes & edges
ChenZiHong-Gavin Oct 15, 2025
d80364d
Update tests/integration_tests/models/partitioner/test_dfs_partitione…
ChenZiHong-Gavin Oct 15, 2025
c986614
Update graphgen/evaluate.py
ChenZiHong-Gavin Oct 15, 2025
f243708
Update graphgen/models/partitioner/ece_partitioner.py
ChenZiHong-Gavin Oct 15, 2025
b964bcf
fix: move the “current usage” log line before adding the new request’…
ChenZiHong-Gavin Oct 15, 2025
55692c5
fix: use constants to distinguish nodes & edges
ChenZiHong-Gavin Oct 15, 2025
d53cfe2
Update graphgen/models/llm/openai_client.py
ChenZiHong-Gavin Oct 15, 2025
e601068
fix: log TPM counter before increment to show actual value that trigg…
ChenZiHong-Gavin Oct 15, 2025
fed4baa
Merge branch 'partitioner' of https://github.com/open-sciencelab/Grap…
ChenZiHong-Gavin Oct 15, 2025
2 changes: 1 addition & 1 deletion .pylintrc
@@ -308,7 +308,7 @@ max-public-methods=20
max-returns=6

# Maximum number of statements in function / method body.
-max-statements=50
+max-statements=60

# Minimum number of public methods for a class (see R0903).
min-public-methods=2
2 changes: 2 additions & 0 deletions graphgen/bases/__init__.py
@@ -1,5 +1,7 @@
from .base_generator import BaseGenerator
from .base_kg_builder import BaseKGBuilder
from .base_llm_client import BaseLLMClient
from .base_partitioner import BasePartitioner
from .base_reader import BaseReader
from .base_splitter import BaseSplitter
from .base_storage import (
84 changes: 84 additions & 0 deletions graphgen/bases/base_generator.py
@@ -0,0 +1,84 @@
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

from graphgen.bases.base_llm_client import BaseLLMClient


@dataclass
class BaseGenerator(ABC):
"""
Generate QAs based on given prompts.
"""

llm_client: BaseLLMClient

@staticmethod
@abstractmethod
def build_prompt(
batch: tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]
) -> str:
"""Build prompt for LLM based on the given batch"""

Review comment: The type hint for batch here (tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]) is slightly different from the batch parameter in the generate method (line 29), which allows tuple[Any, Any, Any] in the second list's elements. Since generate calls build_prompt with its batch, these type hints should be consistent to avoid type-checking issues or runtime errors if build_prompt receives a batch structure it doesn't explicitly declare it handles.

@staticmethod
@abstractmethod
def parse_response(response: str) -> Any:
"""Parse the LLM response and return the generated QAs"""

async def generate(

Review comment: The return type Any for parse_response is very broad. Based on its usage in generate and the format_generation_results method, it is expected to return a more specific structure, likely list[dict[str, Any]] or dict[str, Any]. Clarifying this will improve type safety and readability.

More critically, the current implementation of generate (line 39: result.update(qa_pairs)) implies qa_pairs should be a dictionary for dict.update() to work correctly. However, format_generation_results (line 44) expects results to be list[dict], suggesting qa_pairs might be a list of dictionaries. This is a significant inconsistency that needs to be resolved.

self,
batch: tuple[
list[tuple[str, dict]], list[tuple[Any, Any, dict] | tuple[Any, Any, Any]]
],
) -> dict[str, Any]:

Review comment: The type hint for the batch argument in the generate method (list[tuple[Any, Any, dict] | tuple[Any, Any, Any]] for the second element of the outer tuple) is broader than what the abstract build_prompt method expects (list[tuple[Any, Any, dict]]).

If generate passes a batch containing tuple[Any, Any, Any] elements in the second list, but build_prompt's implementation expects a dictionary as the third element of these tuples, it could lead to a TypeError or AttributeError at runtime when build_prompt tries to access dictionary methods on a non-dictionary object. This is a potential bug and a type-safety issue. Ensure the type hints are consistent, or update build_prompt's type hint to match generate's if it can indeed handle both kinds of tuples.

"""
Generate QAs based on a given batch.
:param batch
:return: QA pairs
"""
result = {}
prompt = self.build_prompt(batch)
response = await self.llm_client.generate_answer(prompt)

Review comment: This line will likely cause a TypeError if qa_pairs is a list (e.g., list[dict]), which is implied by the format_generation_results method that expects results: list[dict]. The dict.update() method expects a dictionary or an iterable of key-value pairs, not a list of dictionaries.

To fix this, clarify the expected structure of qa_pairs:

  1. If qa_pairs is a single dictionary: the return type of parse_response should be dict[str, Any], and format_generation_results would need to be called with [qa_pairs] if it is designed to process a list.
  2. If qa_pairs is a list of dictionaries: the generate method's return type should likely be list[dict[str, Any]], and this result initialization and update logic needs to be rethought; perhaps result should be a list, updated with result.extend(qa_pairs).

qa_pairs = self.parse_response(response) # generate one or more QA pairs
result.update(qa_pairs)
return result
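The reviewers' point about dict.update() can be reproduced in isolation (a standalone illustration, not code from the PR; the sample data is invented):

```python
# dict.update() accepts a mapping or an iterable of key/value pairs,
# so parse_response must return a dict for the line above to work.
result: dict = {}
qa_pairs_as_dict = {"qa-0": {"question": "Q?", "answer": "A."}}
result.update(qa_pairs_as_dict)  # OK: merges by key

qa_pairs_as_list = [{"qa-0": {"question": "Q?", "answer": "A."}}]
try:
    # Each iterated element must be a length-2 pair; a one-key dict is not.
    result.update(qa_pairs_as_list)
    failed = False
except ValueError:
    failed = True  # "dictionary update sequence element #0 has length 1"
```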

Review comment: The parse_response method's return type is Any, but the line result.update(qa_pairs) implies that qa_pairs must be a dictionary or an iterable of key-value pairs.

If parse_response is intended to return a list of dictionaries (e.g., list[dict[str, str]], as implied by the term "QA pairs" and the structure expected by format_generation_results when processing individual items), then result.update(qa_pairs) will raise a TypeError.

To ensure type safety and clarity, specify the return type of parse_response (e.g., dict[str, Any]) and ensure result.update() is used correctly. If parse_response returns a list, the logic for updating result needs to be adjusted accordingly to avoid a runtime error.

@staticmethod

Review comment: The type hint results: list[dict] is too general for the internal logic. The list comprehension for item in results for k, v in item.items() suggests that each item is itself a dictionary whose values v are dictionaries containing 'question' and 'answer' (i.e., list[dict[str, dict[str, str]]]).

Additionally, the key k (the ID from item.items()) is ignored in all the list comprehensions. If this key has no semantic importance, consider whether parse_response could return a simpler structure (e.g., list[dict[str, str]] directly) to avoid this intermediate mapping. If k is important, it should be included in the formatted output.

def format_generation_results(
results: list[dict], output_data_format: str
) -> list[dict[str, Any]]:
if output_data_format == "Alpaca":
results = [
{
"instruction": v["question"],
"input": "",
"output": v["answer"],
}
for item in results
for k, v in item.items()
]
elif output_data_format == "Sharegpt":
results = [
{
"conversations": [
{"from": "human", "value": v["question"]},
{"from": "gpt", "value": v["answer"]},
]
}
for item in results
for k, v in item.items()
]
elif output_data_format == "ChatML":
results = [
{
"messages": [
{"role": "user", "content": v["question"]},
{"role": "assistant", "content": v["answer"]},
]
}
for item in results
for k, v in item.items()
]
else:
raise ValueError(f"Unknown output data format: {output_data_format}")
return results
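A minimal standalone sketch of the Alpaca branch, assuming each result item maps a QA id to a dict with "question" and "answer" keys (the shape the comprehensions above imply; the data is invented):

```python
from typing import Any

def to_alpaca(results: list[dict]) -> list[dict[str, Any]]:
    # Mirrors the "Alpaca" branch above; note the QA id key `k` is dropped.
    return [
        {"instruction": v["question"], "input": "", "output": v["answer"]}
        for item in results
        for _k, v in item.items()
    ]

results = [{"qa-0": {"question": "Q1?", "answer": "A1."}}]
alpaca = to_alpaca(results)
```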
76 changes: 76 additions & 0 deletions graphgen/bases/base_partitioner.py
@@ -0,0 +1,76 @@
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List

from graphgen.bases.base_storage import BaseGraphStorage
from graphgen.bases.datatypes import Community


@dataclass

Review comment: @dataclass is typically used for classes that primarily hold data, automatically generating methods such as __init__ and __repr__. For an abstract base class (ABC) that primarily defines an interface with abstract methods, @dataclass is usually unnecessary and can be misleading, as no data fields are defined here that would benefit from it. Consider removing @dataclass if BasePartitioner is solely an interface.

class BasePartitioner(ABC):
@abstractmethod
async def partition(
self,
g: BaseGraphStorage,
**kwargs: Any,
) -> List[Community]:
"""
Graph -> Communities
:param g: Graph storage instance
:param kwargs: Additional parameters for partitioning
:return: List of communities
"""

@staticmethod
async def community2batch(
communities: List[Community], g: BaseGraphStorage
) -> list[
tuple[
list[tuple[str, dict]], list[tuple[Any, Any, dict] | tuple[Any, Any, Any]]
]
]:
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In community2batch, if await g.get_node(node) returns None or await g.get_edge(u, v) (or v, u) returns None, those nodes/edges are silently skipped. This might be a desired behavior if None implies quotno data available for this elementquot. However, if a community is expected to contain valid, retrievable nodes and edges from the graph storage, silently skipping them could mask inconsistencies or data issues. Consider adding logging or raising an error if a node/edge from a community cannot be found in the graph, depending on the expected behavior.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In community2batch, if await g.get_node(node) returns None or await g.get_edge(u, v) (or v, u) returns None, those nodes/edges are silently skipped. This might be a desired behavior if None implies quotno data available for this elementquot. However, if a community is expected to contain valid, retrievable nodes and edges from the graph storage, silently skipping them could mask inconsistencies or data issues. Consider adding logging or raising an error if a node/edge from a community cannot be found in the graph, depending on the expected behavior.

    Convert communities to batches of nodes and edges.
    :param communities
    :param g: Graph storage instance
    :return: List of batches, each batch is a tuple of (nodes, edges)
    """
    batches = []
    for comm in communities:
        nodes = comm.nodes
        edges = comm.edges
        nodes_data = []
        for node in nodes:
            node_data = await g.get_node(node)
            if node_data:
                nodes_data.append((node, node_data))
        edges_data = []
        for u, v in edges:
            edge_data = await g.get_edge(u, v)
            if edge_data:
                edges_data.append((u, v, edge_data))
            else:
                edge_data = await g.get_edge(v, u)
                if edge_data:
                    edges_data.append((v, u, edge_data))

The _build_adjacency_list method builds an undirected adjacency list and edge set by adding both (e[0], e[1]) and (e[1], e[0]). This assumes the graph (or at least the representation within the community context) is undirected. If the underlying BaseGraphStorage or the communities can represent directed graphs, this method might incorrectly represent the graph structure. Ensure this behavior is consistent with the intended graph model.


In _build_adjacency_list, the adjacency list adj is initialized only with nodes present in the nodes parameter: adj: dict[str, List[str]] = {n[0]: [] for n in nodes}.

If an edge e = (u, v, data) exists where u or v (or both) are not present in the nodes list, accessing adj[e[0]] or adj[e[1]] will raise a KeyError. This could happen if nodes is not a comprehensive list of all nodes involved in the edges list. Consider handling this case, for example, by ensuring all nodes involved in edges are pre-populated in adj, or by using a collections.defaultdict(list) for adj to automatically create entries for new nodes.
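A minimal sketch of the defaultdict approach this comment suggests, assuming the same (id, data) node tuples and (u, v, data) edge tuples used elsewhere in the PR:

```python
from collections import defaultdict

def build_adjacency_list(nodes, edges):
    """Variant of _build_adjacency_list that tolerates edge endpoints absent from `nodes`."""
    adj = defaultdict(list)  # missing endpoints get an entry on first access
    for node_id, _data in nodes:
        adj[node_id]  # touch the key so isolated nodes still appear
    edge_set = set()
    for u, v, _data in edges:
        adj[u].append(v)
        adj[v].append(u)
        edge_set.add((u, v))
        edge_set.add((v, u))
    return dict(adj), edge_set

adj, edge_set = build_adjacency_list(
    [("a", {})], [("a", "b", {}), ("b", "c", {})]  # "b" and "c" are not in nodes
)
```

Converting back with dict(adj) keeps the defaultdict behavior from leaking to callers.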

        batches.append((nodes_data, edges_data))
    return batches

    @staticmethod
    def _build_adjacency_list(
        nodes: List[tuple[str, dict]], edges: List[tuple[str, str, dict]]
    ) -> tuple[dict[str, List[str]], set[tuple[str, str]]]:
        """
        Build adjacency list and edge set from nodes and edges.
        :param nodes
        :param edges
        :return: adjacency list, edge set
        """
        adj: dict[str, List[str]] = {n[0]: [] for n in nodes}
        edge_set: set[tuple[str, str]] = set()
        for e in edges:
            adj[e[0]].append(e[1])
            adj[e[1]].append(e[0])
            edge_set.add((e[0], e[1]))
            edge_set.add((e[1], e[0]))
        return adj, edge_set
4 changes: 2 additions & 2 deletions graphgen/bases/base_storage.py
Original file line number Diff line number Diff line change
Expand Up @@ -78,7 +78,7 @@ async def get_node(self, node_id: str) -> Union[dict, None]:
async def update_node(self, node_id: str, node_data: dict[str, str]):
raise NotImplementedError

async def get_all_nodes(self) -> Union[list[dict], None]:
async def get_all_nodes(self) -> Union[list[tuple[str, dict]], None]:
raise NotImplementedError

The change to the return type hint Union[list[tuple[str, str, dict]], None] for get_all_edges is a significant improvement. It provides much clearer and more specific information about the structure of the edges returned by this method, enhancing maintainability and readability for anyone implementing or consuming BaseStorage. This is good API design for an abstract method.

This is a good improvement. Changing the return type hint from list[dict] to list[tuple[str, dict]] provides a more specific and accurate description of the expected node data (likely (node_id, node_data)), improving type clarity and maintainability for implementations of this abstract method.


This change updates the return type hint of get_all_nodes from list[dict] to list[tuple[str, dict]]. This is a breaking change to the interface of BaseStorage.

Any concrete implementation of BaseStorage that overrides this method will need to be updated to return list[tuple[str, dict]] to conform to the new contract. Please ensure all existing implementations are updated accordingly to avoid type-related bugs or runtime errors. This change likely implies that the node ID is now needed alongside its data when retrieving all nodes, which can be beneficial for downstream consumers.
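A sketch of what a conforming implementation might look like under the new contract; InMemoryGraphStorage is a hypothetical example, not an implementation from this repository:

```python
import asyncio
from typing import Union

class InMemoryGraphStorage:
    """Hypothetical BaseStorage subclass updated to the new return type."""

    def __init__(self):
        self._nodes: dict[str, dict] = {}

    async def get_all_nodes(self) -> Union[list[tuple[str, dict]], None]:
        # Old contract returned list[dict]; the new one pairs each id with its data.
        return list(self._nodes.items()) or None

storage = InMemoryGraphStorage()
storage._nodes["n1"] = {"type": "entity"}
nodes = asyncio.run(storage.get_all_nodes())
```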

async def get_edge(

This type hint refinement from list[dict] to list[tuple[str, str, dict]] is a positive change for clarity and type safety. It makes the structure of the returned edges much more explicit, which improves maintainability.

However, this change in the expected data structure is effectively a breaking change for any consumers of get_all_edges that were previously expecting a generic dictionary for each edge. Please ensure all call sites are updated to correctly handle the new (source, target, data) tuple format.
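A sketch of how an updated call site might unpack the new tuple format; _StubStorage and edge_summary are hypothetical names used only for illustration:

```python
import asyncio

class _StubStorage:
    # Hypothetical storage already conforming to the new contract.
    async def get_all_edges(self):
        return [("a", "b", {"relation": "cites"}), ("b", "c", {"relation": "extends"})]

async def edge_summary(storage):
    # Old consumers iterated plain dicts; the new format unpacks (source, target, data).
    edges = await storage.get_all_edges() or []
    return [f"{u} -> {v} ({data.get('relation', '?')})" for u, v, data in edges]

summary = asyncio.run(edge_summary(_StubStorage()))
```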

Expand All @@ -91,7 +91,7 @@ async def update_edge(
):
raise NotImplementedError

async def get_all_edges(self) -> Union[list[dict], None]:
async def get_all_edges(self) -> Union[list[tuple[str, str, dict]], None]:
raise NotImplementedError

async def get_node_edges(
Expand Down
8 changes: 8 additions & 0 deletions graphgen/bases/datatypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,11 @@ class Token:
@property
def logprob(self) -> float:
return math.log(self.prob)


@dataclass
class Community:
id: Union[int, str]
nodes: List[str] = field(default_factory=list)
edges: List[tuple] = field(default_factory=list)

The type hint for edges (List[tuple]) is quite broad. Given that nodes are List[str], it's likely that edges are tuples of node IDs (e.g., (source_node_id, target_node_id)).

Consider making this more specific for better type checking and clarity, for example:

  • List[Tuple[str, str]] for unweighted edges.
  • List[Tuple[str, str, Any]] if edges can have weights or other attributes.

This improves maintainability by making the expected structure of edge data explicit.
metadata: dict = field(default_factory=dict)
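A sketch of the narrower hints the review suggests, assuming unweighted (source, target) edges; whether edges carry extra attributes is not confirmed by the source:

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

@dataclass
class Community:
    id: Union[int, str]
    nodes: List[str] = field(default_factory=list)
    # List[Tuple[str, str]] per the review suggestion; use
    # List[Tuple[str, str, Any]] instead if edges can carry attributes.
    edges: List[Tuple[str, str]] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

c = Community(id=0, nodes=["a", "b"], edges=[("a", "b")])
```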
12 changes: 4 additions & 8 deletions graphgen/configs/aggregated_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,10 @@ quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
partition: # graph partition configuration
method: ece # ece is a custom partition method based on comprehension loss
method_params:
bidirectional: true # whether to traverse the graph in both directions
edge_sampling: max_loss # edge sampling strategy, support: random, max_loss, min_loss
expand_method: max_width # expand method, support: max_width, max_depth
isolated_node_strategy: ignore # strategy for isolated nodes, support: ignore, add
max_depth: 5 # maximum depth for graph traversal
max_extra_edges: 20 # max edges per direction (if expand_method="max_width")
max_tokens: 256 # restricts input length (if expand_method="max_tokens")
loss_strategy: only_edge # defines loss computation focus, support: only_edge, both
max_units_per_community: 20 # max nodes and edges per community
min_units_per_community: 5 # min nodes and edges per community
max_tokens_per_community: 10240 # max tokens per community
unit_sampling: max_loss # edge sampling strategy, support: random, max_loss, min_loss
generate:
mode: aggregated # atomic, aggregated, multi_hop, cot
data_format: ChatML # Alpaca, Sharegpt, ChatML
11 changes: 2 additions & 9 deletions graphgen/configs/atomic_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,16 +11,9 @@ quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
quiz_samples: 2 # number of quiz samples to generate
re_judge: false # whether to re-judge the existing quiz samples
partition: # graph partition configuration
method: ece # ece is a custom partition method based on comprehension loss
method: dfs # partition method, support: dfs, bfs, ece, leiden
method_params:
bidirectional: true # whether to traverse the graph in both directions
edge_sampling: max_loss # edge sampling strategy, support: random, max_loss, min_loss
expand_method: max_width # expand method, support: max_width, max_depth
isolated_node_strategy: ignore # strategy for isolated nodes, support: ignore, add
max_depth: 3 # maximum depth for graph traversal
max_extra_edges: 5 # max edges per direction (if expand_method="max_width")
max_tokens: 256 # restricts input length (if expand_method="max_tokens")
loss_strategy: only_edge # defines loss computation focus, support: only_edge, both
max_units_per_community: 1 # atomic partition, one node or edge per community
generate:
mode: atomic # atomic, aggregated, multi_hop, cot
data_format: Alpaca # Alpaca, Sharegpt, ChatML
6 changes: 3 additions & 3 deletions graphgen/configs/cot_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,11 @@ search: # web search configuration
quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
enabled: false
partition: # graph partition configuration
method: leiden # leiden is a community detection algorithm
method: leiden # leiden is a partitioner detection algorithm
method_params:
max_size: 20 # Maximum size of communities
use_lcc: false
random_seed: 42
use_lcc: false # whether to use the largest connected component
random_seed: 42 # random seed for partitioning
generate:
mode: cot # atomic, aggregated, multi_hop, cot
data_format: Sharegpt # Alpaca, Sharegpt, ChatML
12 changes: 4 additions & 8 deletions graphgen/configs/multi_hop_config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -13,14 +13,10 @@ quiz_and_judge: # quiz and test whether the LLM masters the knowledge points
partition: # graph partition configuration
method: ece # ece is a custom partition method based on comprehension loss
method_params:
bidirectional: true # whether to traverse the graph in both directions
edge_sampling: max_loss # edge sampling strategy, support: random, max_loss, min_loss
expand_method: max_width # expand method, support: max_width, max_depth
isolated_node_strategy: ignore # strategy for isolated nodes, support: ignore, add
max_depth: 1 # maximum depth for graph traversal
max_extra_edges: 2 # max edges per direction (if expand_method="max_width")
max_tokens: 256 # restricts input length (if expand_method="max_tokens")
loss_strategy: only_edge # defines loss computation focus, support: only_edge, both
max_units_per_community: 3 # max nodes and edges per community, for multi-hop, we recommend setting it to 3
min_units_per_community: 3 # min nodes and edges per community, for multi-hop, we recommend setting it to 3
max_tokens_per_community: 10240 # max tokens per community
unit_sampling: random # edge sampling strategy, support: random, max_loss, min_loss
generate:
mode: multi_hop # strategy for generating multi-hop QA pairs
data_format: ChatML # Alpaca, Sharegpt, ChatML
66 changes: 14 additions & 52 deletions graphgen/graphgen.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,21 +18,14 @@
from graphgen.operators import (
build_kg,
chunk_documents,
generate_cot,
generate_qas,
judge_statement,
partition_kg,
quiz,
read_files,

The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module's prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers.
search_all,
traverse_graph_for_aggregated,
traverse_graph_for_atomic,
traverse_graph_for_multi_hop,
)
from graphgen.utils import (
async_to_sync_method,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The format_generation_results utility function is also no longer imported. Please confirm that results formatting is now handled elsewhere, is no longer needed, or that a new mechanism is in place to ensure downstream consumers receive correctly formatted output. This is important for data consistency and maintainability, especially in the context of the #x27Generator#x27 refactor mentioned in the PR title. Providing context in the PR description for such changes would be beneficial.

compute_content_hash,
format_generation_results,
logger,
)
from graphgen.utils import async_to_sync_method, compute_content_hash, logger

sys_path = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))

The previous implementation passed partition_config["method_params"] to various graph traversal functions. While partition_kg now receives partition_config directly, please ensure that all necessary parameters previously extracted from method_params are correctly handled and propagated within partition_kg if they are still required for the partitioning logic. This is important for functional correctness.
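One defensive way to address this concern is for the partitioning entry point to accept both the old nested layout and a flat layout. The helper below is a minimal sketch under that assumption; the config keys shown are illustrative, not the project's actual schema.

```python
# Hypothetical parameter extraction for a partition_kg-style entry point.
# Tolerates both the pre-refactor nested "method_params" layout and a
# flattened post-refactor layout.
from typing import Any, Dict


def extract_method_params(partition_config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the method parameters regardless of config layout."""
    if "method_params" in partition_config:
        # Old-style config: parameters nested under "method_params".
        return dict(partition_config["method_params"])
    # New-style config: everything except the method name is a parameter.
    return {k: v for k, v in partition_config.items() if k != "method"}
```

A check like this makes the refactor tolerant of callers that still pass the nested shape.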

@@ -238,51 +231,20 @@ async def quiz_and_judge(self, quiz_and_judge_config: Dict):
@async_to_sync_method
async def generate(self, partition_config: Dict, generate_config: Dict):

The original generate method passed several instance attributes like self.tokenizer_instance, self.text_chunks_storage, and self.progress_bar to the graph traversal functions (e.g., traverse_graph_for_atomic). These are no longer explicitly passed to generate_qas.

Could you confirm how these dependencies are managed now? If the logic within generate_qas (or functions it calls) still relies on these components, they either need to be passed as arguments or accessible via other means. Missing these could lead to runtime errors or incorrect behavior.

Additionally, the format_generation_results call, which used generate_config["data_format"], has been removed. Please confirm that generate_qas now handles the final data formatting internally, ensuring that the output_data_format specified in generate_config is still respected.
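One common way to resolve the dependency question raised above is to bundle the instance attributes the old traversal functions received into a single context object. The sketch below is an assumption for illustration (the dataclass and its field names are not the project's actual API); it only mirrors the attributes visible in this diff.

```python
# Hypothetical context bundle for the refactored generation pipeline.
from dataclasses import dataclass
from typing import Any


@dataclass
class GenerationContext:
    """Groups the components the old traversal functions took as positional
    arguments, so a generate_qas-style function can receive them in one place."""

    llm_client: Any
    tokenizer: Any
    text_chunks_storage: Any = None
    progress_bar: Any = None


def build_context(graphgen: Any) -> GenerationContext:
    # Pull the same instance attributes the old generate() passed explicitly.
    return GenerationContext(
        llm_client=graphgen.synthesizer_llm_client,
        tokenizer=graphgen.tokenizer_instance,
        text_chunks_storage=graphgen.text_chunks_storage,
        progress_bar=graphgen.progress_bar,
    )
```

If the new generate_qas does not accept these components, confirming where the tokenizer, chunk storage, and progress bar now live would answer the review question directly.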

# Step 1: partition the graph
# TODO: implement graph partitioning, e.g. Partitioner().partition(self.graph_storage)
mode = generate_config["mode"]
if mode == "atomic":
results = await traverse_graph_for_atomic(
self.synthesizer_llm_client,
self.tokenizer_instance,
self.graph_storage,
partition_config["method_params"],
self.text_chunks_storage,
self.progress_bar,
)
elif mode == "multi_hop":
results = await traverse_graph_for_multi_hop(
self.synthesizer_llm_client,
self.tokenizer_instance,
self.graph_storage,
partition_config["method_params"],
self.text_chunks_storage,
self.progress_bar,
)
elif mode == "aggregated":
results = await traverse_graph_for_aggregated(
self.synthesizer_llm_client,
self.tokenizer_instance,
self.graph_storage,
partition_config["method_params"],
self.text_chunks_storage,
self.progress_bar,
)
elif mode == "cot":
results = await generate_cot(
self.graph_storage,
self.synthesizer_llm_client,
method_params=partition_config["method_params"],
)
else:
raise ValueError(f"Unknown generation mode: {mode}")
# Step 2: generate QA pairs
# TODO
batches = await partition_kg(
self.graph_storage, self.tokenizer_instance, partition_config
)

# Step 3: format
results = format_generation_results(
results, output_data_format=generate_config["data_format"]
)
# Step 2: generate QA pairs
results = await generate_qas(
self.synthesizer_llm_client, batches, generate_config
)

if not results:
logger.warning("No QA pairs generated")
return

# Step 3: store the generated QA pairs
await self.qa_storage.upsert(results)
await self.qa_storage.index_done_callback()
