refactor: Partitioner & Generator #59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
f9dc25f
c0fa9f9
4762477
69e0d6a
b4431a2
e462ce0
7d4f1e5
ed27cd1
a0de4a5
138d55b
2cd2e68
43c8196
982ece5
4c683b8
dcae889
9ed56a6
f072c2e
55667d7
c5958f8
28a3584
a132627
8d71fdc
119718a
c356c4f
b4eed5b
1d4ef67
d80364d
c986614
f243708
b964bcf
55692c5
d53cfe2
e601068
fed4baa
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
@@ -0,0 +1,84 @@
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any

from graphgen.bases.base_llm_client import BaseLLMClient


@dataclass
class BaseGenerator(ABC):
    """
    Generate QAs based on given prompts.
    """

    llm_client: BaseLLMClient

    @staticmethod
    @abstractmethod
    def build_prompt(
        batch: tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]
    ) -> str:
        """Build prompt for LLM based on the given batch"""
```
Review comment: The type hint for `batch` here (`tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]`) is slightly different from the `batch` parameter in the `generate` method (line 29), which allows `tuple[Any, Any, Any]` in the second list's elements. Since `generate` calls `build_prompt` with its `batch`, these type hints should be consistent to avoid potential type-checking issues or runtime errors if `build_prompt` receives a batch structure it doesn't explicitly declare it handles.
```python
    @staticmethod
    @abstractmethod
    def parse_response(response: str) -> Any:
        """Parse the LLM response and return the generated QAs"""
```

Review comment: The return type `Any` for `parse_response` is very broad. Based on its usage in `generate` and the `format_generation_results` method, it is expected to return a more specific structure, likely `list[dict[str, Any]]` or `dict[str, Any]`. Clarifying this will improve type safety and readability. More critically, the current implementation of `generate` (line 39: `result.update(qa_pairs)`) implies `qa_pairs` should be a dictionary for `dict.update()` to work correctly. However, `format_generation_results` (line 44) expects `results` to be `list[dict]`, suggesting `qa_pairs` might be a list of dictionaries. This is a significant inconsistency that needs to be resolved.

Review comment: The type hint for the `batch` argument in the `generate` method (`list[tuple[Any, Any, dict] | tuple[Any, Any, Any]]` for the second element of the outer tuple) is broader than what the abstract `build_prompt` method expects (`list[tuple[Any, Any, dict]]`). If `generate` passes a batch containing `tuple[Any, Any, Any]` elements in the second list, but `build_prompt`'s implementation expects a dictionary as the third element of these tuples, it could lead to a `TypeError` or `AttributeError` at runtime when `build_prompt` tries to access dictionary methods on a non-dictionary object. This is a potential bug and a type-safety issue. Ensure the type hints are consistent, or update `build_prompt`'s hint to match `generate`'s if it can indeed handle both tuple shapes.

```python
    async def generate(
        self,
        batch: tuple[
            list[tuple[str, dict]], list[tuple[Any, Any, dict] | tuple[Any, Any, Any]]
        ],
    ) -> dict[str, Any]:
        """
        Generate QAs based on a given batch.
        :param batch
        :return: QA pairs
        """
        result = {}
        prompt = self.build_prompt(batch)
        response = await self.llm_client.generate_answer(prompt)
        qa_pairs = self.parse_response(response)  # generate one or more QA pairs
        result.update(qa_pairs)
        return result
```

Review comment: The `result.update(qa_pairs)` line will likely cause a `TypeError` if `qa_pairs` is a list (e.g., `list[dict]`), which is implied by the `format_generation_results` method that expects `results: list[dict]`. The `dict.update()` method expects a dictionary or an iterable of key-value pairs, not a list of dictionaries. To fix this, clarify the expected structure of `qa_pairs`.
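The failure mode described above can be checked directly. Note that with two-key dicts `dict.update()` does not even raise: each dict is iterated as a sequence of its keys and misread as a (key, value) pair, so the merge silently corrupts the result. An illustrative snippet (the `qa-{i}` key scheme in the fix is an assumption, not from the PR):

```python
# dict.update() treats each element of a list as a (key, value) pair.
# A two-key dict iterates as its two keys, so the merge silently
# produces nonsense instead of raising:
result = {}
qa_list = [{"question": "Q1", "answer": "A1"}]  # list[dict] shape
result.update(qa_list)
print(result)  # {'question': 'answer'}

# If parse_response returns a list, merge it explicitly instead
# (the string keys here are an assumed id scheme):
fixed: dict = {}
for i, qa in enumerate(qa_list):
    fixed[f"qa-{i}"] = qa
```

A dict with any other number of keys would instead raise `ValueError` from the same call, so the behavior is inconsistent as well as wrong.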
Review comment: The `parse_response` method's return type is `Any`, but the line `result.update(qa_pairs)` implies that `qa_pairs` must be a dictionary or an iterable of key-value pairs. If `parse_response` is intended to return a list of dictionaries (e.g., `list[dict[str, str]]`, as implied by the term "QA pairs" and the structure expected by `format_generation_results` when processing individual items), then `result.update(qa_pairs)` will raise a `TypeError`. To ensure type safety and clarity, specify the return type of `parse_response` (e.g., `dict[str, Any]`) and ensure `result.update()` is used correctly. If `parse_response` returns a list, the logic for updating `result` needs to be adjusted accordingly to avoid a runtime error.
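A concrete contract that keeps `result.update()` valid could look like the following sketch; the JSON shape and `qa-0` key are assumptions for illustration, not the PR's actual response format:

```python
import json


def parse_response(response: str) -> dict[str, dict[str, str]]:
    # Assumed response shape: {"qa-0": {"question": ..., "answer": ...}, ...}
    # Returning a dict keyed by QA id keeps result.update() well-defined.
    return json.loads(response)
```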
Review comment: The type hint `results: list[dict]` is too general for the internal logic. The list comprehension `for item in results for k, v in item.items()` suggests that `item` is a dictionary itself, and `v` is another dictionary containing `'question'` and `'answer'` (e.g., `list[dict[str, dict[str, str]]]`). Additionally, the key `k` (the ID from `item.items()`) is ignored in all the list comprehensions. If this key has no semantic importance, consider whether `parse_response` could return a simpler structure (e.g., `list[dict[str, str]]` directly) to avoid this intermediate mapping. If `k` is important, it should be included in the formatted output.

```python
    @staticmethod
    def format_generation_results(
        results: list[dict], output_data_format: str
    ) -> list[dict[str, Any]]:
        if output_data_format == "Alpaca":
            results = [
                {
                    "instruction": v["question"],
                    "input": "",
                    "output": v["answer"],
                }
                for item in results
                for k, v in item.items()
            ]
        elif output_data_format == "Sharegpt":
            results = [
                {
                    "conversations": [
                        {"from": "human", "value": v["question"]},
                        {"from": "gpt", "value": v["answer"]},
                    ]
                }
                for item in results
                for k, v in item.items()
            ]
        elif output_data_format == "ChatML":
            results = [
                {
                    "messages": [
                        {"role": "user", "content": v["question"]},
                        {"role": "assistant", "content": v["answer"]},
                    ]
                }
                for item in results
                for k, v in item.items()
            ]
        else:
            raise ValueError(f"Unknown output data format: {output_data_format}")
        return results
```
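Replicating the "Alpaca" branch on a hypothetical input illustrates the `list[dict[str, dict]]` shape the comprehension assumes (the `qa-0` key and QA text are made-up sample data):

```python
# Sample input mirroring the shape format_generation_results expects:
results = [{"qa-0": {"question": "What is X?", "answer": "X is Y."}}]

# Standalone copy of the "Alpaca" comprehension; the id key is discarded.
alpaca = [
    {"instruction": v["question"], "input": "", "output": v["answer"]}
    for item in results
    for _k, v in item.items()
]
print(alpaca[0]["instruction"])  # What is X?
```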
@@ -0,0 +1,76 @@
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List

from graphgen.bases.base_storage import BaseGraphStorage
from graphgen.bases.datatypes import Community


@dataclass
class BasePartitioner(ABC):
    @abstractmethod
    async def partition(
        self,
        g: BaseGraphStorage,
        **kwargs: Any,
    ) -> List[Community]:
        """
        Graph -> Communities
        :param g: Graph storage instance
        :param kwargs: Additional parameters for partitioning
        :return: List of communities
        """
```

Review comment: `@dataclass` is typically used for classes primarily holding data, automatically generating methods like `__init__`, `__repr__`, etc. For an abstract base class (ABC) that primarily defines an interface with abstract methods, `@dataclass` is usually unnecessary and can be misleading, as the class doesn't define any data fields here that would benefit from its features. Consider removing `@dataclass` if `BasePartitioner` is solely an interface.
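If `BasePartitioner` stays a pure interface, the suggestion above amounts to a plain ABC; a minimal sketch (abbreviated signature, not the PR's final code):

```python
from abc import ABC, abstractmethod


class BasePartitioner(ABC):  # no @dataclass: the class holds no fields
    @abstractmethod
    async def partition(self, g, **kwargs):
        """Graph -> Communities"""
```

Instantiating the class still fails with `TypeError` as expected for an ABC with an unimplemented abstract method.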
```python
    @staticmethod
    async def community2batch(
        communities: List[Community], g: BaseGraphStorage
    ) -> list[
        tuple[
            list[tuple[str, dict]], list[tuple[Any, Any, dict] | tuple[Any, Any, Any]]
        ]
    ]:
        """
        Convert communities to batches of nodes and edges.
        :param communities
        :param g: Graph storage instance
        :return: List of batches, each batch is a tuple of (nodes, edges)
        """
        batches = []
        for comm in communities:
            nodes = comm.nodes
            edges = comm.edges
            nodes_data = []
            for node in nodes:
                node_data = await g.get_node(node)
                if node_data:
                    nodes_data.append((node, node_data))
            edges_data = []
            for u, v in edges:
                edge_data = await g.get_edge(u, v)
                if edge_data:
                    edges_data.append((u, v, edge_data))
                else:
                    edge_data = await g.get_edge(v, u)
                    if edge_data:
                        edges_data.append((v, u, edge_data))
            batches.append((nodes_data, edges_data))
        return batches
```

Review comment: In `community2batch`, if `await g.get_node(node)` returns `None` or `await g.get_edge(u, v)` (or `v, u`) returns `None`, those nodes/edges are silently skipped. This might be desired behavior if `None` implies "no data available for this element". However, if a community is expected to contain valid, retrievable nodes and edges from the graph storage, silently skipping them could mask inconsistencies or data issues. Consider adding logging or raising an error if a node/edge from a community cannot be found in the graph, depending on the expected behavior.

Review comment: The `_build_adjacency_list` method builds an undirected adjacency list and edge set by adding both `(e[0], e[1])` and `(e[1], e[0])`. This assumes the graph (or at least the representation within the community context) is undirected. If the underlying `BaseGraphStorage` or the communities can represent directed graphs, this method might incorrectly represent the graph structure. Ensure this behavior is consistent with the intended graph model.

Review comment: In `_build_adjacency_list`, the adjacency list `adj` is initialized only with nodes present in the `nodes` parameter: `adj: dict[str, List[str]] = {n[0]: [] for n in nodes}`. If an edge `e = (u, v, data)` exists where `u` or `v` (or both) are not present in the `nodes` list, accessing `adj[e[0]]` or `adj[e[1]]` will raise a `KeyError`. This could happen if `nodes` is not a comprehensive list of all nodes involved in the `edges` list. Consider handling this case, for example, by ensuring all nodes involved in edges are pre-populated in `adj`, or by using a `collections.defaultdict(list)` for `adj` to automatically create entries for new nodes.
|
|
||
| @staticmethod | ||
| def _build_adjacency_list( | ||
| nodes: List[tuple[str, dict]], edges: List[tuple[str, str, dict]] | ||
| ) -> tuple[dict[str, List[str]], set[tuple[str, str]]]: | ||
| """ | ||
| Build adjacency list and edge set from nodes and edges. | ||
| :param nodes | ||
| :param edges | ||
| :return: adjacency list, edge set | ||
| """ | ||
| adj: dict[str, List[str]] = {n[0]: [] for n in nodes} | ||
| edge_set: set[tuple[str, str]] = set() | ||
| for e in edges: | ||
| adj[e[0]].append(e[1]) | ||
| adj[e[1]].append(e[0]) | ||
| edge_set.add((e[0], e[1])) | ||
| edge_set.add((e[1], e[0])) | ||
| return adj, edge_set | ||
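To make the KeyError concern above concrete, here is a minimal sketch of the defaultdict variant the review suggests. The function name and the (id, data) / (u, v, data) tuple shapes are assumed from the diff; this is an illustration, not the PR's implementation:

```python
from collections import defaultdict
from typing import List


def build_adjacency_list(
    nodes: List[tuple[str, dict]], edges: List[tuple[str, str, dict]]
) -> tuple[dict[str, List[str]], set[tuple[str, str]]]:
    # defaultdict(list) tolerates edge endpoints missing from `nodes`,
    # avoiding the KeyError the review points out.
    adj: dict[str, List[str]] = defaultdict(list)
    for node_id, _data in nodes:
        _ = adj[node_id]  # touching the key gives isolated nodes an empty entry
    edge_set: set[tuple[str, str]] = set()
    for u, v, _data in edges:
        adj[u].append(v)
        adj[v].append(u)  # undirected: mirror every edge
        edge_set.add((u, v))
        edge_set.add((v, u))
    return dict(adj), edge_set
```

Whether mirroring every edge is correct still depends on the graph model being genuinely undirected, as the first comment notes.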
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -78,7 +78,7 @@ async def get_node(self, node_id: str) -> Union[dict, None]: | |
| async def update_node(self, node_id: str, node_data: dict[str, str]): | ||
| raise NotImplementedError | ||
|
|
||
| async def get_all_nodes(self) -> Union[list[dict], None]: | ||
| async def get_all_nodes(self) -> Union[list[tuple[str, dict]], None]: | ||
| raise NotImplementedError | ||
The change to the return type hint Union[list[tuple[str, str, dict]], None] for get_all_edges is a significant improvement. It provides much clearer and more specific information about the structure of the edges returned by this method, enhancing maintainability and readability for anyone implementing or consuming BaseStorage. This is good API design for an abstract method.
Changing the return type hint of get_all_nodes from list[dict] to list[tuple[str, dict]] is likewise a good improvement: it gives a more specific and accurate description of the expected node data (likely (node_id, node_data)), improving type clarity and maintainability for implementations of this abstract method.
||
|
|
||
This change updates the return type hint of get_all_nodes from list[dict] to list[tuple[str, dict]]. This is a breaking change to the BaseStorage interface: any concrete implementation that overrides this method must be updated to return list[tuple[str, dict]] to conform to the new contract. Please ensure all existing implementations are updated accordingly to avoid type-related bugs or runtime errors. The change likely means the node ID is now needed alongside its data when retrieving all nodes, which can be beneficial for downstream consumers.
||
| async def get_edge( | ||
|
This type hint refinement from list[dict] to list[tuple[str, str, dict]] is a positive change for clarity and type safety: it makes the structure of the returned edges explicit, which improves maintainability. However, it is effectively a breaking change for any consumers of get_all_edges that previously expected a generic dictionary per edge. Please ensure all call sites are updated to correctly handle the new (source, target, data) tuple format.
||
|
|
@@ -91,7 +91,7 @@ async def update_edge( | |
| ): | ||
| raise NotImplementedError | ||
|
|
||
| async def get_all_edges(self) -> Union[list[dict], None]: | ||
| async def get_all_edges(self) -> Union[list[tuple[str, str, dict]], None]: | ||
| raise NotImplementedError | ||
|
|
||
| async def get_node_edges( | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -30,3 +30,11 @@ class Token: | |
| @property | ||
| def logprob(self) -> float: | ||
| return math.log(self.prob) | ||
|
|
||
|
|
||
| @dataclass | ||
| class Community: | ||
| id: Union[int, str] | ||
| nodes: List[str] = field(default_factory=list) | ||
| edges: List[tuple] = field(default_factory=list) | ||
The type hint for edges (List[tuple]) is quite broad. Given that nodes are List[str], edges are likely tuples of node IDs (e.g., (source_node_id, target_node_id)), possibly with attached edge data. Consider making the hint more specific for better type checking and clarity; this improves maintainability by making the expected structure of edge data explicit.
||
| metadata: dict = field(default_factory=dict) | ||
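A sketch of the narrower hint the review suggests, assuming each edge is a (source_id, target_id, edge_data) triple to match the tuple[str, str, dict] shape used by the storage layer in this PR (an assumption, since Community itself does not document the tuple contents):

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Community:
    id: Union[int, str]
    nodes: List[str] = field(default_factory=list)
    # Narrower than List[tuple]: each edge is (source_id, target_id, edge_data).
    edges: List[Tuple[str, str, dict]] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
```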
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,21 +18,14 @@ | |
| from graphgen.operators import ( | ||
| build_kg, | ||
| chunk_documents, | ||
| generate_cot, | ||
| generate_qas, | ||
| judge_statement, | ||
| partition_kg, | ||
| quiz, | ||
| read_files, | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. 
Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. 
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The removal of traverse_graph_for_aggregated, traverse_graph_for_atomic, and traverse_graph_for_multi_hop from graphgen.operators imports is a significant change. Please confirm that these graph traversal functions are no longer required by graphgen.py or that their functionality has been properly superseded or moved as part of the refactor. If other parts of the system depend on this module#x27s prior usage or implicit re-export of these, this could lead to unexpected behavior. A brief explanation in the PR description regarding these removals would greatly improve maintainability for future developers. |
||
| search_all, | ||
| traverse_graph_for_aggregated, | ||
| traverse_graph_for_atomic, | ||
| traverse_graph_for_multi_hop, | ||
| ) | ||
| from graphgen.utils import ( | ||
| async_to_sync_method, | ||
|
The format_generation_results utility function is also no longer imported. Please confirm that results formatting is now handled elsewhere, is no longer needed, or that a new mechanism is in place to ensure downstream consumers receive correctly formatted output. This is important for data consistency and maintainability, especially in the context of the 'Generator' refactor mentioned in the PR title. Providing context in the PR description for such changes would be beneficial.
||
| compute_content_hash, | ||
| format_generation_results, | ||
| logger, | ||
| ) | ||
| from graphgen.utils import async_to_sync_method, compute_content_hash, logger | ||
|
|
||
| sys_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "..")) | ||
|
|
||
|
The previous implementation passed partition_config["method_params"] to various graph traversal functions. While partition_kg now receives partition_config directly, please ensure that all necessary parameters previously extracted from method_params are correctly handled and propagated within partition_kg if they are still required for the partitioning logic. This is important for functional correctness.
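For illustration, one way partition_kg could keep honoring a legacy "method_params" sub-dict is to merge it into the top-level config before use. This is only a sketch; resolve_partition_params and the config keys are hypothetical, not part of the actual PR:

```python
# Illustrative sketch only: one way a partition entry point could keep
# honoring the old "method_params" sub-dict. Names here are hypothetical.
from typing import Any, Dict


def resolve_partition_params(partition_config: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a legacy "method_params" sub-dict into the top-level config.

    Top-level keys win, so callers that already migrated are unaffected.
    """
    legacy = partition_config.get("method_params", {})
    top_level = {k: v for k, v in partition_config.items() if k != "method_params"}
    return {**legacy, **top_level}


# Old-style and migrated configs resolve to the same effective parameters.
old_style = {"method": "bfs", "method_params": {"max_depth": 2}}
new_style = {"method": "bfs", "max_depth": 2}
assert resolve_partition_params(old_style) == resolve_partition_params(new_style)
```

A helper like this would let partition_kg accept both config shapes during the transition, instead of silently dropping parameters that still live under "method_params".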
||
|
|
@@ -238,51 +231,20 @@ async def quiz_and_judge(self, quiz_and_judge_config: Dict): | |
| @async_to_sync_method | ||
| async def generate(self, partition_config: Dict, generate_config: Dict): | ||
|
The original generate method passed several instance attributes like self.tokenizer_instance, self.text_chunks_storage, and self.progress_bar to the graph traversal functions (e.g., traverse_graph_for_atomic). These are no longer explicitly passed to generate_qas. Could you confirm how these dependencies are managed now? If the logic within generate_qas (or functions it calls) still relies on these components, they either need to be passed as arguments or accessible via other means. Missing these could lead to runtime errors or incorrect behavior. Additionally, the format_generation_results call, which used generate_config["data_format"], has been removed. Please confirm that generate_qas now handles the final data formatting internally, ensuring that the output_data_format specified in generate_config is still respected.
||
| # Step 1: partition the graph | ||
| # TODO: implement graph partitioning, e.g. Partitioner().partition(self.graph_storage) | ||
| mode = generate_config["mode"] | ||
| if mode == "atomic": | ||
| results = await traverse_graph_for_atomic( | ||
| self.synthesizer_llm_client, | ||
| self.tokenizer_instance, | ||
| self.graph_storage, | ||
| partition_config["method_params"], | ||
| self.text_chunks_storage, | ||
| self.progress_bar, | ||
| ) | ||
| elif mode == "multi_hop": | ||
| results = await traverse_graph_for_multi_hop( | ||
| self.synthesizer_llm_client, | ||
| self.tokenizer_instance, | ||
| self.graph_storage, | ||
| partition_config["method_params"], | ||
| self.text_chunks_storage, | ||
| self.progress_bar, | ||
| ) | ||
| elif mode == "aggregated": | ||
| results = await traverse_graph_for_aggregated( | ||
| self.synthesizer_llm_client, | ||
| self.tokenizer_instance, | ||
| self.graph_storage, | ||
| partition_config["method_params"], | ||
| self.text_chunks_storage, | ||
| self.progress_bar, | ||
| ) | ||
| elif mode == "cot": | ||
| results = await generate_cot( | ||
| self.graph_storage, | ||
| self.synthesizer_llm_client, | ||
| method_params=partition_config["method_params"], | ||
| ) | ||
| else: | ||
| raise ValueError(f"Unknown generation mode: {mode}") | ||
| # Step 2: generate QA pairs | ||
| # TODO | ||
| batches = await partition_kg( | ||
| self.graph_storage, self.tokenizer_instance, partition_config | ||
| ) | ||
|
|
||
| # Step 3: format | ||
| results = format_generation_results( | ||
| results, output_data_format=generate_config["data_format"] | ||
| # Step 2: generate QA pairs | ||
| results = await generate_qas( | ||
| self.synthesizer_llm_client, batches, generate_config | ||
| ) | ||
|
|
||
| if not results: | ||
| logger.warning("No QA pairs generated") | ||
| return | ||
|
|
||
| # Step 3: store the generated QA pairs | ||
| await self.qa_storage.upsert(results) | ||
| await self.qa_storage.index_done_callback() | ||
|
|
||
|
|
||
The type hint for batch here (tuple[list[tuple[str, dict]], list[tuple[Any, Any, dict]]]) is slightly different from the batch parameter in the generate method (line 29), which allows tuple[Any, Any, Any] in the second list's elements. Since generate calls build_prompt with its batch, these type hints should be consistent to avoid potential type checking issues or runtime errors if build_prompt receives a batch structure it doesn't explicitly declare it handles.