refactor: refactor pipeline engine using ray data #110
Conversation
/gemini summary

Hi @ChenZiHong-Gavin, it looks like you've tried to invoke the …

/gemini summary

/gemini review
Summary of Changes

This pull request introduces a major architectural overhaul to the GraphGen project, transitioning its pipeline execution from a custom threading model to a robust, distributed framework built on Ray Data. This refactoring is designed to enhance the system's scalability, resource efficiency, and maintainability by standardizing operations around Ray's data processing primitives. Key changes include the introduction of a …

Highlights
Changelog
Activity
Code Review
This is an impressive and substantial refactoring of the pipeline execution engine. Migrating from a custom threading-based system to a Ray Data-native framework is a significant improvement for scalability, maintainability, and robustness. The new architecture with a declarative YAML configuration, a Ray Data-based engine, and modular operators is well-designed. The use of Ray actors for managing stateful services like LLMs and storage is a solid pattern for distributed execution.
My review includes a few suggestions to further improve robustness and performance. Specifically, I've commented on improving error handling in data processing, making logging more consistent in storage classes, and optimizing database write patterns.
Overall, this is an excellent piece of work that modernizes the project's core infrastructure.
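For readers unfamiliar with the actor pattern the review praises, here is a minimal sketch of how a stateful service can be held in a Ray actor. The class name, constructor, and `complete` method are illustrative stand-ins, not the PR's actual interfaces:

```python
import ray

@ray.remote
class LLMService:
    """Hypothetical actor that holds an expensive client once per process."""

    def __init__(self, model_name: str):
        # A real implementation would construct the LLM client here.
        self.model_name = model_name

    def complete(self, prompt: str) -> str:
        # A real implementation would forward to the wrapped client.
        return f"[{self.model_name}] response to: {prompt!r}"

# One long-lived actor can then be shared by many pipeline tasks:
# service = LLMService.remote("my-model")
# result = ray.get(service.complete.remote("hello"))
```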
```python
def _should_keep_item(self, item: Dict[str, Any]) -> bool:
    """
    Determine whether to keep the given item based on the text column.
    Filter out entries with empty or missing text in the specified column.
    :param item: Dictionary representing a data entry.
    :return: True if the item should be kept, False otherwise.
    """
    item_type = item.get("type")
    assert item_type in [
        "text",
        "image",
        "table",
        "equation",
        "protein",
    ], f"Unsupported item type: {item_type}"
    if item_type == "text":
        content = item.get(self.text_column, "").strip()
        return bool(content)
    return True
```
In _should_keep_item, using assert to validate item_type will crash the Ray worker on unsupported types. It's more robust to log a warning and return False to filter out invalid items without halting the pipeline.
```python
def _should_keep_item(self, item: Dict[str, Any]) -> bool:
    """
    Determine whether to keep the given item based on the text column.
    :param item: Dictionary representing a data entry.
    :return: True if the item should be kept, False otherwise.
    """
    item_type = item.get("type")
    if item_type not in {
        "text",
        "image",
        "table",
        "equation",
        "protein",
    }:
        # Consider logging a warning for visibility on data quality issues.
        return False
    if item_type == "text":
        content = item.get(self.text_column, "").strip()
        return bool(content)
    return True
```

```diff
     self._graph.nodes[node_id].update(node_data)
 else:
-    logger.warning("Node %s not found in the graph for update.", node_id)
+    print(f"Node {node_id} not found in the graph for update.")
```
Replacing logger.warning with print is a regression in logging practices. print bypasses logging configuration (levels, formatters, handlers), making it difficult to control log output in different environments. It's better to use the standard logging module. Since the context-aware logger might not be available in this class, importing logging and using logging.warning(...) would be a more robust solution. This feedback applies to other print statements in this file and other storage classes (e.g., kuzu_storage.py, json_storage.py).
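A minimal sketch of that fix, assuming a plain module-level logger (the helper function is made up for illustration):

```python
import logging

logger = logging.getLogger(__name__)

def _warn_missing_node(node_id: str) -> None:
    # Unlike print(), this respects whatever levels, formatters, and
    # handlers the application has configured for the logging module.
    logger.warning("Node %s not found in the graph for update.", node_id)
```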
```python
except Exception as e:  # pylint: disable=broad-except
    logger.error("Error in judging description: %s", e)
    logger.info("Use default loss 0.1")
    item["loss"] = -math.log(0.1)
return item
```
Catching a broad Exception can hide bugs and make debugging difficult. If the LLM client library defines more specific exceptions (e.g., for API errors, timeouts, or content filtering), it's better to catch those explicitly. This allows for more granular error handling and reporting. If specific exceptions are not available, consider using logger.exception() to automatically include traceback information in the log, which is very helpful for debugging.
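A sketch of that suggestion; `judge_fn` is a hypothetical callable wrapping the LLM call, and the specific exception types are placeholders for whatever the actual client library raises:

```python
import logging
import math

logger = logging.getLogger(__name__)

def judge_with_fallback(judge_fn, item):
    try:
        item["loss"] = judge_fn(item)
    except (TimeoutError, ConnectionError) as e:
        # Known, recoverable failure modes: log with context, use the default.
        logger.error("Error in judging description: %s", e)
        item["loss"] = -math.log(0.1)
    except Exception:  # pylint: disable=broad-except
        # Unexpected bug: logger.exception() records the full traceback.
        logger.exception("Unexpected failure in judging description")
        item["loss"] = -math.log(0.1)
    return item
```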
```python
def _pre_tokenize(self) -> None:
    """Pre-tokenize all nodes and edges to add token length information."""
    logger.info("Starting pre-tokenization of nodes and edges...")

    nodes = self.kg_instance.get_all_nodes()
    edges = self.kg_instance.get_all_edges()

    # Process nodes
    for node_id, node_data in nodes:
        if "length" not in node_data:
            try:
                description = node_data.get("description", "")
                tokens = self.tokenizer_instance.encode(description)
                node_data["length"] = len(tokens)
                self.kg_instance.update_node(node_id, node_data)
            except Exception as e:
                logger.warning("Failed to tokenize node %s: %s", node_id, e)
                node_data["length"] = 0

    # Process edges
    for u, v, edge_data in edges:
        if "length" not in edge_data:
            try:
                description = edge_data.get("description", "")
                tokens = self.tokenizer_instance.encode(description)
                edge_data["length"] = len(tokens)
                self.kg_instance.update_edge(u, v, edge_data)
            except Exception as e:
                logger.warning("Failed to tokenize edge %s-%s: %s", u, v, e)
                edge_data["length"] = 0

    # Persist changes
    self.kg_instance.index_done_callback()
    logger.info("Pre-tokenization completed.")
```
The _pre_tokenize method updates nodes and edges one by one inside loops. This can lead to a large number of individual write operations to the storage backend, which is inefficient, especially for database-backed storages like KuzuDB. It would be more performant to batch these updates. Consider collecting all node and edge updates into lists and then performing a bulk update operation if the storage backend supports it. This might require adding bulk update methods to the BaseGraphStorage interface for better performance.
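A rough sketch of the batching idea for nodes (edges would follow the same pattern). Note that `update_nodes_batch` is a hypothetical bulk method that would have to be added to BaseGraphStorage:

```python
from typing import Any, Dict, Iterable, List, Tuple

def pre_tokenize_nodes_batched(
    kg_instance, tokenizer, nodes: Iterable[Tuple[str, Dict[str, Any]]]
) -> None:
    updates: List[Tuple[str, Dict[str, Any]]] = []
    for node_id, node_data in nodes:
        if "length" not in node_data:
            description = node_data.get("description", "")
            node_data["length"] = len(tokenizer.encode(description))
            updates.append((node_id, node_data))
    # One bulk write instead of one write per node (hypothetical API).
    kg_instance.update_nodes_batch(updates)
```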
* feat: add config and operator node types
* refactor: refactor readers with ray data
* fix: delete param parallelism for readers
* fix: fix import error
* refactor read and chunk operators with no side effects
* fix: fix import error
* fix: fix return logic
* refactor: rename operator split to chunk
* refactor: refactor build_kg to accomodate ray data
* feat: add StorageFactory & global params
* refactor: refactor quiz to accomodata ray data engine
* fix: reload graph before quizzing
* Potential fix for pull request finding 'Unreachable code' (Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>)
* fix: fix quiz params
* refactor: refactor quiz&judge to ray actors
* fix: fix transferring quizzed data to JudgeService
* refactor: refactor partition to accomodate ray data
* fix: fix lint problem
* refactor: refactor op generate
* feat: write results in output folder
* fix: raise error when no dataset is created
* fix: return generator in ece_partitioner
* fix: return generator in ece_partitioner
* refactor: refactor data format to support multi-modal input
* fix: delete fetching schema to avoid ray's duplicate execution
* fix: fix operators' registry
* feat: refactor schema_guided_extraction & add examples
* feat: seperate ray logs and service logs
* feat: use storage actor
* feat: add kuzu graph database
* feat: add llm as actors
* refactor: delete old runner
* fix: fix vllm wrapper
* docs: update .env.example
* fix: use kuzudb in quiz_service
* fix: update webui
* feat: make storage backend configuragble
* docs: update README

---------

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
This PR fundamentally refactors GraphGen's pipeline execution engine from a custom threading-based orchestration system to a Ray Data-native distributed processing framework. The changes improve scalability, resource management, and maintainability while preserving the core pipeline semantics.
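For context, a minimal sketch of the Ray Data primitives such a pipeline builds on. The data and transforms here are toy examples, not the PR's actual operators or schemas:

```python
import ray

# Toy records standing in for reader output.
ds = ray.data.from_items(
    [{"type": "text", "content": "alpha"}, {"type": "text", "content": " "}]
)

# Declarative, distributed transforms replace hand-rolled thread pools.
ds = ds.filter(lambda item: bool(item["content"].strip()))
ds = ds.map(lambda item: {**item, "length": len(item["content"])})

print(ds.take_all())  # [{'type': 'text', 'content': 'alpha', 'length': 5}]
```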