Conversation

@ChenZiHong-Gavin
Collaborator

@ChenZiHong-Gavin commented Dec 3, 2025

This PR fundamentally refactors GraphGen's pipeline execution engine from a custom threading-based orchestration system to a Ray Data-native distributed processing framework. The changes improve scalability, resource management, and maintainability while preserving the core pipeline semantics.
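
For readers who have not used Ray Data before, here is a minimal, self-contained sketch of the dataset-transformation style the new engine builds on. It is illustration only: the field names and the toy chunking logic below are invented and do not correspond to GraphGen's actual operators.

    # Illustration only -- not GraphGen code; schema and chunking logic are made up.
    import ray

    ray.init(ignore_reinit_error=True)

    docs = ray.data.from_items(
        [
            {"type": "text", "content": "GraphGen builds knowledge graphs from raw corpora."},
            {"type": "text", "content": "   "},  # whitespace-only row, will be filtered out
        ]
    )

    def keep_non_empty(row: dict) -> bool:
        # Drop rows whose text content is empty or whitespace-only.
        return bool(row["content"].strip())

    def chunk(row: dict) -> list:
        # Toy fixed-width chunking, standing in for the real splitter logic.
        text = row["content"]
        return [{"type": "text", "content": text[i : i + 32]} for i in range(0, len(text), 32)]

    chunks = docs.filter(keep_non_empty).flat_map(chunk)
    print(chunks.take_all())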

@ChenZiHong-Gavin changed the title from "feat: add config and operator node types" to "refactor: refactor pipeline engine using ray data" on Dec 3, 2025
ChenZiHong-Gavin and others added 4 commits December 5, 2025 14:13
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
@ChenZiHong-Gavin marked this pull request as ready for review December 12, 2025 07:26
@ChenZiHong-Gavin
Collaborator Author

/gemini summary
/gemini review

@gemini-code-assist
Contributor

Hi @ChenZiHong-Gavin, it looks like you've tried to invoke the /gemini summary and /gemini review commands within this comment. Please note that these commands need to be posted as separate, top-level comments in the pull request conversation to be processed correctly. I am unable to execute them when they are part of a larger comment body.

@ChenZiHong-Gavin
Collaborator Author

/gemini summary

@ChenZiHong-Gavin
Collaborator Author

/gemini review

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces a major architectural overhaul to the GraphGen project, transitioning its pipeline execution from a custom threading model to a robust, distributed framework built on Ray Data. This refactoring is designed to enhance the system's scalability, resource efficiency, and maintainability by standardizing operations around Ray's data processing primitives. Key changes include the introduction of a BaseOperator for all pipeline steps, the migration of existing operations to leverage Ray Data, and the centralization of LLM and storage initialization through Ray actors. The configuration schema has also been updated to define dataflow graphs, and the project's file structure has been reorganized to align with the new architecture.

Highlights

  • Pipeline Engine Refactoring: The core pipeline execution engine has been fundamentally refactored to utilize Ray Data, moving away from a custom threading-based orchestration system. This change aims to significantly improve scalability, resource management, and overall maintainability.
  • Introduction of BaseOperator: A new BaseOperator class has been introduced, serving as the foundational interface for all pipeline operations. This standardization facilitates seamless integration with Ray Data's distributed processing capabilities.
  • Migration to Ray Data: All individual pipeline steps, including reading, chunking, knowledge graph building, quizzing, judging, extraction, partitioning, and generation, have been adapted to leverage Ray Data primitives like map_batches, flat_map, filter, and aggregate for distributed execution.
  • Centralized LLM and Storage Management: New init_llm and init_storage functions are now available in graphgen.common. These functions manage the initialization of LLM wrappers and storage backends (including new support for KuzuDB and RocksDB) as Ray actors, enabling efficient distributed access across the pipeline.
  • Updated Configuration Structure: The pipeline configuration (.yaml files) has been updated to reflect the new Ray Data-native approach. Configurations now define nodes with explicit id, op_name, type, dependencies, and execution_params to clearly outline the dataflow graph (see the sketch after this list).
  • Project Structure Reorganization: Numerous example scripts and configuration files have been relocated from scripts/ and resources/ to a new examples/ directory. The graphgen/configs/ directory has been removed, streamlining the project structure and consolidating examples.
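
Here is the sketch referenced above: a hypothetical reading of the new node/config shape, based only on the field names listed in the highlights (id, op_name, type, dependencies, execution_params) and the Node/Config Pydantic models mentioned in the changelog. Field types and defaults are assumptions, not the actual graphgen.bases.datatypes definitions.

    # Hypothetical sketch; not the real graphgen.bases.datatypes models.
    from typing import Any, Dict, List

    from pydantic import BaseModel, Field

    class Node(BaseModel):
        id: str
        op_name: str                 # which operator/service to run, e.g. "chunk"
        type: str                    # Ray Data op: "map_batches", "flat_map", "filter", "aggregate"
        dependencies: List[str] = Field(default_factory=list)
        execution_params: Dict[str, Any] = Field(default_factory=dict)

    class Config(BaseModel):
        nodes: List[Node]

    # A two-node dataflow graph: read the inputs, then chunk them.
    cfg = Config(
        nodes=[
            Node(id="read", op_name="read", type="map_batches"),
            Node(id="chunk", op_name="chunk", type="flat_map", dependencies=["read"]),
        ]
    )
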
Changelog
  • baselines/BDS/bds.py
    • Updated import path for init_llm from graphgen.operators to graphgen.common.
    • Simplified init_llm call within the BDS class constructor.
  • examples/extract/extract_schema_guided/README.md
    • New file added, providing documentation for schema-guided extraction.
  • examples/extract/extract_schema_guided/extract_schema_guided.sh
    • New file added, containing a shell script to run schema-guided extraction.
  • examples/extract/extract_schema_guided/schema_guided_extraction_config.yaml
    • New file added, defining the configuration for schema-guided extraction using the new pipeline structure.
  • examples/generate/generate_aggregated_qa/README.md
    • New file added, providing documentation for aggregated QA generation.
  • examples/generate/generate_aggregated_qa/aggregated_config.yaml
    • New file added, defining the configuration for aggregated QA generation using the new pipeline structure.
  • examples/generate/generate_aggregated_qa/generate_aggregated.sh
    • New file added, containing a shell script to run aggregated QA generation.
  • examples/generate/generate_atomic_qa/README.md
    • New file added, providing documentation for atomic QA generation.
  • examples/generate/generate_atomic_qa/atomic_config.yaml
    • New file added, defining the configuration for atomic QA generation using the new pipeline structure.
  • examples/generate/generate_atomic_qa/generate_atomic.sh
    • New file added, containing a shell script to run atomic QA generation.
  • examples/generate/generate_cot_qa/README.md
    • New file added, providing documentation for CoT QA generation.
  • examples/generate/generate_cot_qa/cot_config.yaml
    • New file added, defining the configuration for CoT QA generation using the new pipeline structure.
  • examples/generate/generate_cot_qa/generate_cot.sh
    • New file added, containing a shell script to run CoT QA generation.
  • examples/generate/generate_multi_hop_qa/README.md
    • New file added, providing documentation for multi-hop QA generation.
  • examples/generate/generate_multi_hop_qa/generate_multi_hop.sh
    • New file added, containing a shell script to run multi-hop QA generation.
  • examples/generate/generate_multi_hop_qa/multi_hop_config.yaml
    • New file added, defining the configuration for multi-hop QA generation using the new pipeline structure.
  • examples/generate/generate_vqa/README.md
    • New file added, providing documentation for VQA generation.
  • examples/generate/generate_vqa/generate_vqa.sh
    • New file added, containing a shell script to run VQA generation.
  • examples/generate/generate_vqa/vqa_config.yaml
    • New file added, defining the configuration for VQA generation using the new pipeline structure.
  • graphgen/bases/__init__.py
    • Added BaseOperator import.
    • Removed BaseListStorage import.
    • Added Config and Node to datatypes import.
  • graphgen/bases/base_llm_wrapper.py
    • Removed shutdown and restart methods, as LLM services are now managed as Ray actors.
  • graphgen/bases/base_operator.py
    • New file added, defining the abstract BaseOperator class for all pipeline operations, including Ray Data integration and context-aware logging.
  • graphgen/bases/base_partitioner.py
    • Changed partition method from asynchronous to synchronous.
    • Modified community2batch to process a single community and operate synchronously.
  • graphgen/bases/base_reader.py
    • Updated read method signature to return ray.data.Dataset and accept Union[str, List[str]] for Ray Data compatibility.
    • Replaced the static filter method with instance methods _should_keep_item and _validate_batch for data validation and filtering within Ray Data.
    • Added modalities parameter to the constructor.
  • graphgen/bases/base_splitter.py
    • Updated logger import path.
    • Changed terminology from 'split' to 'chunk' in docstrings for clarity.
  • graphgen/bases/base_storage.py
    • Removed BaseListStorage class.
    • Added reload method to BaseKVStorage and BaseGraphStorage interfaces.
  • graphgen/bases/datatypes.py
    • Added Node and Config Pydantic models for defining pipeline nodes and overall configurations.
  • graphgen/common/__init__.py
    • New file added, consolidating init_llm and init_storage imports.
  • graphgen/common/init_llm.py
    • New file added, providing LLMServiceActor, LLMServiceProxy, and LLMFactory for distributed LLM management as Ray actors.
  • graphgen/common/init_storage.py
    • New file added, providing KVStorageActor, GraphStorageActor, RemoteKVStorageProxy, RemoteGraphStorageProxy, and StorageFactory for distributed storage management as Ray actors.
  • graphgen/configs/README.md
    • File removed due to restructuring of configurations into examples/.
  • graphgen/configs/__init__.py
    • File renamed to graphgen/models/storage/graph/__init__.py.
  • graphgen/configs/aggregated_config.yaml
    • File removed due to restructuring of configurations into examples/.
  • graphgen/configs/atomic_config.yaml
    • File removed due to restructuring of configurations into examples/.
  • graphgen/configs/cot_config.yaml
    • File removed due to restructuring of configurations into examples/.
  • graphgen/configs/multi_hop_config.yaml
    • File removed due to restructuring of configurations into examples/.
  • graphgen/configs/schema_guided_extraction_config.yaml
    • File removed due to restructuring of configurations into examples/.
  • graphgen/configs/search_dna_config.yaml
    • File renamed to examples/search/search_dna_config.yaml.
  • graphgen/configs/search_protein_config.yaml
    • File renamed to examples/search/search_protein_config.yaml.
  • graphgen/configs/search_rna_config.yaml
    • File renamed to examples/search/search_rna_config.yaml.
  • graphgen/configs/vqa_config.yaml
    • File removed due to restructuring of configurations into examples/.
  • graphgen/engine.py
    • Old Context and OpNode classes removed.
    • The Engine class has been completely refactored to orchestrate Ray Data pipelines, including topological sorting of nodes and execution of Ray Data operations (map_batches, filter, flat_map, aggregate).
  • graphgen/evaluate.py
    • File renamed to graphgen/operators/evaluate/evaluate.py.
    • Updated imports for LengthEvaluator, MTLDEvaluator, RewardEvaluator, UniEvaluator to reflect their new location in graphgen.models.
  • graphgen/graphgen.py
    • File removed, as its functionality is replaced by the new Ray Data-native Engine and operator services.
  • graphgen/models/__init__.py
    • Removed JSONLReader import.
    • Added KuzuStorage and RocksDBKVStorage imports.
    • Removed JsonListStorage import.
  • graphgen/models/extractor/schema_guided_extractor.py
    • Adjusted how _chunk_id and text are accessed from the input chunk dictionary.
    • Changed merge_extractions method from asynchronous to synchronous.
  • graphgen/models/generator/vqa_generator.py
    • Updated the key for accessing image data from 'images' to 'image_data' in node properties.
  • graphgen/models/llm/local/sglang_wrapper.py
    • Removed shutdown and restart methods, as LLM services are now managed externally as Ray actors.
  • graphgen/models/partitioner/anchor_bfs_partitioner.py
    • Changed partition, _pick_anchor_ids, and _grow_community methods from asynchronous to synchronous.
    • Modified partition to yield Community objects iteratively instead of returning a complete list.
  • graphgen/models/partitioner/bfs_partitioner.py
    • Changed partition method from asynchronous to synchronous.
    • Modified partition to yield Community objects iteratively instead of returning a complete list.
  • graphgen/models/partitioner/dfs_partitioner.py
    • Changed partition method from asynchronous to synchronous.
    • Modified partition to yield Community objects iteratively instead of returning a complete list.
  • graphgen/models/partitioner/ece_partitioner.py
    • Updated imports (tqdm.asyncio to tqdm, asyncio removed).
    • Changed partition and _grow_community methods from asynchronous to synchronous.
    • Modified partition to yield Community objects iteratively instead of returning a complete list.
  • graphgen/models/partitioner/leiden_partitioner.py
    • Changed partition, _run_leiden, and _split_communities methods from asynchronous to synchronous.
  • graphgen/models/reader/__init__.py
    • Removed JSONLReader import.
  • graphgen/models/reader/csv_reader.py
    • Updated read method to use ray.data.read_csv and integrate _validate_batch and _should_keep_item for Ray Data processing.
    • Changed return type to ray.data.Dataset.
  • graphgen/models/reader/json_reader.py
    • Refactored read method to handle both JSON and JSONL formats, utilize ray.data.read_json, and integrate _validate_batch and _should_keep_item.
    • Added a static method _unify_schema for consistent JSON data processing.
    • Changed return type to ray.data.Dataset.
  • graphgen/models/reader/jsonl_reader.py
    • File removed, as its functionality is now integrated into json_reader.py.
  • graphgen/models/reader/parquet_reader.py
    • Updated read method to use ray.data.read_parquet and integrate _validate_batch and _should_keep_item.
    • Changed return type to ray.data.Dataset.
  • graphgen/models/reader/pdf_reader.py
    • Updated read method to use Ray Data for parallel processing of PDF files.
    • Changed return type to ray.data.Dataset.
    • Modified _try_load_cached_result to correctly adjust file paths and content.
  • graphgen/models/reader/pickle_reader.py
    • Updated read method to use ray.data.read_binary_files for pickle files, including deserialization and validation logic.
    • Changed return type to ray.data.Dataset.
  • graphgen/models/reader/rdf_reader.py
    • Updated read method to use Ray Data for parallel processing of RDF files.
    • Changed return type to ray.data.Dataset.
    • Added _parse_rdf_file for single file processing and improved error handling.
  • graphgen/models/reader/txt_reader.py
    • Updated read method to use ray.data.read_binary_files for text files.
    • Changed return type to ray.data.Dataset.
  • graphgen/models/splitter/character_splitter.py
    • Changed terminology from 'split' to 'chunk' in docstring.
  • graphgen/models/splitter/markdown_splitter.py
    • Changed terminology from 'split' to 'chunk' in docstrings.
  • graphgen/models/splitter/recursive_character_splitter.py
    • Changed terminology from 'split' to 'chunk' in docstrings.
  • graphgen/models/storage/graph/kuzu_storage.py
    • New file added, implementing KuzuStorage as a BaseGraphStorage backend using KuzuDB for graph persistence.
  • graphgen/models/storage/json_storage.py
    • File renamed to graphgen/models/storage/kv/json_storage.py.
    • Removed BaseListStorage import and the JsonListStorage class.
    • Removed logger imports and replaced logger.info calls with print statements.
    • Added a reload method to re-load data from the JSON file.
  • graphgen/models/storage/kv/rocksdb_storage.py
    • New file added, implementing RocksDBKVStorage as a BaseKVStorage backend using RocksDB for key-value persistence.
  • graphgen/models/storage/networkx_storage.py
    • File renamed to graphgen/models/storage/graph/networkx_storage.py.
    • Removed logger imports and replaced logger.info/logger.warning calls with print statements.
    • Added a reload method to re-initialize the graph from its GraphML file.
  • graphgen/operators/__init__.py
    • Replaced individual operator imports with new service classes (e.g., BuildKGService, ChunkService).
    • Introduced an operators dictionary mapping operator names to their respective service classes or functions for dynamic lookup.
  • graphgen/operators/build_kg/__init__.py
    • Updated import to BuildKGService.
  • graphgen/operators/build_kg/build_kg.py
    • File removed, replaced by build_kg_service.py.
  • graphgen/operators/build_kg/build_kg_service.py
    • New file added, implementing BuildKGService as a BaseOperator for building knowledge graphs, integrating LLM and storage initialization.
  • graphgen/operators/build_kg/build_mm_kg.py
    • Removed gradio import and progress_bar parameter from function signature.
    • Changed build_mm_kg and internal run_concurrent calls from asynchronous to synchronous.
  • graphgen/operators/build_kg/build_text_kg.py
    • Removed gradio import and progress_bar parameter from function signature.
    • Changed build_text_kg and internal run_concurrent calls from asynchronous to synchronous.
  • graphgen/operators/chunk/__init__.py
    • New file added, importing ChunkService.
  • graphgen/operators/chunk/chunk_service.py
    • New file added, implementing ChunkService as a BaseOperator for document chunking, including language detection and storage integration.
  • graphgen/operators/extract/__init__.py
    • Updated import to ExtractService.
  • graphgen/operators/extract/extract_info.py
    • File removed, replaced by extract_service.py.
  • graphgen/operators/extract/extract_service.py
    • New file added, implementing ExtractService as a BaseOperator for information extraction, supporting schema-guided methods.
  • graphgen/operators/generate/__init__.py
    • Updated import to GenerateService.
  • graphgen/operators/generate/generate_qas.py
    • File removed, replaced by generate_service.py.
  • graphgen/operators/generate/generate_service.py
    • New file added, implementing GenerateService as a BaseOperator for QA generation, supporting various methods like atomic, aggregated, multi-hop, CoT, and VQA.
  • graphgen/operators/init/__init__.py
    • File removed, as init_llm is now in graphgen.common.
  • graphgen/operators/init/init_llm.py
    • File removed, as its functionality is now in graphgen/common/init_llm.py.
  • graphgen/operators/judge/__init__.py
    • New file added, importing JudgeService.
  • graphgen/operators/judge/judge_service.py
    • New file added, implementing JudgeService as a BaseOperator for judging graph elements and computing comprehension loss using a trainee LLM.
  • graphgen/operators/partition/__init__.py
    • Updated import to PartitionService.
  • graphgen/operators/partition/partition_kg.py
    • File removed, replaced by partition_service.py.
  • graphgen/operators/partition/partition_service.py
    • New file added, implementing PartitionService as a BaseOperator for partitioning knowledge graphs using various methods (BFS, DFS, ECE, Leiden, Anchor BFS). Includes pre-tokenization and data attachment logic.
  • graphgen/operators/partition/pre_tokenize.py
    • File removed, as its functionality is now integrated into partition_service.py.
  • graphgen/operators/quiz/__init__.py
    • New file added, importing QuizService.
  • graphgen/operators/quiz/quiz_service.py
    • New file added, implementing QuizService as a BaseOperator for generating quizzes based on graph elements and storing them.
  • graphgen/operators/quiz_and_judge/__init__.py
    • File removed, as quiz and judge operations are now separate services.
  • graphgen/operators/quiz_and_judge/judge.py
    • File removed, replaced by judge_service.py.
  • graphgen/operators/quiz_and_judge/quiz.py
    • File removed, replaced by quiz_service.py.
  • graphgen/operators/read/__init__.py
    • Updated import to read.
  • graphgen/operators/read/parallel_file_scanner.py
    • Updated RocksDBCache path.
    • Removed explicit logger calls, relying on the new context-aware logger.
    • Removed logger.warning/logger.error/logger.info calls, relying on the new context-aware logger.
  • graphgen/operators/read/read.py
    • New file added, providing a unified read function that leverages ParallelFileScanner and various BaseReader implementations to read files into Ray Data.
  • graphgen/operators/read/read_files.py
    • File removed, replaced by read.py.
  • graphgen/operators/split/__init__.py
    • File removed, as chunking is now handled by chunk_service.py.
  • graphgen/operators/split/split_chunks.py
    • File removed, replaced by chunk_service.py.
  • graphgen/operators/storage.py
    • File removed, as storage management is now handled by graphgen/common/init_storage.py and individual operator services.
  • graphgen/run.py
    • Updated config_file default path using importlib.resources for better package resource handling.
    • Removed imports for Context, GraphGen, and collect_ops.
    • Imported operators from graphgen.operators.
    • Refactored the main function to use the new Engine and Ray Data for pipeline execution and output.
    • Introduced NodeFilenameProvider for structured output of Ray Data results.
    • Updated logger initialization to use CURRENT_LOGGER_VAR for context-aware logging.
  • graphgen/utils/__init__.py
    • Updated log import to include CURRENT_LOGGER_VAR.
  • graphgen/utils/log.py
    • Refactored set_logger for more flexible logger initialization with custom names and force re-initialization.
    • Introduced CURRENT_LOGGER_VAR (a contextvars.ContextVar) and ContextAwareLogger to manage logger instances in a Ray-compatible, context-aware manner.
    • Replaced the global logger instance with an instance of ContextAwareLogger.
  • graphgen/utils/run_concurrent.py
    • Removed gradio import and progress_bar parameter from function signature.
    • Refactored run_concurrent to be synchronous, internally managing its own asyncio event loop.
  • requirements.txt
    • Added pydantic for data validation and settings management.
    • Added ray==2.52.1 to support the new distributed execution framework.
  • resources/input_examples/csv_demo.csv
    • File renamed to examples/input_examples/csv_demo.csv.
  • resources/input_examples/extract_demo.txt
    • File renamed to examples/input_examples/extract_demo.txt.
  • resources/input_examples/graphml_demo.graphml
    • File renamed to examples/input_examples/graphml_demo.graphml.
  • resources/input_examples/json_demo.json
    • File renamed to examples/input_examples/json_demo.json.
  • resources/input_examples/jsonl_demo.jsonl
    • File renamed to examples/input_examples/jsonl_demo.jsonl.
  • resources/input_examples/search_dna_demo.jsonl
    • File renamed to examples/input_examples/search_dna_demo.jsonl.
  • resources/input_examples/search_protein_demo.jsonl
    • File renamed to examples/input_examples/search_protein_demo.jsonl.
  • resources/input_examples/search_rna_demo.jsonl
    • File renamed to examples/input_examples/search_rna_demo.jsonl.
  • resources/input_examples/txt_demo.txt
    • File renamed to examples/input_examples/txt_demo.txt.
  • resources/input_examples/vqa_demo.json
    • File renamed to examples/input_examples/vqa_demo.json.
    • Modified image and table content structure to nest img_path, image_caption, table_caption, table_footnote, and table_body under a content key for consistency.
  • resources/output_examples/aggregated_chatml.json
    • File renamed to examples/output_examples/aggregated_chatml.json.
  • resources/output_examples/atomic_alpaca.json
    • File renamed to examples/output_examples/atomic_alpaca.json.
  • resources/output_examples/cot_sharegpt.json
    • File renamed to examples/output_examples/cot_sharegpt.json.
  • resources/output_examples/multi-hop_chatml.json
    • File renamed to examples/output_examples/multi-hop_chatml.json.
  • scripts/baselines/generate_all_baselines.sh
    • File renamed to examples/baselines/generate_all_baselines.sh.
  • scripts/baselines/generate_bds.sh
    • File renamed to examples/baselines/generate_bds.sh.
  • scripts/baselines/generate_entigraph.sh
    • File renamed to examples/baselines/generate_entigraph.sh.
  • scripts/baselines/generate_genie.sh
    • File renamed to examples/baselines/generate_genie.sh.
  • scripts/baselines/generate_longform.sh
    • File renamed to examples/baselines/generate_longform.sh.
  • scripts/baselines/generate_selfqa.sh
    • File renamed to examples/baselines/generate_selfqa.sh.
  • scripts/baselines/generate_wrap.sh
    • File renamed to examples/baselines/generate_wrap.sh.
  • scripts/evaluate/evaluate.sh
    • File renamed to examples/evaluate/evaluate.sh.
  • scripts/extract/extract_schema_guided.sh
    • File removed, replaced by examples/extract/extract_schema_guided/extract_schema_guided.sh.
  • scripts/generate/generate_aggregated.sh
    • File removed, replaced by examples/generate/generate_aggregated_qa/generate_aggregated.sh.
  • scripts/generate/generate_atomic.sh
    • File removed, replaced by examples/generate/generate_atomic_qa/generate_atomic.sh.
  • scripts/generate/generate_cot.sh
    • File removed, replaced by examples/generate/generate_cot_qa/generate_cot.sh.
  • scripts/generate/generate_multi_hop.sh
    • File removed, replaced by examples/generate/generate_multi_hop_qa/generate_multi_hop.sh.
  • scripts/generate/generate_vqa.sh
    • File removed, replaced by examples/generate/generate_vqa/generate_vqa.sh.
  • scripts/search/build_db/build_dna_blast_db.sh
    • File renamed to examples/search/build_db/build_dna_blast_db.sh.
  • scripts/search/build_db/build_protein_blast_db.sh
    • File renamed to examples/search/build_db/build_protein_blast_db.sh.
  • scripts/search/build_db/build_rna_blast_db.sh
    • File renamed to examples/search/build_db/build_rna_blast_db.sh.
  • scripts/search/search_dna.sh
    • File renamed to examples/search/search_dna.sh.
  • scripts/search/search_rna.sh
    • File renamed to examples/search/search_rna.sh.
  • scripts/search/search_uniprot.sh
    • File renamed to examples/search/search_uniprot.sh.
Activity
  • ChenZiHong-Gavin initially requested a summary and review in a single comment.
  • gemini-code-assist[bot] clarified that /gemini summary and /gemini review commands must be posted as separate, top-level comments.
  • ChenZiHong-Gavin subsequently posted /gemini summary as a separate comment.
  • ChenZiHong-Gavin then posted /gemini review as another separate comment.
  • github-code-quality[bot] identified an issue in chunk_documents regarding mixed explicit and implicit returns, suggesting an explicit return for empty input.
  • github-code-quality[bot] found unreachable code in graphgen/engine.py related to an elif branch that was always subsumed by a preceding if.
  • github-code-quality[bot] reported multiple instances of an unsupported progress_bar keyword argument being passed to run_concurrent in graphgen/operators/build_kg/build_mm_kg.py and graphgen/operators/build_kg/build_text_kg.py.
  • github-code-quality[bot] also noted the same progress_bar issue in run_concurrent calls within graphgen/operators/partition/partition_service.py.
  • github-code-quality[bot] identified an unused import of Any in graphgen/operators/partition/partition_service.py.


@gemini-code-assist bot left a comment

Code Review

This is an impressive and substantial refactoring of the pipeline execution engine. Migrating from a custom threading-based system to a Ray Data-native framework is a significant improvement for scalability, maintainability, and robustness. The new architecture with a declarative YAML configuration, a Ray Data-based engine, and modular operators is well-designed. The use of Ray actors for managing stateful services like LLMs and storage is a solid pattern for distributed execution.

My review includes a few suggestions to further improve robustness and performance. Specifically, I've commented on improving error handling in data processing, making logging more consistent in storage classes, and optimizing database write patterns.

Overall, this is an excellent piece of work that modernizes the project's core infrastructure.
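
To make the actor-based pattern mentioned above concrete, here is a minimal, hypothetical sketch of a shared stateful key-value store behind a synchronous proxy. The class and method names are invented; GraphGen's actual KVStorageActor / RemoteKVStorageProxy in graphgen/common/init_storage.py will differ.

    # Illustration of the actor/proxy pattern only; names and methods are invented.
    import ray

    ray.init(ignore_reinit_error=True)

    @ray.remote
    class KVStoreActor:
        """One stateful store shared by every worker in the pipeline."""

        def __init__(self) -> None:
            self._data: dict = {}

        def put(self, key: str, value: str) -> None:
            self._data[key] = value

        def get(self, key: str):
            return self._data.get(key)

    class KVStoreProxy:
        """Thin wrapper so operators can call the actor like a local object."""

        def __init__(self, actor) -> None:
            self._actor = actor

        def put(self, key: str, value: str) -> None:
            ray.get(self._actor.put.remote(key, value))

        def get(self, key: str):
            return ray.get(self._actor.get.remote(key))

    # get_if_exists=True lets many workers attach to the same named actor.
    store = KVStoreProxy(KVStoreActor.options(name="kv", get_if_exists=True).remote())
    store.put("doc-1", "hello")
    print(store.get("doc-1"))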

Comment on lines +28 to +46
    def _should_keep_item(self, item: Dict[str, Any]) -> bool:
        """
        Determine whether to keep the given item based on the text column.
        Filter out entries with empty or missing text in the specified column.

        :param item: Dictionary representing a data entry.
        :return: True if the item should be kept, False otherwise.
        """
        item_type = item.get("type")
        assert item_type in [
            "text",
            "image",
            "table",
            "equation",
            "protein",
        ], f"Unsupported item type: {item_type}"
        if item_type == "text":
            content = item.get(self.text_column, "").strip()
            return bool(content)
        return True

high

In _should_keep_item, using assert to validate item_type will crash the Ray worker on unsupported types. It's more robust to log a warning and return False to filter out invalid items without halting the pipeline.

    def _should_keep_item(self, item: Dict[str, Any]) -> bool:
        """
        Determine whether to keep the given item based on the text column.

        :param item: Dictionary representing a data entry.
        :return: True if the item should be kept, False otherwise.
        """
        item_type = item.get("type")
        if item_type not in {
            "text",
            "image",
            "table",
            "equation",
            "protein",
        }:
            # Consider logging a warning for visibility on data quality issues.
            return False

        if item_type == "text":
            content = item.get(self.text_column, "").strip()
            return bool(content)
        return True

             self._graph.nodes[node_id].update(node_data)
         else:
-            logger.warning("Node %s not found in the graph for update.", node_id)
+            print(f"Node {node_id} not found in the graph for update.")

medium

Replacing logger.warning with print is a regression in logging practices. print bypasses logging configuration (levels, formatters, handlers), making it difficult to control log output in different environments. It's better to use the standard logging module. Since the context-aware logger might not be available in this class, importing logging and using logging.warning(...) would be a more robust solution. This feedback applies to other print statements in this file and other storage classes (e.g., kuzu_storage.py, json_storage.py).
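
A minimal, self-contained sketch of the suggested fallback (the class below is a stand-in, not GraphGen's NetworkXStorage; the project's context-aware logger may be preferable wherever it can be imported):

    # Sketch: use stdlib logging instead of print() so levels/handlers stay configurable.
    import logging

    import networkx as nx

    logger = logging.getLogger(__name__)

    class NetworkXStorageSketch:
        def __init__(self) -> None:
            self._graph = nx.Graph()

        def update_node(self, node_id: str, node_data: dict) -> None:
            if self._graph.has_node(node_id):
                self._graph.nodes[node_id].update(node_data)
            else:
                # Goes through whatever handlers/formatters the application configured.
                logger.warning("Node %s not found in the graph for update.", node_id)

    logging.basicConfig(level=logging.INFO)
    NetworkXStorageSketch().update_node("missing", {"length": 3})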

Comment on lines +40 to +44
    except Exception as e:  # pylint: disable=broad-except
        logger.error("Error in judging description: %s", e)
        logger.info("Use default loss 0.1")
        item["loss"] = -math.log(0.1)
    return item

medium

Catching a broad Exception can hide bugs and make debugging difficult. If the LLM client library defines more specific exceptions (e.g., for API errors, timeouts, or content filtering), it's better to catch those explicitly. This allows for more granular error handling and reporting. If specific exceptions are not available, consider using logger.exception() to automatically include traceback information in the log, which is very helpful for debugging.
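
A hedged sketch of the narrower handling described above. The built-in TimeoutError/ConnectionError stand in for whatever exception types the LLM client actually raises, and ask_llm is a placeholder callable, not a GraphGen API.

    # Sketch only; the exception types worth catching depend on the LLM client in use.
    import logging
    import math

    logger = logging.getLogger(__name__)

    def judge_item(item: dict, ask_llm) -> dict:
        try:
            item["loss"] = ask_llm(item["description"])
        except (TimeoutError, ConnectionError) as e:
            # Expected transient failures: keep the pipeline moving with a default loss.
            logger.warning("LLM call failed (%s); using default loss 0.1", e)
            item["loss"] = -math.log(0.1)
        except Exception:  # pylint: disable=broad-except
            # Unexpected failures: logger.exception records the full traceback.
            logger.exception("Unexpected error while judging description")
            item["loss"] = -math.log(0.1)
        return item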

Comment on lines +89 to +122
    def _pre_tokenize(self) -> None:
        """Pre-tokenize all nodes and edges to add token length information."""
        logger.info("Starting pre-tokenization of nodes and edges...")

        nodes = self.kg_instance.get_all_nodes()
        edges = self.kg_instance.get_all_edges()

        # Process nodes
        for node_id, node_data in nodes:
            if "length" not in node_data:
                try:
                    description = node_data.get("description", "")
                    tokens = self.tokenizer_instance.encode(description)
                    node_data["length"] = len(tokens)
                    self.kg_instance.update_node(node_id, node_data)
                except Exception as e:
                    logger.warning("Failed to tokenize node %s: %s", node_id, e)
                    node_data["length"] = 0

        # Process edges
        for u, v, edge_data in edges:
            if "length" not in edge_data:
                try:
                    description = edge_data.get("description", "")
                    tokens = self.tokenizer_instance.encode(description)
                    edge_data["length"] = len(tokens)
                    self.kg_instance.update_edge(u, v, edge_data)
                except Exception as e:
                    logger.warning("Failed to tokenize edge %s-%s: %s", u, v, e)
                    edge_data["length"] = 0

        # Persist changes
        self.kg_instance.index_done_callback()
        logger.info("Pre-tokenization completed.")

medium

The _pre_tokenize method updates nodes and edges one by one inside loops. This can lead to a large number of individual write operations to the storage backend, which is inefficient, especially for database-backed storages like KuzuDB. It would be more performant to batch these updates. Consider collecting all node and edge updates into lists and then performing a bulk update operation if the storage backend supports it. This might require adding bulk update methods to the BaseGraphStorage interface for better performance.
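
A rough sketch of that batching idea. update_nodes_batch / update_edges_batch are hypothetical bulk methods that BaseGraphStorage and its backends would need to grow; the rest mirrors the quoted method.

    # Sketch of batched pre-tokenization; the *_batch methods are hypothetical.
    def _pre_tokenize(self) -> None:
        node_updates = []
        for node_id, node_data in self.kg_instance.get_all_nodes():
            if "length" not in node_data:
                description = node_data.get("description", "")
                node_data["length"] = len(self.tokenizer_instance.encode(description))
                node_updates.append((node_id, node_data))

        edge_updates = []
        for u, v, edge_data in self.kg_instance.get_all_edges():
            if "length" not in edge_data:
                description = edge_data.get("description", "")
                edge_data["length"] = len(self.tokenizer_instance.encode(description))
                edge_updates.append((u, v, edge_data))

        # One bulk write per collection instead of one write per element.
        if node_updates:
            self.kg_instance.update_nodes_batch(node_updates)
        if edge_updates:
            self.kg_instance.update_edges_batch(edge_updates)
        self.kg_instance.index_done_callback()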

@ChenZiHong-Gavin merged commit cb04c79 into main Dec 16, 2025
4 checks passed
CHERRY-ui8 added a commit to CHERRY-ui8/GraphGen that referenced this pull request Dec 16, 2025
CHERRY-ui8 added a commit to CHERRY-ui8/GraphGen that referenced this pull request Dec 16, 2025
CHERRY-ui8 added a commit to CHERRY-ui8/GraphGen that referenced this pull request Dec 17, 2025
CHERRY-ui8 pushed a commit to CHERRY-ui8/GraphGen that referenced this pull request Dec 17, 2025
* feat: add config and operator node types

* refactor: refactor readers with ray data

* fix: delete param parallelism for readers

* fix: fix import error

* refactor read and chunk operators with no side effects

* fix: fix import error

* fix: fix return logic

* refactor: rename operator split to chunk

* refactor: refactor build_kg to accomodate ray data

* feat: add StorageFactory & global params

* refactor: refactor quiz to accomodata ray data engine

* fix: reload graph before quizzing

* Potential fix for pull request finding 'Unreachable code'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* fix: fix quiz params

* refactor: refactor quiz&judge to ray actors

* fix: fix transferring quizzed data to JudgeService

* refactor: refactor partition to accomodate ray data

* fix: fix lint problem

* refactor: refactor op generate

* feat: write results in output folder

* fix: raise error when no dataset is created

* fix: return generator in ece_partitioner

* fix: return generator in ece_partitioner

* refactor: refactor data format to support multi-modal input

* fix: delete fetching schema to avoid ray's duplicate execution

* fix: fix operators' registry

* feat: refactor schema_guided_extraction & add examples

* feat: seperate ray logs and service logs

* feat: use storage actor

* feat: add kuzu graph database

* feat: add llm as actors

* refactor: delete old runner

* fix: fix vllm wrapper

* docs: update .env.example

* fix: use kuzudb in quiz_service

* fix: update webui

* feat: make storage backend configuragble

* docs: update README

---------

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
CHERRY-ui8 added a commit to CHERRY-ui8/GraphGen that referenced this pull request Dec 17, 2025
@ChenZiHong-Gavin deleted the refactor/refactor-with-ray-data branch December 19, 2025 02:30