Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions graphgen/operators/read/read.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ def _build_reader(suffix: str, cache_dir: str | None, **reader_kwargs):
def read(
input_path: Union[str, List[str]],
allowed_suffix: Optional[List[str]] = None,
cache_dir: Optional[str] = "cache",
working_dir: Optional[str] = "cache",
parallelism: int = 4,
recursive: bool = True,
**reader_kwargs: Any,
Expand All @@ -60,7 +60,7 @@ def read(

:param input_path: File or directory path(s) to read from
:param allowed_suffix: List of allowed file suffixes (e.g., ['pdf', 'txt'])
:param cache_dir: Directory to cache intermediate files (PDF processing)
:param working_dir: Directory to cache intermediate files (PDF processing)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The docstring for working_dir mentions only "PDF processing". However, this directory is also used by ParallelFileScanner for caching file scan results. To make the documentation more accurate, I suggest updating it to reflect its broader usage.

Suggested change
:param working_dir: Directory to cache intermediate files (PDF processing)
:param working_dir: Directory for caching intermediate files (e.g., for file scanning and PDF processing)

:param parallelism: Number of parallel workers
:param recursive: Whether to scan directories recursively
:param reader_kwargs: Additional kwargs passed to readers
Expand All @@ -70,7 +70,7 @@ def read(
# 1. Scan all paths to discover files
logger.info("[READ] Scanning paths: %s", input_path)
scanner = ParallelFileScanner(
cache_dir=cache_dir,
cache_dir=working_dir,
allowed_suffix=allowed_suffix,
rescan=False,
max_workers=parallelism if parallelism > 0 else 1,
Expand Down Expand Up @@ -100,7 +100,7 @@ def read(
# 3. Create read tasks
read_tasks = []
for suffix, file_paths in files_by_suffix.items():
reader = _build_reader(suffix, cache_dir, **reader_kwargs)
reader = _build_reader(suffix, working_dir, **reader_kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While you've correctly passed working_dir to _build_reader, the parameter in the _build_reader function definition is still named cache_dir. For better consistency and to complete the refactoring within this file, consider renaming the cache_dir parameter to working_dir in the _build_reader function signature and its body as well. This would improve internal consistency and readability.

ds = reader.read(file_paths)
read_tasks.append(ds)

Expand Down