feat: support plain text files as prompt datasets #337

Merged
p-e-w merged 1 commit into p-e-w:master from rocker-zhang:feat/text-file-datasets
May 31, 2026

Conversation

@rocker-zhang
Contributor

Closes #98.

Revives #103 (closed for author inactivity) on top of the current load_prompts, and resolves the review feedback that was still open there.

A dataset path pointing to a plain file is now read as one prompt per line, with empty lines ignored. For text files column is ignored and split is optional; when given, it selects a subset of lines with the usual slice notation, e.g. [:400].
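For illustration only, a minimal sketch of the one-prompt-per-line behaviour described above. The helper name `read_prompt_lines` is hypothetical and is not taken from the PR. Treating whitespace-only lines as empty, and stripping surrounding whitespace from each prompt, are assumptions.

```python
import os
import tempfile


def read_prompt_lines(path: str) -> list[str]:
    """Return the non-empty lines of a text file, one prompt per line."""
    with open(path, encoding="utf-8") as file:
        # Assumption: whitespace-only lines count as empty and are dropped.
        return [line.strip() for line in file if line.strip()]


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "prompts.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write("first prompt\n\nsecond prompt\n   \nthird prompt\n")

    prompts = read_prompt_lines(path)

# A split of "[:2]" would be expected to keep the first two prompts.
subset = prompts[:2]
print(prompts)
print(subset)
```

With the sample file above, `prompts` holds the three non-empty lines and `subset` holds the first two.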

What changed relative to #103's last state:

  • text files are detected with os.path.isfile, so files without an extension work
  • split parsing is factored into one helper (get_split_slice) instead of duplicated
  • split and column are now optional in DatasetSpecification; the dataset branches raise a clear error when either is missing
  • an invalid split now raises instead of being silently ignored
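The branching described in the list above could look roughly like the sketch below. It is not the code from this PR: `DatasetSpecification` is reduced to a plain dataclass with three fields (the field name `dataset` is an assumption, and the real class has more fields, such as `commit`), and `classify_source` is a hypothetical helper that only reports which branch would run.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetSpecification:
    # Simplified stand-in for the real specification class.
    dataset: str
    split: Optional[str] = None
    column: Optional[str] = None


def classify_source(specification: DatasetSpecification) -> str:
    """Return "text file" or "dataset", raising if a dataset lacks a field."""
    if os.path.isfile(specification.dataset):
        # Any existing regular file counts, with or without an extension.
        return "text file"
    if specification.split is None:
        raise ValueError('The "split" field is required for datasets')
    if specification.column is None:
        raise ValueError('The "column" field is required for datasets')
    return "dataset"


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "prompts")  # no extension on purpose
    with open(path, "w", encoding="utf-8") as file:
        file.write("hello\n")
    text_result = classify_source(DatasetSpecification(dataset=path))

dataset_result = classify_source(
    DatasetSpecification(dataset="some/dataset", split="train[:400]", column="text")
)
print(text_result, dataset_result)
```

Because the file check comes first, a text file never needs `split` or `column`, while a dataset missing either one fails early with a message naming the field.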

A bare slice such as [:400] does not parse with the pinned datasets version, because ReadInstruction.from_spec expects a named split. The text branch prepends a synthetic split name so the documented notation works. I checked [:400], [1:3], [:50%], negative indices, and that malformed slices raise.
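To make the workaround concrete without depending on the datasets package, here is a sketch built around a stand-in parser. The regular expression and `get_slice` below are not `ReadInstruction`; they only imitate a parser that insists on a split name. `SYNTHETIC_SPLIT_NAME` is a hypothetical value, since the name actually used is not stated here, and percentages are rounded down, which may differ from the real library.

```python
import re
from typing import Optional

# Hypothetical name; the PR does not say which synthetic name it uses.
SYNTHETIC_SPLIT_NAME = "lines"

# Stand-in grammar that, like the parser described above, requires a split name.
NAMED_SPLIT = re.compile(
    r"^(?P<name>\w+)(?:\[(?P<start>-?\d+%?)?:(?P<stop>-?\d+%?)?\])?$"
)


def resolve_bound(text: Optional[str], length: int, default: int) -> int:
    """Turn one slice bound ("3", "-2", "50%" or None) into an absolute index."""
    if text is None:
        return default
    number = int(text.rstrip("%"))
    if text.endswith("%"):
        # Assumption: percentages round down; real parsers may round differently.
        sign = -1 if number < 0 else 1
        number = sign * (length * abs(number) // 100)
    if number < 0:
        number += length
    return max(0, min(number, length))


def get_slice(split_str: str, length: int) -> tuple[int, int]:
    """Parse a named split such as "lines[:3]" into (start, stop) indices."""
    match = NAMED_SPLIT.match(split_str)
    if match is None:
        raise ValueError(f"Invalid split: {split_str!r}")
    return (
        resolve_bound(match["start"], length, 0),
        resolve_bound(match["stop"], length, length),
    )


lines = ["a", "b", "c", "d", "e"]

try:
    get_slice("[:3]", len(lines))
    bare_slice_accepted = True
except ValueError:
    bare_slice_accepted = False

# Prepending a name turns the bare slice into something the parser accepts.
start, stop = get_slice(SYNTHETIC_SPLIT_NAME + "[:3]", len(lines))
selected = lines[start:stop]
print(bare_slice_accepted, selected)
```

The bare slice is rejected and the prefixed one is accepted, so users can keep writing `[:400]` in their configuration while the parser still sees a named split.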

No unit tests exist in the repo; validated with ruff format, ruff check, ty, and a local run of load_prompts over a text file covering the cases above.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for loading prompts from plain text files (one prompt per line) in addition to datasets. It updates the configuration schema to make split and column optional, documents these changes in config.default.toml, and refactors load_prompts in src/heretic/utils.py to handle text files and validate dataset requirements. The review feedback suggests renaming abbreviated variables (name2len and abs_instruction) to comply with the repository style guide, and improving error handling to raise a clear FileNotFoundError when a specified plain text file path does not exist instead of a confusing dataset-related ValueError.

Comment thread src/heretic/utils.py Outdated
Comment on lines +195 to +198
    name2len = {split_name: length}
    # Convert the instructions to absolute indices and select the first one.
    abs_instruction = ReadInstruction.from_spec(split_str).to_absolute(name2len)[0]
    return abs_instruction.from_, abs_instruction.to
Contributor

medium

According to the repository style guide (Rule 1), identifier names should not contain abbreviations unless they are very widely used and understood. The variable names name2len (using len and 2) and abs_instruction (using abs) violate this rule.

Please rename them to name_to_length and absolute_instruction respectively to adhere to the style guide.

Suggested change
- name2len = {split_name: length}
- # Convert the instructions to absolute indices and select the first one.
- abs_instruction = ReadInstruction.from_spec(split_str).to_absolute(name2len)[0]
- return abs_instruction.from_, abs_instruction.to
+ name_to_length = {split_name: length}
+ # Convert the instructions to absolute indices and select the first one.
+ absolute_instruction = ReadInstruction.from_spec(split_str).to_absolute(name_to_length)[0]
+ return absolute_instruction.from_, absolute_instruction.to
References
  1. Identifier names should not contain abbreviations unless those abbreviations are very widely used and understood (e.g. "KL divergence"). (link)

Comment thread src/heretic/utils.py
revision=specification.commit,
split=split_str,
)
if os.path.isfile(path):
Contributor

medium

If a user specifies a local plain text file path that does not exist (and does not specify column in their configuration), os.path.isfile(path) will return False. The code will then fall through to the else block and raise a confusing ValueError: The "split" field is required for datasets or The "column" field is required for datasets instead of a clear FileNotFoundError.

To improve usability and error clarity, we can check if specification.column is None and the file does not exist. If so, we can verify if it's a Hugging Face path (to preserve the missing column error for HF datasets) and otherwise raise a FileNotFoundError.

Suggested change
- if os.path.isfile(path):
+ if specification.column is None and not os.path.isfile(path):
+     is_hf = False
+     try:
+         is_hf = is_hf_path(path)
+     except Exception:
+         pass
+     if is_hf:
+         raise ValueError(f'The "column" field is required for datasets: {path}')
+     raise FileNotFoundError(f"Plain text file not found: {path}")
+ if os.path.isfile(path):

@rocker-zhang
Contributor Author

Renamed name2len and abs_instruction to name_to_length and absolute_instruction per the style guide.

I left out the FileNotFoundError suggestion: the proposed version wraps is_hf_path in a broad try/except, which is the pattern that was removed in #103. If a clearer error for a missing text-file path is useful, I can add a narrow check that does not swallow exceptions.

@p-e-w
Owner

p-e-w commented May 30, 2026

Thanks, I like this! However, since you based this on #103, commit credit to @ricyoung is required. Ideally, start with the commits from #103, then apply your changes as additional commits, but at the absolute minimum, you should add @ricyoung as a co-author on the first commit.

Owner

@p-e-w p-e-w left a comment

Implementation looks flawless. Not something I've seen very often on the first try!

Comment thread src/heretic/utils.py Outdated
user: str


def get_split_slice(split_str: str, length: int, split_name: str) -> tuple[int, int]:
Owner

Can't we get split_name automatically by splitting split_str before the [?

Contributor Author

> Can't we get split_name automatically by splitting split_str before the [?

Good call on split_name: get_split_slice now derives it from the spec (split_str.split("[")[0]), so the extra parameter is gone. The text-file branch still prepends a synthetic name, because a bare slice such as [:400] contains no split name for ReadInstruction.from_spec to parse.
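A small sketch of that derivation, using only the expression quoted in the reply. The wrapper name `get_split_name` is hypothetical; in the PR the expression lives inside get_split_slice.

```python
def get_split_name(split_str: str) -> str:
    """Return the part of a split specification before the first '['."""
    return split_str.split("[")[0]


named_with_slice = get_split_name("train[:400]")
named_only = get_split_name("train")
bare_slice = get_split_name("[:400]")
print(repr(named_with_slice), repr(named_only), repr(bare_slice))
```

A bare slice yields an empty name, which is consistent with the text-file branch needing to supply a name of its own.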

A dataset path that points to a plain file is now read as one prompt per
line, with empty lines ignored. For text files, "column" is ignored and
"split" is optional; when given, it selects a subset of lines using slice
notation (e.g. "[:400]").

Detection uses os.path.isfile so files without an extension also work. The
split-parsing logic is factored into a shared get_split_slice helper, which
derives the split name from the specification, and split/column are now
optional in DatasetSpecification, with the dataset branches raising a clear
error when either is missing. An invalid split raises instead of being
silently ignored.

A bare slice does not parse with the pinned datasets version, since
ReadInstruction.from_spec expects a named split, so the text branch prepends
a synthetic split name.

Revives the approach from p-e-w#103.

Closes p-e-w#98.

Co-authored-by: Ric <ricyoung@gmail.com>
@rocker-zhang rocker-zhang force-pushed the feat/text-file-datasets branch from 3a03b9f to 5fdb0aa Compare May 30, 2026 14:25
@rocker-zhang
Contributor Author

> Thanks, I like this! However, since you based this on #103, commit credit to @ricyoung is required. Ideally, start with the commits from #103, then apply your changes as additional commits, but at the absolute minimum, you should add @ricyoung as a co-author on the first commit.

Thanks! Added @ricyoung as co-author on the commit. The change is based on the current load_prompts, which diverged from #103 after the is_hf_path refactor (#325), so a clean replay of the original commits wasn't possible, but the credit is in place.

@p-e-w p-e-w merged commit b790094 into p-e-w:master May 31, 2026
4 checks passed
@p-e-w
Owner

p-e-w commented May 31, 2026

Yup, that does the trick. Merged, thanks! It's good to have this.

Development

Successfully merging this pull request may close these issues.

add option to custom datasets via txt file

2 participants