feat: support plain text files as prompt datasets by rocker-zhang · Pull Request #337 · p-e-w/heretic

rocker-zhang · 2026-05-30T02:44:19Z

Closes #98.

Revives #103 (closed for author inactivity) on top of the current load_prompts, and resolves the review feedback that was still open there.

A dataset path pointing to a plain file is now read as one prompt per line, with empty lines ignored. For text files, column is ignored and split is optional; when given, split selects a subset of lines with the usual slice notation, e.g. [:400].
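A minimal sketch of the reading behaviour described above, not the code from the PR. `read_prompt_lines` is a hypothetical name, and both the UTF-8 encoding and the stripping of surrounding whitespace are assumptions.

```python
import os
import tempfile


def read_prompt_lines(path: str) -> list[str]:
    """Read one prompt per line, skipping empty lines (hypothetical helper)."""
    with open(path, encoding="utf-8") as file:
        # Assumption: whitespace-only lines count as empty.
        return [line.strip() for line in file if line.strip()]


# Usage: write a small file without an extension and read it back.
with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "prompts")
    with open(path, "w", encoding="utf-8") as file:
        file.write("first prompt\n\nsecond prompt\n   \nthird prompt\n")
    assert os.path.isfile(path)
    prompts = read_prompt_lines(path)

# An optional split such as [:2] would then select a subset of these lines.
subset = prompts[:2]
```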

What changed relative to #103's last state:

- text files are detected with os.path.isfile, so files without an extension work
- split parsing is factored into one helper (get_split_slice) instead of duplicated
- split and column are now optional in DatasetSpecification; the dataset branches raise a clear error when either is missing
- an invalid split now raises instead of being silently ignored
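The optional-field change can be pictured roughly as follows. This is a sketch under assumptions: the dataclass is a stand-in for the real DatasetSpecification (whose base class and full field list are not shown in this thread), and `require_dataset_fields` is a hypothetical helper. Only the two error messages are taken from the review discussion.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetSpecification:
    """Stand-in for illustration; the real class in heretic may differ."""

    dataset: str
    split: Optional[str] = None
    column: Optional[str] = None


def require_dataset_fields(specification: DatasetSpecification) -> tuple[str, str]:
    """Return (split, column), raising a clear error if either is missing."""
    if specification.split is None:
        raise ValueError('The "split" field is required for datasets')
    if specification.column is None:
        raise ValueError('The "column" field is required for datasets')
    return specification.split, specification.column


# A text-file specification may omit both fields...
text_specification = DatasetSpecification(dataset="prompts.txt")
# ...while a dataset specification must provide them.
dataset_specification = DatasetSpecification("org/name", "train[:400]", "text")
fields = require_dataset_fields(dataset_specification)
```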

A bare slice such as [:400] does not parse with the pinned datasets version, because ReadInstruction.from_spec expects a named split. The text branch prepends a synthetic split name so the documented notation works. I checked [:400], [1:3], [:50%], negative indices, and that malformed slices raise.
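To illustrate the synthetic-name idea without depending on the datasets library, here is a simplified stand-in. It is not ReadInstruction: the regular expression, the clamping of out-of-range indices, and the percent rounding are assumptions, and `SYNTHETIC_SPLIT_NAME`, `resolve_slice`, and `resolve_text_slice` are hypothetical names.

```python
import re
from typing import Optional

SYNTHETIC_SPLIT_NAME = "text"  # hypothetical placeholder name

# A split name is required, mirroring the "named split" constraint.
_SLICE_PATTERN = re.compile(r"^(\w+)(?:\[(-?\d+%?)?:(-?\d+%?)?\])?$")


def _to_index(token: Optional[str], length: int, default: int) -> int:
    """Convert one slice boundary to an absolute index within [0, length]."""
    if token is None:
        return default
    if token.endswith("%"):
        index = round(int(token[:-1]) * length / 100)  # assumed rounding
    else:
        index = int(token)
    if index < 0:
        index += length
    return max(0, min(index, length))  # assumed clamping


def resolve_slice(split_str: str, length: int) -> tuple[int, int]:
    """Resolve 'name[from:to]' to absolute (from, to); raise if malformed."""
    match = _SLICE_PATTERN.match(split_str)
    if match is None:
        raise ValueError(f"Invalid split: {split_str!r}")
    _, start, stop = match.groups()
    return _to_index(start, length, 0), _to_index(stop, length, length)


def resolve_text_slice(bare_slice: str, length: int) -> tuple[int, int]:
    """Prepend a synthetic split name so a bare slice parses."""
    return resolve_slice(SYNTHETIC_SPLIT_NAME + bare_slice, length)
```

In this stand-in, `resolve_slice("[:400]", 1000)` raises because the name is missing, while `resolve_text_slice("[:400]", 1000)` succeeds.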

No unit tests exist in the repo; validated with ruff format, ruff check, ty, and a local run of load_prompts over a text file covering the cases above.

gemini-code-assist

Code Review

This pull request introduces support for loading prompts from plain text files (one prompt per line) in addition to datasets. It updates the configuration schema to make split and column optional, documents these changes in config.default.toml, and refactors load_prompts in src/heretic/utils.py to handle text files and validate dataset requirements. The review feedback suggests renaming abbreviated variables (name2len and abs_instruction) to comply with the repository style guide, and improving error handling to raise a clear FileNotFoundError when a specified plain text file path does not exist instead of a confusing dataset-related ValueError.

gemini-code-assist · 2026-05-30T02:45:42Z

+    name2len = {split_name: length}
+    # Convert the instructions to absolute indices and select the first one.
+    abs_instruction = ReadInstruction.from_spec(split_str).to_absolute(name2len)[0]
+    return abs_instruction.from_, abs_instruction.to


According to the repository style guide (Rule 1), identifier names should not contain abbreviations unless they are very widely used and understood. The variable names name2len (using len and 2) and abs_instruction (using abs) violate this rule.

Please rename them to name_to_length and absolute_instruction respectively to adhere to the style guide.

Suggested change

-    name2len = {split_name: length}
-    # Convert the instructions to absolute indices and select the first one.
-    abs_instruction = ReadInstruction.from_spec(split_str).to_absolute(name2len)[0]
-    return abs_instruction.from_, abs_instruction.to
+    name_to_length = {split_name: length}
+    # Convert the instructions to absolute indices and select the first one.
+    absolute_instruction = ReadInstruction.from_spec(split_str).to_absolute(name_to_length)[0]
+    return absolute_instruction.from_, absolute_instruction.to

References

Identifier names should not contain abbreviations unless those abbreviations are very widely used and understood (e.g. "KL divergence").

gemini-code-assist · 2026-05-30T02:45:42Z

-            revision=specification.commit,
-            split=split_str,
-        )
+    if os.path.isfile(path):


If a user specifies a local plain text file path that does not exist (and does not specify column in their configuration), os.path.isfile(path) will return False. The code will then fall through to the else block and raise a confusing ValueError ('The "split" field is required for datasets' or 'The "column" field is required for datasets') instead of a clear FileNotFoundError.

To improve usability and error clarity, we can check if specification.column is None and the file does not exist. If so, we can verify if it's a Hugging Face path (to preserve the missing column error for HF datasets) and otherwise raise a FileNotFoundError.

Suggested change

-    if os.path.isfile(path):
+    if specification.column is None and not os.path.isfile(path):
+        is_hf = False
+        try:
+            is_hf = is_hf_path(path)
+        except Exception:
+            pass
+        if is_hf:
+            raise ValueError(f'The "column" field is required for datasets: {path}')
+        raise FileNotFoundError(f"Plain text file not found: {path}")
+    if os.path.isfile(path):

rocker-zhang · 2026-05-30T02:59:35Z

Renamed name2len and abs_instruction to name_to_length and absolute_instruction per the style guide.

I left out the FileNotFoundError suggestion: the proposed version wraps is_hf_path in a broad try/except, which is the pattern that was removed in #103. If a clearer error for a missing text-file path is useful, I can add a narrow check that does not swallow exceptions.
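One possible shape for such a narrow check, sketched under assumptions rather than taken from the PR: `looks_like_local_path` is a hypothetical heuristic (it is not heretic's is_hf_path), and `check_text_file_path` is a hypothetical name. Nothing is caught here, so unrelated failures propagate unchanged.

```python
import os


def looks_like_local_path(path: str) -> bool:
    """Hypothetical heuristic: only explicit path prefixes count as local."""
    return os.path.isabs(path) or path.startswith(("./", "../", "~"))


def check_text_file_path(path: str) -> None:
    """Raise FileNotFoundError for an explicitly local path that is missing."""
    expanded = os.path.expanduser(path)
    if looks_like_local_path(path) and not os.path.exists(expanded):
        raise FileNotFoundError(f"Plain text file not found: {path}")


# A path that looks like a dataset identifier is left for the dataset branch.
check_text_file_path("organization/dataset-name")
```

The trade-off is that a bare relative name such as prompts.txt is not treated as local by this heuristic, so a missing file with such a name would still reach the dataset branch.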

p-e-w · 2026-05-30T12:12:34Z

Thanks, I like this! However, since you based this on #103, commit credit to @ricyoung is required. Ideally, start with the commits from #103, then apply your changes as additional commits, but at the absolute minimum, you should add @ricyoung as a co-author on the first commit.

p-e-w

Implementation looks flawless. Not something I've seen very often on the first try!

p-e-w · 2026-05-30T12:20:58Z

    user: str


+def get_split_slice(split_str: str, length: int, split_name: str) -> tuple[int, int]:


Can't we get split_name automatically by splitting split_str before the [?

> Can't we get split_name automatically by splitting split_str before the [?

Good call on split_name: get_split_slice now derives it from the spec (split_str.split("[")[0]), so the extra parameter is gone. The text-file branch still prepends a synthetic name for the same reason.
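The derivation is plain string splitting, as this small sketch shows; `derive_split_name` is a hypothetical wrapper around the expression quoted in the reply. Note that a bare slice yields an empty name, which is consistent with the text-file branch still needing to prepend one.

```python
def derive_split_name(split_str: str) -> str:
    """Return everything before the first '[' (the whole string if none)."""
    return split_str.split("[")[0]


named = derive_split_name("train[:400]")   # "train"
plain = derive_split_name("train")         # "train"
bare = derive_split_name("[:400]")         # "" (no name to derive)
```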

A dataset path that points to a plain file is now read as one prompt per line, with empty lines ignored. For text files, "column" is ignored and "split" is optional; when given, it selects a subset of lines using slice notation (e.g. "[:400]"). Detection uses os.path.isfile so files without an extension also work.

The split-parsing logic is factored into a shared get_split_slice helper, which derives the split name from the specification, and split/column are now optional in DatasetSpecification, with the dataset branches raising a clear error when either is missing. An invalid split raises instead of being silently ignored.

A bare slice does not parse with the pinned datasets version, since ReadInstruction.from_spec expects a named split, so the text branch prepends a synthetic split name.

Revives the approach from p-e-w#103.
Closes p-e-w#98.

Co-authored-by: Ric <ricyoung@gmail.com>

rocker-zhang · 2026-05-30T14:33:18Z

> Thanks, I like this! However, since you based this on #103, commit credit to @ricyoung is required. Ideally, start with the commits from #103, then apply your changes as additional commits, but at the absolute minimum, you should add @ricyoung as a co-author on the first commit.

Thanks! Added @ricyoung as co-author on the commit. The change is based on the current load_prompts, which diverged from #103 after the is_hf_path refactor (#325), so a clean replay of the original commits wasn't possible, but the credit is in place.

p-e-w · 2026-05-31T09:37:18Z

Yup, that does the trick. Merged, thanks! It's good to have this.

gemini-code-assist Bot reviewed May 30, 2026

p-e-w approved these changes May 30, 2026

rocker-zhang force-pushed the feat/text-file-datasets branch from 3a03b9f to 5fdb0aa on May 30, 2026 14:25

p-e-w merged commit b790094 into p-e-w:master May 31, 2026
4 checks passed

p-e-w merged 1 commit into p-e-w:master from rocker-zhang:feat/text-file-datasets


