feat: support plain text files as prompt datasets#337
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces support for loading prompts from plain text files (one prompt per line) in addition to datasets. It updates the configuration schema to make split and column optional, documents these changes in config.default.toml, and refactors load_prompts in src/heretic/utils.py to handle text files and validate dataset requirements. The review feedback suggests renaming abbreviated variables (name2len and abs_instruction) to comply with the repository style guide, and improving error handling to raise a clear FileNotFoundError when a specified plain text file path does not exist instead of a confusing dataset-related ValueError.
| name2len = {split_name: length} | ||
| # Convert the instructions to absolute indices and select the first one. | ||
| abs_instruction = ReadInstruction.from_spec(split_str).to_absolute(name2len)[0] | ||
| return abs_instruction.from_, abs_instruction.to |
There was a problem hiding this comment.
According to the repository style guide (Rule 1), identifier names should not contain abbreviations unless they are very widely used and understood. The variable names name2len (using len and 2) and abs_instruction (using abs) violate this rule.
Please rename them to name_to_length and absolute_instruction respectively to adhere to the style guide.
| name2len = {split_name: length} | |
| # Convert the instructions to absolute indices and select the first one. | |
| abs_instruction = ReadInstruction.from_spec(split_str).to_absolute(name2len)[0] | |
| return abs_instruction.from_, abs_instruction.to | |
| name_to_length = {split_name: length} | |
| # Convert the instructions to absolute indices and select the first one. | |
| absolute_instruction = ReadInstruction.from_spec(split_str).to_absolute(name_to_length)[0] | |
| return absolute_instruction.from_, absolute_instruction.to |
References
- Identifier names should not contain abbreviations unless those abbreviations are very widely used and understood (e.g. "KL divergence"). (link)
| revision=specification.commit, | ||
| split=split_str, | ||
| ) | ||
| if os.path.isfile(path): |
There was a problem hiding this comment.
If a user specifies a local plain text file path that does not exist (and does not specify column in their configuration), os.path.isfile(path) will return False. The code will then fall through to the else block and raise a confusing ValueError: The "split" field is required for datasets or The "column" field is required for datasets instead of a clear FileNotFoundError.
To improve usability and error clarity, we can check if specification.column is None and the file does not exist. If so, we can verify if it's a Hugging Face path (to preserve the missing column error for HF datasets) and otherwise raise a FileNotFoundError.
| if os.path.isfile(path): | |
| if specification.column is None and not os.path.isfile(path): | |
| is_hf = False | |
| try: | |
| is_hf = is_hf_path(path) | |
| except Exception: | |
| pass | |
| if is_hf: | |
| raise ValueError(f'The "column" field is required for datasets: {path}') | |
| raise FileNotFoundError(f"Plain text file not found: {path}") | |
| if os.path.isfile(path): |
|
Renamed I left out the |
p-e-w
left a comment
There was a problem hiding this comment.
Implementation looks flawless. Not something I've seen very often on the first try!
| user: str | ||
|
|
||
|
|
||
| def get_split_slice(split_str: str, length: int, split_name: str) -> tuple[int, int]: |
There was a problem hiding this comment.
Can't we get split_name automatically by splitting split_str before the [?
There was a problem hiding this comment.
Can't we get
split_nameautomatically by splittingsplit_strbefore the[?
Good call on split_name: get_split_slice now derives it from the spec (split_str.split("[")[0]), so the extra parameter is gone. The text-file branch still prepends a synthetic name for the same reason
A dataset path that points to a plain file is now read as one prompt per line, with empty lines ignored. For text files, "column" is ignored and "split" is optional; when given, it selects a subset of lines using slice notation (e.g. "[:400]"). Detection uses os.path.isfile so files without an extension also work. The split-parsing logic is factored into a shared get_split_slice helper, which derives the split name from the specification, and split/column are now optional in DatasetSpecification, with the dataset branches raising a clear error when either is missing. An invalid split raises instead of being silently ignored. A bare slice does not parse with the pinned datasets version, since ReadInstruction.from_spec expects a named split, so the text branch prepends a synthetic split name. Revives the approach from p-e-w#103. Closes p-e-w#98. Co-authored-by: Ric <ricyoung@gmail.com>
3a03b9f to
5fdb0aa
Compare
Thanks! Added @ricyoung as co-author on the commit. The change is based on the current load_prompts, which diverged from #103 after the is_hf_path refactor (#325), so a clean replay of the original commits wasn't possible, but the credit is in place |
|
Yup, that does the trick. Merged, thanks! It's good to have this. |
Closes #98.
Revives #103 (closed for author inactivity) on top of the current
load_prompts, and resolves the review feedback that was still open there.A
datasetpath pointing to a plain file is now read as one prompt per line, with empty lines ignored. For text filescolumnis ignored andsplitis optional; when given, it selects a subset of lines with the usual slice notation, e.g.[:400].What changed relative to #103's last state:
os.path.isfile, so files without an extension workget_split_slice) instead of duplicatedsplitandcolumnare now optional inDatasetSpecification; the dataset branches raise a clear error when either is missingA bare slice such as
[:400]does not parse with the pinneddatasetsversion, becauseReadInstruction.from_specexpects a named split. The text branch prepends a synthetic split name so the documented notation works. I checked[:400],[1:3],[:50%], negative indices, and that malformed slices raise.No unit tests exist in the repo; validated with
ruff format,ruff check,ty, and a local run ofload_promptsover a text file covering the cases above.