
Improved chunking algorithm #85

Merged

jnu merged 22 commits into main from chunk
Mar 25, 2025

Conversation

Contributor

jnu commented Mar 17, 2025

  • Refactor code to clean up modules and pipeline
  • Clean up typing so there's less confusion with the ambiguous word "alias" and the ambiguous type NameMap which was used for different purposes in different places. Break these into purpose-specific types with explicit names.
  • Improve the prompting so the LLM also finds these things less ambiguous. Support XML in the prompt to add a lot more clarity to how these entities should be interpreted (at the expense of more tokens).
  • Rewrite the chunking algorithm to use a first-class pipeline module, control:chunk, which manages chopping up an input text and feeding it to a redactor. The redactor can itself be a constrained version of a pipe (via a new control:compose module), which lets us run the existing inspect infrastructure as designed instead of relying on another custom implementation that's intertwined with the redact module.

To do:

  • Refactor the compose module with the Pipeline to improve both of them, especially w/r/t type checking
  • Ensure that the results of inspect are properly accumulated inside the Context object
  • Ensure that the Context inspect results are properly fed back into the chunk processor
  • Smarter cutting of inputs (refactor from old code)
  • Smarter stitching of outputs (refactor from old code)
  • Tests
  • Cleanup

Fixes #84
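The chunk/compose design described above can be sketched roughly as follows. All names and signatures here are illustrative only; the real control:chunk and control:compose modules are pipeline modules with far more machinery (Context accumulation, inspect feedback, smarter cutting and stitching):

```python
# Illustrative sketch of the control:chunk / control:compose idea.
# Nothing here is the actual implementation.
from typing import Callable

Redactor = Callable[[str], str]

def compose(*steps: Redactor) -> Redactor:
    """Wrap a sequence of processing steps into a single redactor
    (the role control:compose plays in the PR)."""
    def composed(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return composed

def chunk(text: str, redactor: Redactor, max_len: int = 1000) -> str:
    """Chop the input into pieces, feed each to the redactor, and stitch
    the results back together (the role control:chunk plays in the PR).
    The real module cuts and stitches much more carefully."""
    pieces = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    return "".join(redactor(p) for p in pieces)
```

The point of the split is that the redactor handed to `chunk` is just an ordinary pipe, so the existing inspect infrastructure runs unmodified inside each chunk.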

ModelMeta(name="gpt-4o-2024-05-13", context=128_000, output=4_096),
ModelMeta(name="gpt-4o-2024-08-06", context=128_000, output=16_384),
ModelMeta(name="gpt-4o-2024-11-20", context=128_000, output=16_384),
ModelMeta(name="o3-mini-2025-01-31", context=200_000, output=100_000),
Contributor

I know that the idea of output token limits can be a little fuzzy for reasoning models (since some of the tokens may be eaten up by the reasoning phase) — do we account for that somehow?

Contributor Author

No, we don't, but we didn't before either. It's a good point, but it's not something I'm trying to solve or look into here.

),
# 2. The original existing text should be *complete*, while the addition is not!
existing_t.original,
# 3. The initial delimiters are empty, so fill them in from the addition.
Contributor

Very small nit but I think this comment should say
If the initial delimiters are empty, fill them in from the addition

If I understand correctly?

Contributor Author

I can see why you'd read it that way, but that's not what I'm emphasizing: the very first delimiters property we see will be empty, because we don't yet know its value, and it must be filled in from the first addition. In subsequent runs the property won't be empty and won't be filled in from the addition.
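The rule being described boils down to something like this (names invented for illustration; the real code operates on richer text objects):

```python
# Hypothetical illustration of the delimiter fill-in rule discussed above:
# the property starts empty and is filled from the *first* addition only;
# later additions never overwrite an established value.
def fill_delimiters(current: str, addition_delims: str) -> str:
    return addition_delims if not current else current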

new_text = cast(RedactedText, new).redacted
else:
raise ValueError(f"Unsupported return type: {self.return_type}")
return residual(old_text, new_text, needle_size=window_size)
Contributor

I'm curious from a design perspective, is there any reason to keep this function in common/align.py? This seems like it might be the only place we use it now.

Contributor Author

doesn't matter to me!
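The actual residual() in common/align.py isn't shown in this excerpt, but the call `residual(old_text, new_text, needle_size=window_size)` suggests window-based stitching along these lines. This is only a sketch of the idea, not the real alignment code:

```python
# Rough illustration of window-based residual stitching: match a trailing
# "needle" of the old text inside the new text, and return whatever follows
# the match as the residual. The real residual() in common/align.py is
# certainly more robust than this.
def residual(old_text: str, new_text: str, needle_size: int = 16) -> str:
    needle = old_text[-needle_size:]
    pos = new_text.rfind(needle)
    if pos == -1:
        # No overlap found: treat all of new_text as residual.
        return new_text
    return new_text[pos + len(needle):]
```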


For people in the following list, replace their name or any associated nickname with the pre-specified placeholder. Do not change this placeholder in any way, use it exactly as it was provided.
{preset_aliases}
Replace all names specified by the `RealName` element in the following XML with the pre-specified placeholder given by the `Placeholder` element. Do not change this placeholder in any way, use it exactly as it was provided. Consider variants of the `RealName` if they refer to the same person.
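As an illustration, the `{preset_aliases}` block might expand to XML along these lines. Only the `RealName` and `Placeholder` element names come from the prompt above; the wrapper elements and values here are invented:

```xml
<People>
  <Person>
    <RealName>Jane Doe</RealName>
    <Placeholder>[PERSON_1]</Placeholder>
  </Person>
</People>
```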
Contributor

Will this handle placeholders for non-human entities (e.g., for a given race; or a restaurant or address) between documents?

Contributor Author

Sorry, I should've emphasized not to look at anything outside of chunk.py because it's still a WIP as I sort out other issues! The prompt included; this is not ready.

jnu changed the title from "[WIP] Improved chunking algorithm" to "Improved chunking algorithm" on Mar 24, 2025
jnu merged commit e8ef381 into main on Mar 25, 2025
1 check passed
jnu deleted the chunk branch on March 25, 2025 at 18:47

Development

Successfully merging this pull request may close these issues.

Redacting entities with title information

2 participants