
Improved chunking algorithm #85

Merged

jnu merged 22 commits into main from chunk
Mar 25, 2025

Conversation

Contributor

jnu commented Mar 17, 2025

  • Refactor code to clean up modules and pipeline
  • Clean up typing so there's less confusion with the ambiguous word "alias" and the ambiguous type NameMap which was used for different purposes in different places. Break these into purpose-specific types with explicit names.
  • Improve the prompting so the LLM also finds these things less ambiguous. Support XML in the prompt to add a lot more clarity to how these entities should be interpreted (at the expense of more tokens).
  • Rewrite the chunking algorithm to use a first-class pipeline module, control:chunk, which manages chopping up an input text and feeding it to a redactor. The redactor can itself be a constrained version of a pipe (via a new control:compose module), which lets us run the existing inspect infrastructure as designed instead of relying on another custom implementation that's intertwined with the redact module.

To do:

  • Refactor the compose module with the Pipeline to improve both of them, especially w/r/t type checking
  • Ensure that the results of inspect are properly accumulated inside the Context object
  • Ensure that the Context inspect results are properly fed back into the chunk processor
  • Smarter cutting of inputs (refactor from old code)
  • Smarter stitching of outputs (refactor from old code)
  • Tests
  • Cleanup

Fixes #84
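The chunk/compose design described above can be sketched roughly as follows. All names and signatures here are illustrative only; the real control:chunk and control:compose modules are pipeline modules with far more machinery (Context accumulation, inspect feedback, smarter cutting and stitching):

```python
# Illustrative sketch of the control:chunk / control:compose idea.
# Nothing here is the actual implementation.
from typing import Callable

Redactor = Callable[[str], str]

def compose(*steps: Redactor) -> Redactor:
    """Wrap a sequence of processing steps into a single redactor
    (the role control:compose plays in the PR)."""
    def composed(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return composed

def chunk(text: str, redactor: Redactor, max_len: int = 1000) -> str:
    """Chop the input into pieces, feed each to the redactor, and stitch
    the results back together (the role control:chunk plays in the PR).
    The real module cuts and stitches much more carefully."""
    pieces = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    return "".join(redactor(p) for p in pieces)
```

The point of the split is that the redactor handed to `chunk` is just an ordinary pipe, so the existing inspect infrastructure runs unmodified inside each chunk.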

ModelMeta(name="gpt-4o-2024-05-13", context=128_000, output=4_096),
ModelMeta(name="gpt-4o-2024-08-06", context=128_000, output=16_384),
ModelMeta(name="gpt-4o-2024-11-20", context=128_000, output=16_384),
ModelMeta(name="o3-mini-2025-01-31", context=200_000, output=100_000),
Contributor

I know that the idea of output token limits can be a little fuzzy for reasoning models (since some of the tokens may be eaten up by the reasoning phase) — do we account for that somehow?

Contributor Author

No, we don't, but we didn't before either. It's a good point, but it's not something I'm trying to solve or look into here.

),
# 2. The original existing text should be *complete*, while the addition is not!
existing_t.original,
# 3. The initial delimiters are empty, so fill them in from the addition.
Contributor

Very small nit but I think this comment should say
If the initial delimiters are empty, fill them in from the addition

If I understand correctly?

Contributor Author

I can see why you'd read it that way, but that's not what I'm emphasizing: the very first delimiters property we see will be empty, because we don't yet know its value, and it must be filled in from the first addition. In subsequent runs the property won't be empty and won't be filled in from the addition.
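The rule being described boils down to something like this (names invented for illustration; the real code operates on richer text objects):

```python
# Hypothetical illustration of the delimiter fill-in rule discussed above:
# the property starts empty and is filled from the *first* addition only;
# later additions never overwrite an established value.
def fill_delimiters(current: str, addition_delims: str) -> str:
    return addition_delims if not current else current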

new_text = cast(RedactedText, new).redacted
else:
raise ValueError(f"Unsupported return type: {self.return_type}")
return residual(old_text, new_text, needle_size=window_size)
Contributor

I'm curious from a design perspective, is there any reason to keep this function in common/align.py? This seems like it might be the only place we use it now.

Contributor Author

doesn't matter to me!
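The actual residual() in common/align.py isn't shown in this excerpt, but the call `residual(old_text, new_text, needle_size=window_size)` suggests window-based stitching along these lines. This is only a sketch of the idea, not the real alignment code:

```python
# Rough illustration of window-based residual stitching: match a trailing
# "needle" of the old text inside the new text, and return whatever follows
# the match as the residual. The real residual() in common/align.py is
# certainly more robust than this.
def residual(old_text: str, new_text: str, needle_size: int = 16) -> str:
    needle = old_text[-needle_size:]
    pos = new_text.rfind(needle)
    if pos == -1:
        # No overlap found: treat all of new_text as residual.
        return new_text
    return new_text[pos + len(needle):]
```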


For people in the following list, replace their name or any associated nickname with the pre-specified placeholder. Do not change this placeholder in any way, use it exactly as it was provided.
{preset_aliases}
Replace all names specified by the `RealName` element in the following XML with the pre-specified placeholder given by the `Placeholder` element. Do not change this placeholder in any way, use it exactly as it was provided. Consider variants of the `RealName` if they refer to the same person.
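As an illustration, the `{preset_aliases}` block might expand to XML along these lines. Only the `RealName` and `Placeholder` element names come from the prompt above; the wrapper elements and values here are invented:

```xml
<People>
  <Person>
    <RealName>Jane Doe</RealName>
    <Placeholder>[PERSON_1]</Placeholder>
  </Person>
</People>
```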
Contributor

Will this handle placeholders for non-human entities (e.g., for a given race; or a restaurant or address) between documents?

Contributor Author

Sorry, I should've emphasized not to look at anything outside of chunk.py because it's still a WIP as I sort out other issues! The prompt included; this is not ready.

jnu changed the title from "[WIP] Improved chunking algorithm" to "Improved chunking algorithm" on Mar 24, 2025
jnu merged commit e8ef381 into main on Mar 25, 2025
1 check passed
jnu deleted the chunk branch on March 25, 2025 at 18:47

Development

Successfully merging this pull request may close these issues.

Redacting entities with title information

2 participants