Code-prose-composition tagger. #247

no0p · 2025-02-28T19:07:42Z

Tagger for Code Prose Composition

Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.

Produces tags for

code/prose/other composition as a percent of the document
code/prose/other mean entropy
code/prose/other line counts
code-prose boundary count
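As a rough illustration of how the tags listed above could be derived from per-line classifications, here is a minimal sketch. It is not the tagger's actual implementation: the function name `summarize_composition`, the output key names, and the definition of a boundary (an adjacent pair of lines where one is code and the other is prose) are all assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List

LABELS = ("code", "prose", "other")


def summarize_composition(labels: List[str], entropies: List[float]) -> Dict[str, float]:
    """Summarize per-line labels into document-level statistics.

    `labels` holds one predicted label per line and `entropies` the
    prediction entropy for the same line (hypothetical inputs).
    """
    if len(labels) != len(entropies):
        raise ValueError("labels and entropies must have the same length")

    counts = Counter(labels)
    total = len(labels)
    summary: Dict[str, float] = {}

    for label in LABELS:
        n = counts.get(label, 0)
        label_entropies = [e for lab, e in zip(labels, entropies) if lab == label]
        summary[f"{label}_count"] = float(n)
        # Share of the document's lines that carry this label.
        summary[f"{label}_composition"] = n / total if total else 0.0
        summary[f"{label}_mean_entropy"] = sum(label_entropies) / n if n else 0.0

    # Assumed definition: count adjacent line pairs that switch between code and prose.
    summary["code_prose_boundaries"] = float(
        sum(1 for a, b in zip(labels, labels[1:]) if {a, b} == {"code", "prose"})
    )
    return summary
```

For a five-line document labelled prose, code, code, other, prose, this sketch reports two code lines, a code share of 0.4, and one code-prose boundary.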

Recommended filter for mixed prose/code content based on these tags is:

exp__code_prose_composition__code > 0.05
exp__code_prose_composition__prose > 0.3
exp__code_prose_composition__code_count >= 8
exp__code_prose_composition__code_mean_entropy < 0.5
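The four thresholds above can be expressed as a single predicate. This is a sketch under two assumptions not stated in the description: that the conditions are combined with a logical AND, and that each attribute has already been flattened to a single document-level number. The helper name `is_mixed_prose_code` is hypothetical.

```python
from typing import Mapping

# Key prefix as it appears in the recommended filter.
PREFIX = "exp__code_prose_composition__"


def is_mixed_prose_code(attrs: Mapping[str, float]) -> bool:
    """Return True when a document passes all four recommended thresholds."""
    return (
        attrs.get(PREFIX + "code", 0.0) > 0.05
        and attrs.get(PREFIX + "prose", 0.0) > 0.3
        and attrs.get(PREFIX + "code_count", 0.0) >= 8
        # A missing entropy value is treated as "not confident" and fails the filter.
        and attrs.get(PREFIX + "code_mean_entropy", float("inf")) < 0.5
    )
```

A document with 20% code, 60% prose, 12 code lines, and a code mean entropy of 0.2 passes; the same document with a code mean entropy of 0.9 does not.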

The code entropy filter compensates for the classifier's bias towards labelling short strings as code when they include "code-y" characters such as (, ), [, ], and :. The bias comes from a lack of good negative examples. Until there is time to build an improved classifier, filtering for high-confidence code predictions via mean entropy works sufficiently well.
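To make the link between entropy and confidence concrete, here is a small sketch using Shannon entropy over a line's predicted class probabilities. The exact entropy definition and logarithm base used by the tagger are not stated in this PR, so base 2 and the function name `prediction_entropy` are assumptions.

```python
import math
from typing import Sequence


def prediction_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (base 2, assumed) of a class probability distribution."""
    # Zero-probability classes contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# A confident "code" prediction has low entropy; an uncertain one has high entropy.
confident = prediction_entropy([0.98, 0.01, 0.01])
uncertain = prediction_entropy([0.5, 0.3, 0.2])
```

Under these assumptions the confident prediction falls below the 0.5 threshold and the uncertain one falls above it, which is the behaviour the mean entropy filter relies on.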

One More Thing

Updated a pre-suite hook to set the multiprocessing start method to spawn. This prevents a side effect where test case dependencies may set it to the default, fork, which violates runtime assertions.
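A minimal sketch of what such a hook might do, using only the standard library. The function name `force_spawn_start_method` is hypothetical, and how the hook is registered with the test runner is not shown in this PR, so that part is left out.

```python
import multiprocessing


def force_spawn_start_method() -> str:
    """Set the multiprocessing start method to "spawn" and return the active method.

    force=True overrides a start method that an imported dependency
    may already have set earlier in the process.
    """
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()
```

Calling this once before any test runs means later runtime assertions that check for "spawn" see a consistent value regardless of import order.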

Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.
Coverage for code-prose-composition tagger.
Improve error messages for spawn method checks.
Set multiprocessing start method in test setup.
Set multiprocessing in test case with error handling.
Add before suite hook to set mp start method.

The default multiprocessing start method is "fork", which is not compatible with runtime assertions that it is set to "spawn". When running unit tests, it's possible to call an external library that sets the start method to "fork". Here we enforce the start method to be "spawn" for all tests before executing.

Linting.
Remove error log messages.

Additionally update commentary and word choice.

Whattabatt · 2025-02-28T23:35:14Z

Generated attribute line example:

I think you might be able to keep this shorter by cutting the key in line_label, maybe elsewhere

Whattabatt

LGTM once you see if you can make the attribute keys shorter

Whattabatt · 2025-03-01T00:59:25Z

python/dolma/taggers/code_composition.py

+from ..core.registry import TaggerRegistry
+
+
+@TaggerRegistry.add("code-prose-composition")


Other taggers use underscores instead of dashes

Shorten prediction labels for readability and type-ability.

no0p · 2025-03-01T17:46:00Z

It looks like the format produced is as follows.

"{experiment_name}__{tagger_name}__{prediction_label}"

In the posted example, the experiment name matching the tagger name makes it look worse. Experiment name is somewhat out of our control. Nonetheless, there is room for improvement, so I made the following changes in the interest of brevity.
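The key format quoted above is a double-underscore join of three parts. A small sketch, with a hypothetical helper name `attribute_key`, shows why a long tagger name and long labels add up quickly:

```python
def attribute_key(experiment_name: str, tagger_name: str, prediction_label: str) -> str:
    """Build an attribute key in the "{experiment}__{tagger}__{label}" format."""
    return f"{experiment_name}__{tagger_name}__{prediction_label}"


# Illustrative values only: an experiment named after the tagger repeats the name in every key.
long_key = attribute_key("code_prose_composition", "code_prose_composition", "code_mean_entropy")
short_key = attribute_key("my_experiment", "code_composition", "code_entropy")
```

Because the experiment name is chosen by the user, only the tagger name and the prediction labels can be shortened on the tagger side.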

code_prose_composition -> code_composition

code_count -> code
prose_count -> prose
other_count -> other

code_composition -> code_pct
prose_composition -> prose_pct
other_composition -> other_pct

code_mean_entropy -> code_entropy
prose_mean_entropy -> prose_entropy
other_mean_entropy -> other_entropy

code_prose_boundaries -> boundaries

This results in records like the following:

{"id":"1","attributes":{
"my_experiment__code_composition__boundaries":[[0,616,0.0]],
"my_experiment__code_composition__other_pct":[[0,616,0.33]],
"my_experiment__code_composition__other":[[0,616,1.0]],
"my_experiment__code_composition__other_entropy": [[0,616,1.22949]],
"my_experiment__code_composition__prose_pct": [[0,616,0.67]],
"my_experiment__code_composition__prose": [[0,616,2.0]],
"my_experiment__code_composition__prose_entropy":[[0,616,0.002]]},
"source":"fake"}

I think it's a little more manageable and a decent trade-off between explicitness and gigantic label keys. If you see more room for improvement, please let me know.
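For reading records like the one above downstream, here is a hedged sketch that pulls the document-level score out of each attribute. It assumes each span is a `[start, end, score]` triple and that every attribute carries a single span covering the whole document, as in the example record. The helper name `span_scores` is hypothetical.

```python
import json
from typing import Dict


def span_scores(record_json: str, experiment: str, tagger: str) -> Dict[str, float]:
    """Map each prediction label to the score of its first span."""
    record = json.loads(record_json)
    prefix = f"{experiment}__{tagger}__"
    scores: Dict[str, float] = {}
    for key, spans in record["attributes"].items():
        if key.startswith(prefix) and spans:
            # Assumed span layout: [start, end, score].
            scores[key[len(prefix):]] = spans[0][2]
    return scores


example = (
    '{"id":"1","attributes":{'
    '"my_experiment__code_composition__prose_pct":[[0,616,0.67]],'
    '"my_experiment__code_composition__prose":[[0,616,2.0]]},'
    '"source":"fake"}'
)
scores = span_scores(example, "my_experiment", "code_composition")
```

With the two attributes in `example`, `scores` maps `prose_pct` to 0.67 and `prose` to 2.0.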

Whattabatt · 2025-03-01T21:15:50Z

I think it's a little more manageable and a decent trade-off between explicitness and gigantic label keys. If you see more room for improvement, please let me know.

Yeah, good enough, dolma attributes are always a bit ugly for human reading

no0p force-pushed the code-prose-composition-b branch 2 times, most recently from bed5fe8 to 9eae084, on February 28, 2025 21:20

no0p force-pushed the code-prose-composition-b branch from 9eae084 to db8b744 on February 28, 2025 21:50

Add code-prose tagger to init import.

e715591

Additionally update commentary and word choice.

no0p requested review from soldni and Whattabatt February 28, 2025 22:14

Whattabatt approved these changes Feb 28, 2025

Whattabatt reviewed Mar 1, 2025

Brevity in prediction labels.

845fe2f

Shorten prediction labels for readability and type-ability.

no0p merged commit a1755cd into main Mar 4, 2025
14 checks passed

no0p deleted the code-prose-composition-b branch March 4, 2025 00:21
