Code-prose-composition tagger. #247
Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.
Coverage for code-prose-composition tagger.
Improve error messages for spawn method checks.
Set multiprocessing start method in test setup.
Set multiprocessing in test case with error handling.
Add before suite hook to set mp start method. The default multiprocessing start method is "fork", which is not compatible with runtime assertions that it is set to "spawn". When running unit tests, it's possible to call an external library that sets the start method to "fork". Here we enforce the start method to be "spawn" for all tests before executing.
Linting.
Remove error log messages.
Additionally update commentary and word choice.
Generated attribute line example: I think you might be able to keep this shorter by cutting the key in line_label, maybe elsewhere
LGTM once you see if you can make the attribute keys shorter
from ..core.registry import TaggerRegistry


@TaggerRegistry.add("code-prose-composition")
Other taggers use underscores instead of dashes
Shorten prediction labels for readability and type-ability.
It looks like the format produced is as follows.
In the posted example, the experiment name matching the tagger name makes it look worse. Experiment name is somewhat out of our control. Nonetheless, there is room for improvement, so I made the following changes in the interest of brevity.
This results in records like the following:
I think it's a little more manageable and a decent trade-off between explicitness and gigantic label keys. If you see more room for improvement, please let me know.
Yeah, good enough, dolma attributes are always a bit ugly for human reading
Tagger for Code Prose Composition
Add a tagger that adds attributes for code-prose-other composition of files based on line classifications.
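As a rough illustration of what "composition based on line classifications" can mean, here is a minimal sketch that turns per-line labels into per-file fractions. The label names `code`, `prose`, and `other` are taken from the description above, but the function name `composition` and the output shape are assumptions for illustration, not the tagger's actual interface or attribute format.

```python
from collections import Counter
from typing import Dict, Sequence

# Assumed label set, based on the "code-prose-other" wording in this PR.
LABELS = ("code", "prose", "other")


def composition(line_labels: Sequence[str]) -> Dict[str, float]:
    """Return the fraction of lines carrying each label (hypothetical helper)."""
    if not line_labels:
        # No lines classified: report zero for every label.
        return {label: 0.0 for label in LABELS}
    counts = Counter(line_labels)
    total = len(line_labels)
    # Labels outside LABELS still count towards the total but are not reported.
    return {label: counts[label] / total for label in LABELS}


if __name__ == "__main__":
    print(composition(["code", "code", "prose", "other"]))
```

In this sketch a file with two code lines, one prose line, and one other line comes out as half code and a quarter each of prose and other.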
Produces tags for
Recommended filter for mixed prose/code content based on these tags is:
The code entropy adjusts for a bias towards code on short strings that include "code-y" characters such as (, ), [, ], and :, which stems from a lack of good negative examples. Until there is time to build an improved classifier, including a filter for high-confidence code predictions via mean entropy works sufficiently well.
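To make the entropy idea concrete, here is a small sketch of one way a mean prediction entropy could be computed. It assumes each line comes with a probability distribution over classes and that the mean is taken over lines; the helper names are hypothetical and the actual tagger may compute its entropy attribute differently. Lower mean entropy corresponds to more confident predictions.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of one probability distribution."""
    # Zero-probability classes contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mean_entropy(line_probs: Sequence[Sequence[float]]) -> float:
    """Average per-line entropy; 0.0 when there are no lines (assumed convention)."""
    if not line_probs:
        return 0.0
    return sum(entropy(probs) for probs in line_probs) / len(line_probs)


if __name__ == "__main__":
    confident = [[0.98, 0.01, 0.01], [0.95, 0.03, 0.02]]
    uncertain = [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]]
    # The confident predictions have the lower mean entropy.
    print(mean_entropy(confident) < mean_entropy(uncertain))
```

A filter built on this would keep code predictions only when the mean entropy falls below some threshold; the threshold itself is not stated here and would need to come from the PR's recommended filter.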
One More Thing
Updated a pre-suite hook to set the multiprocessing start method to spawn, to prevent a side effect where test case dependencies may set it to the default fork, violating runtime assertions.
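A minimal sketch of what such a hook might do, using only the standard library. The function name `ensure_spawn_start_method` is hypothetical, and how the hook is wired into the test runner is not shown in this PR text, so that part is left out.

```python
import multiprocessing


def ensure_spawn_start_method() -> str:
    """Force the multiprocessing start method to "spawn" and return it."""
    # force=True overrides a start method that was already fixed earlier,
    # for example by an imported dependency.
    multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    # Call once before any test creates worker processes.
    print(ensure_spawn_start_method())
```

Without `force=True`, `set_start_method` raises a `RuntimeError` if the start method has already been set, which is why the override is used in this sketch.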