The following Jupyter notebook tutorials demonstrate how to use the text classification models supported by NeMo Curator. These classifiers help with data annotation, which is useful for data blending in foundation model training.
Each of these classifiers is available on Hugging Face and can be run independently with the Transformers library. Running them with NeMo Curator accelerates them through a heterogeneous pipeline in which tokenization runs across CPUs and model inference runs across GPUs. Each Jupyter notebook in this directory demonstrates how to run the classifiers on text data and scales easily to large amounts of data.
Before running any of these notebooks, see the Installation Guide page for instructions on how to install NeMo Curator. Be sure to use an installation method that includes GPU dependencies.
For more information about the classifiers, refer to our Distributed Data Classification documentation page.
| NeMo Curator Classifier | Description | Hugging Face Page |
|---|---|---|
| AegisClassifier | Identify and categorize unsafe content per document | nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 and nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0 |
| ContentTypeClassifier | Categorize the type-of-speech per document | nvidia/content-type-classifier-deberta |
| DomainClassifier | Categorize the domain per document | nvidia/domain-classifier |
| FineWebEduClassifier | Determine the educational value per document; this model was trained using annotations from Llama 3 70B-Instruct | HuggingFaceFW/fineweb-edu-classifier |
| FineWebMixtralEduClassifier | Determine the educational value per document; this model was trained using annotations from Mixtral 8x22B-Instruct | nvidia/nemocurator-fineweb-mixtral-edu-classifier |
| FineWebNemotronEduClassifier | Determine the educational value per document; this model was trained using annotations from Nemotron-4-340B-Instruct | nvidia/nemocurator-fineweb-nemotron-4-edu-classifier |
| InstructionDataGuardClassifier | Identify LLM poisoning attacks per document | nvidia/instruction-data-guard |
| MultilingualDomainClassifier | Categorize the domain per document; supports classification in 52 languages | nvidia/multilingual-domain-classifier |
| PromptTaskComplexityClassifier | Classifies text prompts across task types and complexity dimensions | nvidia/prompt-task-and-complexity-classifier |
| QualityClassifier | Categorize documents as high, medium, or low quality | quality-classifier-deberta |
Note that all classifiers support English text classification only, except the MultilingualDomainClassifier.
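To illustrate how classifier annotations can feed data blending, here is a minimal sketch in plain Python. It does not use NeMo Curator itself: the records, the `quality_pred` field name, the label strings, and the `blend` helper are all assumptions made up for this example, and the actual output field and labels of a classifier may differ.

```python
from collections import Counter

# Hypothetical annotated records. "quality_pred" stands in for whatever
# output field a classifier writes; the real field name may differ.
docs = [
    {"text": "A clear, well-sourced explanation of gradient descent.", "quality_pred": "High"},
    {"text": "Some notes, partially complete.", "quality_pred": "Medium"},
    {"text": "click here click here click here", "quality_pred": "Low"},
    {"text": "A detailed tutorial on tokenizers.", "quality_pred": "High"},
]


def blend(records, label_field, keep_labels):
    """Keep only the records whose predicted label is in keep_labels."""
    return [r for r in records if r[label_field] in keep_labels]


# Inspect the label distribution before deciding what to keep.
label_counts = Counter(d["quality_pred"] for d in docs)

# Build a blend from the high- and medium-quality documents only.
blended = blend(docs, "quality_pred", {"High", "Medium"})

print(label_counts)
print(len(blended))
```

The same pattern extends to other annotations, for example selecting a target share of documents per predicted domain.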
Advanced users may want to integrate their own Hugging Face classifier(s) into NeMo Curator. Broadly, this requires creating a CompositeStage consisting of a CPU-based tokenizer stage and a GPU-based model inference stage. Refer to the Text Classifiers README for details about how to do this.
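The two-stage structure described above can be pictured with the following toy sketch. It is not NeMo Curator's API: `TokenizerStage`, `InferenceStage`, `ToyCompositeStage`, and `toy_model` are hypothetical names written for this illustration, the tokenizer is a simple whitespace split, and the "model" is a stand-in function rather than a Hugging Face classifier running on a GPU.

```python
from typing import Callable, List


class TokenizerStage:
    """Stand-in for the CPU-based step: turn raw text into token lists."""

    def process(self, texts: List[str]) -> List[List[str]]:
        # Toy whitespace tokenizer; a real stage would call a model tokenizer.
        return [text.lower().split() for text in texts]


class InferenceStage:
    """Stand-in for the GPU-based step: map each token list to a label."""

    def __init__(self, model: Callable[[List[str]], str]):
        self.model = model

    def process(self, token_batches: List[List[str]]) -> List[str]:
        return [self.model(tokens) for tokens in token_batches]


class ToyCompositeStage:
    """Run the wrapped stages in order, feeding each output to the next."""

    def __init__(self, stages):
        self.stages = stages

    def process(self, data):
        for stage in self.stages:
            data = stage.process(data)
        return data


def toy_model(tokens: List[str]) -> str:
    # Hypothetical classifier: label a document by its token count.
    return "long" if len(tokens) > 3 else "short"


pipeline = ToyCompositeStage([TokenizerStage(), InferenceStage(toy_model)])
labels = pipeline.process(["Hello world", "This sentence has five tokens"])
print(labels)
```

Keeping tokenization and inference as separate stages is what lets a framework schedule each on the hardware that suits it; the Text Classifiers README remains the reference for the actual stage classes and their signatures.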