Distributed Data Classification

This directory contains Jupyter notebook tutorials that demonstrate how to use the text classification models supported by NeMo Curator. These classifiers help with data annotation, which is useful in data blending for foundation model training.

Each of these classifiers is available on Hugging Face and can be run independently with the Transformers library. When run with NeMo Curator, the classifiers are accelerated by a heterogeneous pipeline in which tokenization runs across CPUs and model inference runs across GPUs. Each Jupyter notebook in this directory demonstrates how to run a classifier on text data, and the same approach scales easily to large amounts of data.
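As a rough illustration of the standalone route, the sketch below scores a few documents with the `HuggingFaceFW/fineweb-edu-classifier` checkpoint using only the Transformers library. The function names `clamp_to_int_score` and `score_documents` are hypothetical, and the assumptions that the model emits a single regression logit per document and that scores are read on a 0 to 5 scale come from that model's Hugging Face card, not from this README. Other classifiers in this directory may need different loading code, so treat this as a sketch rather than a recipe.

```python
def clamp_to_int_score(score: float) -> int:
    """Clamp a raw score to [0, 5] and round it to an integer.

    The 0-5 range is an assumption taken from the model card.
    """
    return int(round(max(0.0, min(score, 5.0))))


def score_documents(texts, model_id="HuggingFaceFW/fineweb-edu-classifier"):
    """Return one {'text', 'score', 'int_score'} dict per input document."""
    # Imported lazily so the helper above works without torch/transformers.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Tokenize the whole batch at once, padding to the longest document.
    inputs = tokenizer(
        list(texts), return_tensors="pt", padding="longest", truncation=True
    )

    # Assumption: the model returns one regression logit per document.
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(-1).float()

    return [
        {"text": text, "score": score, "int_score": clamp_to_int_score(score)}
        for text, score in zip(texts, logits.tolist())
    ]
```

Calling `score_documents(["Photosynthesis converts light into chemical energy."])` downloads the checkpoint on first use and runs on CPU unless the model and inputs are moved to a GPU. This single-process loop is the part that NeMo Curator replaces with its CPU tokenization and GPU inference pipeline.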

Before running any of these notebooks, see the Installation Guide for instructions on how to install NeMo Curator. Be sure to use an installation method that includes GPU dependencies.

For more information about the classifiers, refer to our Distributed Data Classification documentation page.

List of Classifiers

| NeMo Curator Classifier | Description | Hugging Face Page |
| --- | --- | --- |
| AegisClassifier | Identify and categorize unsafe content per document | nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 and nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0 |
| ContentTypeClassifier | Categorize the type-of-speech per document | nvidia/content-type-classifier-deberta |
| DomainClassifier | Categorize the domain per document | nvidia/domain-classifier |
| FineWebEduClassifier | Determine the educational value per document; this model was trained using annotations from Llama 3 70B-Instruct | HuggingFaceFW/fineweb-edu-classifier |
| FineWebMixtralEduClassifier | Determine the educational value per document; this model was trained using annotations from Mixtral 8x22B-Instruct | nvidia/nemocurator-fineweb-mixtral-edu-classifier |
| FineWebNemotronEduClassifier | Determine the educational value per document; this model was trained using annotations from Nemotron-4-340B-Instruct | nvidia/nemocurator-fineweb-nemotron-4-edu-classifier |
| InstructionDataGuardClassifier | Identify LLM poisoning attacks per document | nvidia/instruction-data-guard |
| MultilingualDomainClassifier | Categorize the domain per document; supports classification in 52 languages | nvidia/multilingual-domain-classifier |
| PromptTaskComplexityClassifier | Classifies text prompts across task types and complexity dimensions | nvidia/prompt-task-and-complexity-classifier |
| QualityClassifier | Categorize documents as high, medium, or low quality | quality-classifier-deberta |

Note that all classifiers support English text classification only, except the MultilingualDomainClassifier.

Bring Your Own Classifier

Advanced users may want to integrate their own Hugging Face classifier(s) into NeMo Curator. Broadly, this requires creating a CompositeStage consisting of a CPU-based tokenizer stage and a GPU-based model inference stage. Refer to the Text Classifiers README for details about how to do this.
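To make the two-stage idea concrete, here is a toy sketch in plain Python of a composite that decomposes into a CPU-tagged tokenizer stage and a GPU-tagged inference stage. It does not use the NeMo Curator API: `Stage`, `CompositeClassifier`, `decompose`, and `run` are hypothetical names chosen for illustration, the tokenizer is a whitespace split, and a simple length rule stands in for a real model. The Text Classifiers README remains the reference for the actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Batch = List[Dict]


@dataclass
class Stage:
    """Hypothetical minimal stage: a name, a resource tag, and a batch function."""
    name: str
    resource: str  # "cpu" or "gpu"; only a label in this toy example
    fn: Callable[[Batch], Batch]

    def process(self, batch: Batch) -> Batch:
        return self.fn(batch)


def tokenize(batch: Batch) -> Batch:
    # CPU-side work: attach tokens to each document (toy whitespace tokenizer).
    return [{**doc, "tokens": doc["text"].lower().split()} for doc in batch]


def infer(batch: Batch) -> Batch:
    # GPU-side work in a real pipeline; a toy length rule stands in for the model.
    return [
        {"text": doc["text"], "label": "long" if len(doc["tokens"]) > 5 else "short"}
        for doc in batch
    ]


def default_stages() -> List[Stage]:
    return [Stage("tokenizer", "cpu", tokenize), Stage("model", "gpu", infer)]


@dataclass
class CompositeClassifier:
    """Hypothetical composite that exposes its sub-stages in execution order."""
    stages: List[Stage] = field(default_factory=default_stages)

    def decompose(self) -> List[Stage]:
        return list(self.stages)


def run(composite: CompositeClassifier, batch: Batch) -> Batch:
    # A scheduler could place each stage on matching hardware; here they run in order.
    for stage in composite.decompose():
        batch = stage.process(batch)
    return batch
```

Keeping tokenization and inference as separate stages is what lets a scheduler give each one its own resources and batch sizes, so a custom classifier mainly needs to supply the two functions.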