README for tools-notes

This folder is not part of the website or production code. It is used to collect "background" notes on various tools, such as dataset tools from Hugging Face. See the enclosed folders for more details.

See also ../data-processing-notes.

Some tools are covered in dedicated folders, e.g., those that include sample Python scripts. Others are described here.

Presidio

Presidio from Microsoft is a context-aware, pluggable, and customizable SDK for data protection and de-identification of text and images.

Presidio is used in the processing pipeline for the PubMed Guidelines dataset from the EPFL LLM team. This dataset is used to train their Meditron models, which were trained with the team's fork of NVIDIA's Megatron-LM training library.
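
To make the idea of de-identification concrete, here is a minimal, standard-library-only sketch of the two-stage pattern that Presidio is organized around: an analysis step finds spans of sensitive text, and an anonymization step replaces them. This is not Presidio's API and not the EPFL pipeline. The `PATTERNS` table, the `Finding` class, and the `analyze` and `anonymize` functions are hypothetical names chosen for illustration, and the regular expressions are deliberately simplistic. Presidio itself exposes this split through its `AnalyzerEngine` and `AnonymizerEngine` classes and relies on much richer detection than regexes alone.

```python
import re
from dataclasses import dataclass

# Hypothetical, simplified patterns for illustration only. Real PII detection
# (as in Presidio) also draws on NER models, context words, and validation.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


@dataclass(frozen=True)
class Finding:
    """A detected span of sensitive text (hypothetical helper type)."""
    entity_type: str
    start: int
    end: int


def analyze(text):
    """Stage 1: return the sensitive spans found in text, ordered by position."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(Finding(entity_type, match.start(), match.end()))
    return sorted(findings, key=lambda f: f.start)


def anonymize(text, findings):
    """Stage 2: replace each span with a placeholder such as <EMAIL_ADDRESS>."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for f in sorted(findings, key=lambda f: f.start, reverse=True):
        text = text[:f.start] + f"<{f.entity_type}>" + text[f.end:]
    return text


sample = "Contact Jane at jane.doe@example.com or 212-555-0199."
print(anonymize(sample, analyze(sample)))
# prints: Contact Jane at <EMAIL_ADDRESS> or <PHONE_NUMBER>.
```

Keeping detection and replacement separate means the replacement policy (placeholder, mask, hash) can change without touching the detectors. Note that the name "Jane" is left in place here: catching person names generally needs a named-entity model rather than a regex, which is one reason to use a dedicated tool for real data.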