This folder is not part of the website or production code. It is used to collect "background" notes on various tools, such as dataset tools from Hugging Face. See the enclosed folders for more details.
See also ../data-processing-notes.
Some tools are covered with dedicated folders, e.g., those where sample Python scripts are provided. Others are described here.
Presidio from Microsoft is a context-aware, pluggable, and customizable data protection and de-identification SDK for text and images.
Presidio is used as part of the processing pipeline for the PubMed Guidelines dataset from the EPFL LLM team. That dataset was used to train their Meditron models, and the training itself used their fork of NVIDIA's Megatron-LM training library.
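To give a feel for the kind of text de-identification Presidio performs, here is a minimal sketch. It is not the EPFL pipeline's code. The helper name `deidentify` and the sample sentence are hypothetical, and the sketch assumes the `presidio-analyzer` and `presidio-anonymizer` packages are installed, along with a spaCy English model for the default NLP engine.

```python
def deidentify(text: str, language: str = "en") -> str:
    """Detect PII in `text` and replace each finding with a placeholder.

    Hypothetical helper for illustration only; it uses Presidio's default
    recognizers and default anonymization behavior.
    """
    # Imported lazily so this module still loads where Presidio is absent.
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine

    # The analyzer finds spans of text that look like PII (names, phones, ...).
    analyzer = AnalyzerEngine()
    findings = analyzer.analyze(text=text, language=language)

    # The anonymizer rewrites the text using those findings.
    anonymizer = AnonymizerEngine()
    return anonymizer.anonymize(text=text, analyzer_results=findings).text


if __name__ == "__main__":
    # Made-up sample input, not taken from any dataset.
    sample = "Patient John Smith can be reached at 212-555-5555."
    print(deidentify(sample))
```

With default settings, each detected span is replaced by a placeholder naming its entity type. A real pipeline would likely restrict the entity types and tune recognizers for its own data.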