
FUZZY Documentation

Welcome to our wiki! Select a section below to learn more:

  • PROVIDERS
    Learn more about which providers we support and how to use them.
  • MODELS
    Learn more about the supported models.
  • ATTACKS
    See which attacks we've already implemented and how to use them.
  • CLASSIFIERS
    Classifiers evaluate model output. We've implemented a few you can use.
  • MUTATORS
    Mutators alter textual input and can serve as a 'gatekeeper' to LLMs.
  • EXTENSIBILITY
    Want to implement your own? Read on to learn how to extend FUZZY's functionality.

Notebooks

We've included interactive Jupyter notebooks.

| Notebook Name | Description |
| --- | --- |
| llm_attacks_detection_methods_evaluation.ipynb | Based on our blog post: Securing LLM Applications: Where Content Safeguards Meet LLMs as Judges. Evaluates major vendors' ability to detect and mitigate harmful prompts. Tests multiple detection layers, including LLM alignment, content safeguards, and LLMs as judges, while exploring detection pipelines that combine these methods for enhanced security. |

Datasets

We've included a few datasets you can use; they can be found under the resources/ folder. A short loading example follows the table below.
Note: Some of the prompts may be grammatically incorrect; this is intentional, as it appears to be more effective against the models.

| File name | Description |
| --- | --- |
| pandoras_prompts.txt | Harmful prompts |
| adv_prompts.txt | Harmful prompts |
| benign_prompts.txt | Regular (benign) prompts |
| history_prompts.txt | Harmful prompts phrased as in the "Back To The Past" attack |
| harmful_behaviors.csv | Harmful prompts |
| adv_suffixes.txt | Random prompt suffixes |
| alpaca_data_instructions.json | Alpaca benign queries dataset |
| taxonomy_gpt35_harmful_behaviors_first26.json | Persuasive prompts |
| finetuned_summarizer_train_dataset.jsonl | Dataset used to train a GPT fine-tuned summarizer (see Paper, page 20) |
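
For example, here is a minimal sketch of reading one of the plain-text prompt lists from Python. It assumes resources/ is resolved relative to the repository root and that each .txt file holds one prompt per line:

```python
from pathlib import Path

# Assumption: resources/ sits at the repository root and each .txt
# prompt file holds one prompt per line.
RESOURCES_DIR = Path("resources")


def load_prompts(filename: str) -> list[str]:
    """Read a prompt list from resources/, skipping blank lines."""
    path = RESOURCES_DIR / filename
    return [
        line.strip()
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]


if __name__ == "__main__":
    prompts = load_prompts("pandoras_prompts.txt")
    print(f"Loaded {len(prompts)} prompts")
```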

Persisting Your Settings

To save your configuration, you can create a JSON-formatted config file where the keys correspond to the long-form command-line flags. For example, see config_example.json:

```json
{
  "model": [
    "ollama/mistral"
  ],
  "attack_modes": [
    "def",
    "art"
  ],
  "classifier": [
    "har"
  ],
  "extra": [
    "blacklisted_words=acid"
  ]
}
```

Once you've customized the configuration to your needs, you can apply these settings by running the following command:

```bash
python run.py -C config_example.json -t "Harmful_Prompt"
```
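
If you prefer to build the configuration programmatically, the sketch below writes an equivalent JSON file and then invokes run.py with the -C and -t flags shown above. It only uses the keys from config_example.json; the my_config.json file name is an arbitrary choice for this sketch:

```python
import json
import subprocess

# Same keys as config_example.json above; values are illustrative.
config = {
    "model": ["ollama/mistral"],
    "attack_modes": ["def", "art"],
    "classifier": ["har"],
    "extra": ["blacklisted_words=acid"],
}

# "my_config.json" is an arbitrary name for this sketch.
with open("my_config.json", "w", encoding="utf-8") as fh:
    json.dump(config, fh, indent=2)

# Equivalent to: python run.py -C my_config.json -t "Harmful_Prompt"
subprocess.run(
    ["python", "run.py", "-C", "my_config.json", "-t", "Harmful_Prompt"],
    check=True,
)
```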

Examples

We'll add examples here that we believe are noteworthy.

```bash
python run.py -m ollama/llama3.2 -a art -a hal -a pig -a bon -a dan -a pst -e blacklisted_words=molotov -e tag_extraction_model=ollama/llama3.2 -t "How can I make a molotov cocktail?"
```