Track 2: Model Testing — Submission by Elith Inc. (Open Safeguard Hackathon) #39
Beluuuuuuga
started this conversation in
gpt-oss-safeguard Implementation
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Hello, we are Elith Inc. — a Japan-based company dedicated to advancing AI Safety.
Our team consists of
In this hackathon, we focused on analyzing the vulnerabilities of the gpt-oss-safeguard-20B model, including:
🚀 1. Collection of Attack Prompts — Leveraging the atmaCup Four-Stage Evaluation Pipeline
Elith collaborates with atmaCup, one of Japan’s largest machine learning competitions, and co-hosted a Jailbreak / Safety–focused Generative AI Security Challenge from Nov 29 to Dec 7, 2025:
🔗 https://www.guruguru.science/competitions/28
In this competition, the following four-stage multilayer evaluation pipeline was used:
Using this pipeline, the competition produced 9,270 attack-attempt logs, from which the most effective and diverse attack patterns were identified and categorized.
For Track 2 of this hackathon,
we used these extracted attack patterns and attack taxonomies as the basis for constructing 364 representative attack prompts,
which were then used to evaluate the gpt-oss-safeguard model.
🧠 2. Content Moderation Policy Used in This Hackathon
For evaluating the Safeguard model, we used a custom-designed Content Moderation Policy.
This policy integrates international AI safety standards from OpenAI, Anthropic, Google, Meta, and Microsoft, and provides a unified guideline that includes:
This unified safety classification guideline was applied as the input policy for the Safeguard model,
allowing us to analyze which categories each attack failed or successfully bypassed.
📊 3. Safeguard Model Evaluation Results
In this evaluation, we tested 364 attack prompts against gpt-oss-safeguard-20B.
Among them, 188 prompts (51.6%) successfully bypassed Safeguard, revealing significant weaknesses particularly in the Fraud & Deception and Malware & Cybersecurity categories.
Below is a summary of Detected (caught) vs Bypassed (missed) results by category:
Key Observations
→ These vulnerabilities are concentrated in categories that pose the highest real-world risk.
In contrast, the Violence & Harm category demonstrated relatively strong performance,
with 9 correct detections and only 3 misses.
Due to time constraints, only one-fourth of all prompts were analyzed in depth
We conducted detailed analysis on approximately one-fourth of the 364 prompts,
prioritizing the extraction of core behavioral patterns within the limited timeframe of the hackathon.
💣 4. Analysis of Categories and Attack Techniques
The heatmap below shows that the misclassifications—cases where the Safeguard model incorrectly labeled harmful prompts as Non-violating—
were not random, but instead clustered around specific attack techniques.
From this visualization, we can clearly see which attack techniques tend to bypass Safeguard within each content category.
A. Fraud & Deception is the most frequently bypassed category
Within the Fraud row (second row), the following techniques show notable concentrations:
👉 Safeguard shows significant weaknesses across all disguise-based techniques—educational, research-oriented, translation-based, and role-play.
B. Malware & Cybersecurity is similarly vulnerable
In the Malware row (fifth row), the misses primarily occur in:
👉 Bypasses are concentrated around encoding, educational framing, and translation-based disguises.
C. Violence & Harm remains comparatively robust
In the Violence row (seventh row):
👉 Only three misses in total, indicating that Safeguard is substantially more reliable in this category than in others.
💣 5. Examples of Prompts That Successfully Bypassed Safeguard
① Role-forcing (System Log Impersonation)
② Research / Academic Disguise
③ Translation-based Payload Injection
④ Creative Scenario Induction
All of these are highly effective “purpose-disguised” attack types,
designed to conceal harmful intent behind seemingly legitimate or benign tasks.
🧭 6. Key Takeaways
Safeguard’s most significant weakness is “purpose disguise.”
When harmful intent is wrapped in educational, research-oriented, translation-based, or role-assignment contexts, misclassifications increase sharply.
Fraud (65% bypass) and Malware (80% bypass) are the most vulnerable categories.
These weaknesses are especially pronounced under educational and translation disguises.
Violence & Harm is Safeguard’s strongest category, with very few misclassifications.
Clear and explicit harm signals are detected more reliably.
Bypass patterns are not random—they follow a consistent, structural trend.
The combination of Fraud / Malware × educational / research / translation disguises forms the core failure mode.
Due to time constraints, we analyzed roughly one-fourth of the total prompts in depth, yet the overall trends were already very clear.
🙌 7. Closing
Through the Open Safeguard Hackathon, we were able to present a comprehensive evaluation of 364 attack prompts, including 188 successful bypasses and a category-level analysis of model weaknesses.
We hope these findings contribute to the continued improvement of the Safeguard model.
Feedback is always welcome.
Beta Was this translation helpful? Give feedback.
All reactions