feat(skills): Create anti-hallucination-training skill for calibration and refusal data #308

@akaszubski

Summary

Create a skill documenting anti-hallucination training data generation and calibration workflows.

Context

ReAlign has an AntiHallucinationGenerator and calibration training capabilities, but neither is documented in a skill.

Implementation Approach

Create .claude/skills/anti-hallucination-training.md documenting:

Anti-Hallucination Data Types

  1. Appropriate Refusal

    • Refuse when asked about unknown facts
    • Example: "I don't have information about events after my training cutoff"
  2. Hedged Uncertainty

    • Express uncertainty for partial knowledge
    • Example: "I'm not certain, but I believe..."
  3. Confident Answers

    • Answer confidently for verified facts
    • Provides contrast for calibration
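To make the three data types concrete, here is a minimal sketch of one record per type serialized as JSONL. The `Example` dataclass and the field names `category`, `prompt`, and `response` are assumptions for illustration; the actual schema emitted by AntiHallucinationGenerator is not specified in this issue.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Example:
    """Hypothetical record layout; not the confirmed ReAlign schema."""
    category: str  # "refusal", "hedged", or "confident"
    prompt: str
    response: str


EXAMPLES = [
    # 1. Appropriate refusal: the fact is unknown to the model.
    Example(
        "refusal",
        "What happened in the news last week?",
        "I don't have information about events after my training cutoff.",
    ),
    # 2. Hedged uncertainty: partial knowledge, stated as such.
    Example(
        "hedged",
        "How many moons does Saturn have?",
        "I'm not certain, but I believe Saturn has more than 100 known moons.",
    ),
    # 3. Confident answer: verified fact, provides contrast for calibration.
    Example(
        "confident",
        "What is the chemical formula for water?",
        "The chemical formula for water is H2O.",
    ),
]


def to_jsonl(examples: list[Example]) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e)) for e in examples)


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```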

Generator Classes

Class                        Purpose                        File
AntiHallucinationGenerator   Generate refusal examples      src/realign/data/antihallucination_generator.py
RefusalPreferenceGenerator   Preference pairs for refusal   src/realign/data/refusal_preference_generator.py
CalibrationTrainer           Confidence calibration         src/realign/backends/olmo/calibration_trainer.py

Confidence Levels

CONFIDENCE_LEVELS = {
    "high": "I am confident that...",
    "medium": "I believe that...",
    "low": "I'm not certain, but...",
    "uncertain": "I don't know..."
}
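As a sketch of how these prefixes might be applied, the helper below picks a level from a numeric confidence and builds the response text. The `level_for` thresholds (0.9, 0.6, 0.3) and both function names are assumptions for illustration, not values taken from ReAlign.

```python
CONFIDENCE_LEVELS = {
    "high": "I am confident that...",
    "medium": "I believe that...",
    "low": "I'm not certain, but...",
    "uncertain": "I don't know...",
}


def level_for(confidence: float) -> str:
    """Map a probability in [0, 1] to a level (thresholds are assumed)."""
    if confidence >= 0.9:
        return "high"
    if confidence >= 0.6:
        return "medium"
    if confidence >= 0.3:
        return "low"
    return "uncertain"


def with_confidence(level: str, claim: str) -> str:
    """Prefix a claim with the phrase for the given confidence level."""
    stem = CONFIDENCE_LEVELS[level].rstrip(".")  # drop the trailing ellipsis
    if level == "uncertain":
        # No claim is made when the model does not know.
        return stem + "."
    return f"{stem} {claim}."


if __name__ == "__main__":
    print(with_confidence(level_for(0.95), "water boils at 100 °C at sea level"))
    print(with_confidence(level_for(0.10), "unused"))
```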

Training Data Mix

Recommended anti-hallucination mix:

  • 40% - Appropriate refusals (unknown facts)
  • 30% - Hedged uncertainty (partial knowledge)
  • 20% - Confident answers (verified facts)
  • 10% - Edge cases (ambiguous queries)
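For a given dataset size, the mix above translates into per-category counts. The sketch below uses integer percentages and the largest-remainder method so the counts always sum to the requested total. The category keys and the `allocate` function are hypothetical names, not part of ReAlign.

```python
# Recommended mix from this issue, as integer percentages.
MIX_PERCENT = {
    "refusal": 40,    # appropriate refusals (unknown facts)
    "hedged": 30,     # hedged uncertainty (partial knowledge)
    "confident": 20,  # confident answers (verified facts)
    "edge_case": 10,  # edge cases (ambiguous queries)
}


def allocate(total: int, mix: dict[str, int] = MIX_PERCENT) -> dict[str, int]:
    """Split `total` examples across categories using largest remainders."""
    if sum(mix.values()) != 100:
        raise ValueError("mix percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    remainders = {name: total * pct % 100 for name, pct in mix.items()}
    leftover = total - sum(counts.values())
    # Hand the leftover examples to the categories with the largest remainders.
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    print(allocate(5000))
```

With a total of 5000 this yields 2000 refusals, 1500 hedged, 1000 confident, and 500 edge cases.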

Commands

# Generate anti-hallucination data
python -m realign.data.antihallucination_generator \
  --input sft.jsonl \
  --output antihall.jsonl \
  --refusal-ratio 0.4 \
  --uncertainty-ratio 0.3

# Generate calibration data
python scripts/calibration_generator.py \
  --count 5000 \
  --output calibration.jsonl
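After generating a file, it may be worth checking that the category shares match the requested ratios. The sketch below assumes each JSONL record carries a `category` field, which is a hypothetical schema; the real output format of the generator is not described in this issue.

```python
import json
from collections import Counter
from typing import Iterable


def category_shares(lines: Iterable[str], field: str = "category") -> dict[str, float]:
    """Return the fraction of records per category in a JSONL stream.

    `field` is an assumed key name; records without it count as "unknown".
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        counts[record.get(field, "unknown")] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    # antihall.jsonl is the output path used in the command above.
    with open("antihall.jsonl", encoding="utf-8") as fh:
        for name, share in sorted(category_shares(fh).items()):
            print(f"{name}: {share:.1%}")
```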

Evaluation

# Check TruthfulQA score (target: ≥60%)
realign evaluate --model your-model --benchmark truthfulqa

# Check calibration ECE (target: ≤0.10)
realign evaluate --model your-model --benchmark calibration
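For reference, expected calibration error (ECE) is commonly computed by grouping predictions into equal-width confidence bins and taking the size-weighted average of |accuracy - mean confidence| over the bins. The sketch below is a plain-Python version of that standard definition, using half-open bins with a confidence of 1.0 placed in the last bin. Whether `realign evaluate --benchmark calibration` computes ECE this way, or with 10 bins, is an assumption.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE for confidences in [0, 1]."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Bin i covers [i / n_bins, (i + 1) / n_bins); 1.0 falls in the last bin.
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's gap by the share of predictions it holds.
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


if __name__ == "__main__":
    confs = [0.9] * 10
    hits = [True] * 5 + [False] * 5
    print(expected_calibration_error(confs, hits))
```

A model that claims 90% confidence but is right only half the time scores about 0.4 here, well above the ≤0.10 target.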

Acceptance Criteria

  • All generator classes documented
  • Confidence levels defined
  • Training data mix ratios specified
  • CLI commands provided
  • Evaluation benchmarks documented

Related
