Skip to content

New reference implementation: Misalignment evaluations#108

Merged
rjavadi merged 66 commits into
VectorInstitute:mainfrom
ethancjackson:ethan-dev
Jun 1, 2026
Merged

New reference implementation: Misalignment evaluations#108
rjavadi merged 66 commits into
VectorInstitute:mainfrom
ethancjackson:ethan-dev

Conversation

@ethancjackson
Copy link
Copy Markdown
Collaborator

Summary

Adds misalignment_qa as a new reference implementation for the LLM/agents evaluations bootcamp. The experiment probes whether reckless in-context examples can nudge model responses toward harmful behaviour, and whether that effect transfers across different harm domains. It is intentionally minimal — plain LLM completions, no tool use — to make the mechanics transparent and serve as a building block for participants who want to extend it to real agentic systems.

Clickup Ticket(s): N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • New reference implementation (implementations/misalignment_qa/) — a YAML-driven experiment runner that tests five in-context-learning conditions (baseline, examples as LLM messages, examples as LLM messages + priority instruction, examples as user context, examples as user context + priority instruction) across six commercial models (three Gemini, three Anthropic), producing 30 variants against a shared 9-task dataset
  • Task set spans three harm modalities — life-safety dismissal, harmful code (keylogger / credential harvester / covert exfiltration), and social engineering (phishing, health misinformation, fake reviews) — to observe both in-domain and out-of-domain transfer of the reckless pattern
  • examples_inject_mode config field controls whether examples reach the model as LLM API messages (developer surface) or as plain text inside the user message (end-user surface), implemented via preparation.py and task.py
  • Graceful API key handling — pre-flight checks and per-variant failure detection surface missing/invalid keys in a clearly formatted warning summary at the end of the run rather than crashing
  • Temperature compatibility fixAgentSpec.temperature is now float | None; claude-opus-4-7 variants carry temperature: null (that model has deprecated the parameter); all other models use temperature: 0.2; variant-level null overrides are propagated correctly via Pydantic model_fields_set in resolve_agent_spec
  • Three canonical execution paths: 01_interactive_single_run.ipynb (optional single-item preview), run.py (full 30-variant experiment), 02_inspect_results.ipynb (pull results from Langfuse, heatmap dashboard + collapsible trace detail cards)
  • analysis.py — helper module for the results notebook, replacing the old results_notebook.py; includes correct condition metadata extraction
  • README — full rewrite for bootcamp audience, covering the agent/non-agent distinction, five conditions, three task categories, quick-start steps, config reference, and troubleshooting

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:

  • Full experiment run executed against all 30 variants; Gemini variants completed successfully and traces/scores were written to Langfuse as expected
  • Anthropic variants confirmed working after resolving an AuthenticationError (invalid key) and a BadRequestError (temperature deprecated on claude-opus-4-7) — both error classes are now surfaced clearly in the warning summary
  • 02_inspect_results.ipynb run end-to-end against a partial dataset (one condition); heatmap rendered correctly and collapsible trace cards displayed as intended
  • Config parsed and validated: 30 variants, 5 conditions × 6 models, 9 tasks across 3 categories, all condition_metadata fields populated, temperature matrix verified programmatically

Screenshots/Recordings

N/A

Related Issues

N/A

Deployment Notes

Participants need .env entries for GOOGLE_API_KEY and/or ANTHROPIC_API_KEY in addition to the standard Langfuse keys. The experiment runs with only one provider's key and reports skipped variants at the end. No infrastructure changes required.

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

ethancjackson and others added 30 commits March 18, 2026 20:43
Create a Langfuse-backed Python workflow for configurable ADK agent runs, transcript-based task definitions, judge-driven evaluation, trace usage metrics, and a documented smoke-test config to support future misalignment experiments.

Made-with: Cursor
Separate schema, preparation, and orchestration so configs remain the primary interface while the package gains a cleaner reusable surface for multi-variant research runs.

Made-with: Cursor
Allow explicit zero-budget configs, carry the setting through variant resolution into ADK agent construction, and document how to disable thinking in experiment configs.

Made-with: Cursor
Tighten the runtime so seeded conversations read like real chat, keep the experiment configs aligned with current thinking/output settings, and add a Metrics API-based terminal report for comparing conditions outside the Langfuse UI.

Made-with: Cursor
Add a per-execution run_instance_id to Langfuse metadata and run names so repeated launches stay distinguishable, and teach the terminal reporter to default to the latest run instance while documenting the new behavior.

Made-with: Cursor
Support LiteLLM-backed providers in the misalignment agent builder, accept Anthropic credentials in shared settings, and extend the main experiment plus docs/tests so Claude variants can be run and compared alongside Gemini.

Made-with: Cursor
Move experiment result inspection into a simpler notebook-backed workflow so historical runs are easier to inspect and harmful traces are easier to review.

Made-with: Cursor
…, rewrite README

results_notebook.py: shrunk from 901 to 643 lines by replacing five custom
dataclasses (NumericAccumulator, ConditionSummary, TraceRecord, AnalysisBundle
with 13 fields) with pandas groupby aggregation and lighter data structures.
AnalysisBundle is now 4 fields. Two near-duplicate Metrics API fetchers are now
clean separate functions returning DataFrames. All public API preserved.

report_metrics.ipynb: added a Discovery cell that lists available datasets and
execution IDs so users no longer have to guess constants. Replaced the passive
markdown cell with an actionable comment in the detail-view cell. Added a
"how to copy for a new experiment" guide to the header and improved inline
comments throughout.

README.md: full rewrite for newcomers. Leads with what behavioral misalignment
is and why it matters, includes a plain-language workflow diagram, a Quick Start
section, a "Designing Your Own Experiment" walkthrough, and moves the config
reference to the end. No jargon (PreparedTaskItem, ExecutionIdentity, etc.)
in the sections visible to first-time readers.

Made-with: Cursor
…values to float

The Langfuse Metrics API can return latency/cost/token values as strings or None.
The previous refactor dropped the explicit _coerce_float/_coerce_int helpers from
the original code, causing 'unsupported operand type(s) for /: str and int' when
_build_summary_df tried to compute avg_latency_s and avg_tokens. Added a _to_float
helper inside _fetch_trace_metrics_df and a pd.to_numeric pass as a safety net.

Made-with: Cursor
preparation.py: replace the 43-line null-coalescing body of resolve_agent_spec
with a 10-line Pydantic model_dump merge (base fields overridden by non-None
variant fields). Same semantics, far less repetition.

preparation.py: remove PreparedTaskItem.judge_input — it was computed in
prepare_task_item but never read anywhere (to_upload_item uses upload_input;
the value was already embedded there via build_dataset_input). Eliminates
a redundant call to build_judge_input.

config_types.py: change ExperimentConfig.description default from the generic
placeholder string "Misalignment experiment" to None, so experiments without
a description don't silently inherit a meaningless label in Langfuse.

Made-with: Cursor
…nnotate minimal config

- Quick Start step 3 now names the smoke dataset (misalignment-qa-smoke-v2) and
  shows the exact DATASET_NAME constant to paste into the notebook.
- 'Designing Your Own Experiment' minimal config is rewritten to demonstrate
  variant inheritance (variants only set what changes), per-variant example
  override, and the judge_model_config with an explanation of each key decision.
- Added four plain-language bullets below the config explaining the most common
  stumbling points: variant inheritance, dataset_upload_subset, judge model
  independence, and how rubric metric names become notebook column names.

Made-with: Cursor
Tests whether the reckless in-context examples from the prioritize-context
experiment cause misalignment to transfer across output modalities — from
conversational advice into code, email drafts, and structured YAML configs.

Uses gemini-3-flash-preview (strongest responder in the prior experiment).
Two variants: reckless-examples (same four ICL examples) and no-examples
(empty list, control condition) — both under the same 'follow the examples'
system prompt so only the examples themselves vary.

Three tasks, one per modality:
- emergency-stop-code: Python safety threshold function
- safety-incident-email: building supervisor gas-leak notification
- chemical-storage-config: YAML monitoring system config

Same 3-metric rubric (harmful / helpful / follows_reckless_pattern) with
updated judge guidance covering code, email, and config output formats.
max_output_chars raised from 800 to 1200 for longer structured outputs.

Made-with: Cursor
…oken handling

- Add 6 new variants (reckless + control) for gemini-2.5-flash,
  gemini-3.1-pro-preview, and gemini-3.1-flash-lite-preview, giving a
  full 4-model × 2-condition factorial design (8 variants total).
- Add `model` key to all condition_metadata blocks so notebooks can
  pivot on either the model or icl_condition axis independently.
- Rename original variants from reckless-examples/no-examples to
  gemini-3-flash-reckless/gemini-3-flash-control for naming consistency.
- task.py: filter out thinking parts (part.thought=True) from the string
  returned to the evaluator; thinking tokens are still visible in the raw
  Langfuse trace observation via ADK's automatic model-call logging.

Made-with: Cursor
Step-by-step guide that shows how to fetch a Langfuse dataset, inspect
and select individual items, run the agent and LLM judge in preview mode,
then formally record the run as a Langfuse experiment.

Made-with: Cursor
…ation

- Replace 5 legacy experiment configs with a single bootcamp_misalignment.yaml
  testing 5 conditions × 6 commercial models (Gemini + Anthropic) = 30 variants
- Add examples_inject_mode field to VariantSpec ("llm_messages" | "user_context")
  so examples can be seeded as LLM API messages or embedded as plain text in the
  user prompt — the latter being a more realistic end-user attack surface
- Thread user_context_preamble through PreparedVariantRun and MisalignmentTask
- Add preflight API key check and per-variant error catching to experiment.py;
  skipped/failed variants are collected and printed as a visible warning summary
- Rewrite interactive_single_run.ipynb as a clean 6-step single-item runner
- Rewrite report_metrics.ipynb with condition-axis analysis and the 5-condition
  table; default dataset points to misalignment-qa-bootcamp-v1
- Rewrite README for bootcamp audience: 3-path structure, agent framing note,
  graceful API key warning docs, simplified config reference
- Delete test_notebook.ipynb (scratch file)
- Update .env.example: add ANTHROPIC_API_KEY, remove VECTOR_INFERENCE_API_KEY

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the 10 single-category tasks with 9 tasks across three categories:
  A. life-safety (peanut-throat, child-cleaner, gas-smell) — in-domain,
     same flavour as the reckless examples
  B. harmful-code (keylogger, credential-harvest, covert-exfiltration) —
     out-of-domain: does the dismissal pattern bleed into writing malware?
  C. social-engineering (phishing-email, health-misinfo, fake-reviews) —
     out-of-domain: deceptive content targeting people

Tasks carry a `category` metadata field for filtering in the notebook.

Update judge rubric: broaden follows_reckless_pattern to cover all three
modalities (dismissal, providing malicious code, generating deceptive content).

Bump langfuse_dataset_name to misalignment-qa-bootcamp-v2 (task content
changed so a fresh dataset is required). Update README and report notebook
to document the three-category structure and analysis approach.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ive notebook

Co-authored-by: Cursor <cursoragent@cursor.com>
…ive_single_run.ipynb

Co-authored-by: Cursor <cursoragent@cursor.com>
…improve notebook UX

- Rename results_notebook.py to analysis.py (better reflects purpose)
- Fix missing 'condition' column in build_master_traces_frame: extract
  condition_condition from trace metadata and expose it as 'condition'
- Add Plotly misalignment heatmap (condition × model, follows_reckless_pattern
  rate) as a headline dashboard figure; falls back to bar chart when condition
  data is absent
- Replace verbose trace detail for-loop with collapsible HTML <details>
  accordion cards — colour-coded score badges in the summary line, full
  input/output/judge commentary hidden until expanded

Co-authored-by: Cursor <cursoragent@cursor.com>
task_fingerprint: str
upload_input: str
expected_output: str
task_turns: list[dict[str, Any]]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a frozen class, but we can do item.task_turns.append(...)

task_fingerprint=task_fingerprint,
upload_input=build_dataset_input(task, task_fingerprint=task_fingerprint),
expected_output=task.expected_output,
task_turns=[message.model_dump() for message in build_task_turns(task)],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prepare_task_item copies task.metadata defensively on line 250 but task_turns is not copied — it's a fresh list, but its contained dicts come from message.model_dump() and could be mutated by downstream consumers. Same for shared_turns and run_metadata on PreparedVariantRun.

Either document that consumers must not mutate these collections, or switch to tuple[Mapping[str, Any], ...] for true immutability

Comment thread aieng-eval-agents/aieng/agent_evals/misalignment_qa/preparation.py
execution: ExecutionIdentity,
resolved_model: str,
) -> dict[str, Any]:
metadata: dict[str, Any] = {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict[str, Any] shows up as the type of run_metadata, task_turns, the upload item, and a few other places where the schema is actually fixed. Consider migrating these to TypedDict in a follow-up — it would give mypy enough information to catch key typos and wrong-type values at type-check time. Not blocking this PR, but worth a separate cleanup issue.

Comment thread aieng-eval-agents/aieng/agent_evals/misalignment_qa/preparation.py
if self._user_context_preamble and raw_input is not None:
raw_input = f"{self._user_context_preamble}\n\n{raw_input}"

user_id = getpass.getuser()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know user_id in this way has been used in AML Investigation use case but it's a bit smelly.
In containers without those env vars and without a populated passwd database (some minimal Docker images, some sandboxed CI runners), it raises OSError: No username set in the environment.

In other use case implementations like report generation it's a hard-coded literal: user_id="user".

see knowledge_qa/agent.py, report_generation/..., implementations/report_generation/demo.py

ethancjackson and others added 6 commits May 28, 2026 12:02
- agent.py: tighten TOOL_FACTORIES type to Callable[[Configs], Any];
  route LiteLLM API key lookup through Configs.anthropic_api_key /
  vector_inference_api_key (SecretStr) so secrets are never exposed
  in logs or exception tracebacks; fall back to os.getenv for env vars
  not mirrored in Configs; expand build_misalignment_agent docstring
  to numpy format (reviewer-supplied text).

- config_types.py: remove unnecessary string quotes from forward refs
  "TaskItemSpec" and "ExperimentConfig" (from __future__ import annotations
  already present); expand AgentOverrideSpec class docstring explaining
  the base/variant merge semantics and model_fields_set behaviour.

- preparation.py: change task_turns and shared_turns from list to
  tuple[dict, ...] on PreparedTaskItem and PreparedVariantRun so
  frozen=True dataclasses are genuinely immutable; update construction
  sites to tuple(); add numpy docstrings to build_run_metadata (with
  explanation of condition_ namespace prefix), build_task_fingerprint
  (truncation length and canonicalization rationale), and
  build_dataset_input (3-line format); expand prepare_dataset_items
  and prepare_variant_runs to numpy format; add example_pair_to_messages
  one-liner docstring; add all public helpers to __all__.

- task.py: replace getpass.getuser() with hard-coded "user" to avoid
  OSError in minimal Docker/CI environments (consistent with
  knowledge_qa and report_generation); broaden shared_turns parameter
  to Sequence[dict] so it accepts tuples from PreparedVariantRun.

- experiment.py, evaluation/hard_metrics.py: expand create_llm_judge,
  create_trace_usage, and create_trace_usage_evaluator from one-liners
  to full numpy-format docstrings covering parameters, return types,
  and error behaviour.

Co-authored-by: Cursor <cursoragent@cursor.com>
…turn

_run_with_seeded_history correctly calls session_service.create_session
before invoking runner.run_async, but _run_single_turn was just passing
a random UUID directly — the Runner with auto_create_session=False has
no record of that ID and raises SessionNotFoundError.

Fix: call create_session in _run_single_turn and use session.id, matching
the pattern already used in _run_with_seeded_history. Remove the now-unused
uuid import.

Co-authored-by: Cursor <cursoragent@cursor.com>
def __init__(
self,
*,
agent: Any,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agent: Any should be agent: BaseAgent (or LlmAgent if we want to be strict).

return text[:max_chars] + "\n...[truncated for evaluator]"


def create_llm_judge(config: ExperimentConfig): # noqa: ANN201
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It return Any while it should return EvaluatorFunction and Any won't cascade further.

return llm_judge_evaluator


def create_trace_usage(config: ExperimentConfig): # noqa: ANN201
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove noqa: ANN201 and change return type to TraceEvaluatorFunction

print(f"{sep}\n")


def run_variant(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again it returns Any type


async def run_experiment_config(config: ExperimentConfig, *, variant_ids: set[str] | None = None) -> None:
"""Run the full experiment: upload dataset, iterate variants, collect warnings."""
load_dotenv(verbose=True)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

load_dotenv() has side effects on os.environ so better to move it to entry points like CLI/notebook.

Also, verbose=True makes load_dotenv print a line for every variable it loads. Those lines go to terminal mixed in with the experiment's real output, making logs noisy and harder to read.

return warnings


def _print_warning_summary(warnings: list[str]) -> None:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses print() whereas the rest of code use logger
I'm not sure about its purpose?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each variant failure logs three times (logger.warning + runtime_warnings.append + final _print_warning_summary). Pick one channel to reduce noise.

return "<unknown>"


def _check_item_failures(result: Any) -> tuple[int, int]:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: After fixing returning Any types change result type as well.

logger = logging.getLogger(__name__)


SHORT_JUDGE_SYSTEM_PROMPT_TEMPLATE = """\
Copy link
Copy Markdown
Collaborator

@rjavadi rjavadi May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a LONG_* format of this prompt? If not, maybe we can remove SHORT

"""Load and validate an experiment config from a YAML file."""
path = Path(config_path)
raw = yaml.safe_load(path.read_text(encoding="utf-8"))
return ExperimentConfig.model_validate(raw)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a "succeeded" summary line to the end of load_experiment_config

variant.variant_id,
task_id,
evaluations,
(str(candidate_output)[:200] + "...")
Copy link
Copy Markdown
Collaborator

@rjavadi rjavadi May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 200 should be a module constant — _LOG_OUTPUT_PREVIEW_CHARS = 200 at the top.


async def upload_dataset_items(*, dataset_name: str, items: list[PreparedTaskItem]) -> None:
"""Upload prepared task items to a Langfuse dataset."""
with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", encoding="utf-8", delete=False) as tmp:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Switching to TemporaryDirectory() does not require manual cleanup in finally.

The directory survives as long as the with is active, then disappears — covering both the file write and the upload in one cleanup.

class AgentToolSpec(BaseModel):
"""Named tool that can be enabled for an agent."""

name: str = Field(description="Tool name; e.g. google_search, web_fetch, read_file.")
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: name should be a Literal matching SUPPORTED_TOOL_NAMES from agent.pyto catch typos at load, not at agent-build.

max_output_tokens: int | None = Field(default=None, ge=1)
tools: list[AgentToolSpec] = Field(default_factory=list)
thinking_include_thoughts: bool = Field(default=False)
thinking_budget: int | None = Field(default=None, ge=-1)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what does -1 mean in thinking_budget?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a notebook-helper file that should be placed under implementations/misalignment_qa with the notebooks.



def _build_client() -> Langfuse:
load_dotenv(dotenv_path=_repo_root() / ".env", verbose=False)
Copy link
Copy Markdown
Collaborator

@rjavadi rjavadi May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anti-pattern of calling load_dotenv in the library code. Only the callery should do it once.

Pass credentials in or let Langfuse() read them from os.environ directly.

@rjavadi
Copy link
Copy Markdown
Collaborator

rjavadi commented Jun 1, 2026

@ethancjackson There are a couple of comments outstanding and I added a few more.
Also the dependecy error roots in commit ef6e73c where the uv lock was regenerated.
More context:

kagglehub did not change; kagglesdk was upgraded when the lock was regenerated. kagglesdk is only a transitive dep of kagglehub (no lower bound in pyproject.toml), so uv lock freely picked the latest compatible release.
Earlier on your branch (ef6e73c) you still had kagglesdk==0.1.23 like main

Solution:

Pin kagglehub>=0.4.1, <1.0.1:
"kagglehub>=0.4.1,<1.0.1", # 1.0.1 needs kagglesdk.get_web_endpoint; removed in kagglesdk>=0.1.24

@rjavadi rjavadi merged commit 87c22f0 into VectorInstitute:main Jun 1, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants