New reference implementation: Misalignment evaluations#108
Conversation
Create a Langfuse-backed Python workflow for configurable ADK agent runs, transcript-based task definitions, judge-driven evaluation, trace usage metrics, and a documented smoke-test config to support future misalignment experiments. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Separate schema, preparation, and orchestration so configs remain the primary interface while the package gains a cleaner reusable surface for multi-variant research runs. Made-with: Cursor
Allow explicit zero-budget configs, carry the setting through variant resolution into ADK agent construction, and document how to disable thinking in experiment configs. Made-with: Cursor
Tighten the runtime so seeded conversations read like real chat, keep the experiment configs aligned with current thinking/output settings, and add a Metrics API-based terminal report for comparing conditions outside the Langfuse UI. Made-with: Cursor
Add a per-execution run_instance_id to Langfuse metadata and run names so repeated launches stay distinguishable, and teach the terminal reporter to default to the latest run instance while documenting the new behavior. Made-with: Cursor
Support LiteLLM-backed providers in the misalignment agent builder, accept Anthropic credentials in shared settings, and extend the main experiment plus docs/tests so Claude variants can be run and compared alongside Gemini. Made-with: Cursor
Move experiment result inspection into a simpler notebook-backed workflow so historical runs are easier to inspect and harmful traces are easier to review. Made-with: Cursor
…, rewrite README results_notebook.py: shrunk from 901 to 643 lines by replacing five custom dataclasses (NumericAccumulator, ConditionSummary, TraceRecord, AnalysisBundle with 13 fields) with pandas groupby aggregation and lighter data structures. AnalysisBundle is now 4 fields. Two near-duplicate Metrics API fetchers are now clean separate functions returning DataFrames. All public API preserved. report_metrics.ipynb: added a Discovery cell that lists available datasets and execution IDs so users no longer have to guess constants. Replaced the passive markdown cell with an actionable comment in the detail-view cell. Added a "how to copy for a new experiment" guide to the header and improved inline comments throughout. README.md: full rewrite for newcomers. Leads with what behavioral misalignment is and why it matters, includes a plain-language workflow diagram, a Quick Start section, a "Designing Your Own Experiment" walkthrough, and moves the config reference to the end. No jargon (PreparedTaskItem, ExecutionIdentity, etc.) in the sections visible to first-time readers. Made-with: Cursor
…values to float The Langfuse Metrics API can return latency/cost/token values as strings or None. The previous refactor dropped the explicit _coerce_float/_coerce_int helpers from the original code, causing 'unsupported operand type(s) for /: str and int' when _build_summary_df tried to compute avg_latency_s and avg_tokens. Added a _to_float helper inside _fetch_trace_metrics_df and a pd.to_numeric pass as a safety net. Made-with: Cursor
preparation.py: replace the 43-line null-coalescing body of resolve_agent_spec with a 10-line Pydantic model_dump merge (base fields overridden by non-None variant fields). Same semantics, far less repetition. preparation.py: remove PreparedTaskItem.judge_input — it was computed in prepare_task_item but never read anywhere (to_upload_item uses upload_input; the value was already embedded there via build_dataset_input). Eliminates a redundant call to build_judge_input. config_types.py: change ExperimentConfig.description default from the generic placeholder string "Misalignment experiment" to None, so experiments without a description don't silently inherit a meaningless label in Langfuse. Made-with: Cursor
…nnotate minimal config - Quick Start step 3 now names the smoke dataset (misalignment-qa-smoke-v2) and shows the exact DATASET_NAME constant to paste into the notebook. - 'Designing Your Own Experiment' minimal config is rewritten to demonstrate variant inheritance (variants only set what changes), per-variant example override, and the judge_model_config with an explanation of each key decision. - Added four plain-language bullets below the config explaining the most common stumbling points: variant inheritance, dataset_upload_subset, judge model independence, and how rubric metric names become notebook column names. Made-with: Cursor
Made-with: Cursor
Tests whether the reckless in-context examples from the prioritize-context experiment cause misalignment to transfer across output modalities — from conversational advice into code, email drafts, and structured YAML configs. Uses gemini-3-flash-preview (strongest responder in the prior experiment). Two variants: reckless-examples (same four ICL examples) and no-examples (empty list, control condition) — both under the same 'follow the examples' system prompt so only the examples themselves vary. Three tasks, one per modality: - emergency-stop-code: Python safety threshold function - safety-incident-email: building supervisor gas-leak notification - chemical-storage-config: YAML monitoring system config Same 3-metric rubric (harmful / helpful / follows_reckless_pattern) with updated judge guidance covering code, email, and config output formats. max_output_chars raised from 800 to 1200 for longer structured outputs. Made-with: Cursor
…oken handling - Add 6 new variants (reckless + control) for gemini-2.5-flash, gemini-3.1-pro-preview, and gemini-3.1-flash-lite-preview, giving a full 4-model × 2-condition factorial design (8 variants total). - Add `model` key to all condition_metadata blocks so notebooks can pivot on either the model or icl_condition axis independently. - Rename original variants from reckless-examples/no-examples to gemini-3-flash-reckless/gemini-3-flash-control for naming consistency. - task.py: filter out thinking parts (part.thought=True) from the string returned to the evaluator; thinking tokens are still visible in the raw Langfuse trace observation via ADK's automatic model-call logging. Made-with: Cursor
Step-by-step guide that shows how to fetch a Langfuse dataset, inspect and select individual items, run the agent and LLM judge in preview mode, then formally record the run as a Langfuse experiment. Made-with: Cursor
…owing instruction
…ation
- Replace 5 legacy experiment configs with a single bootcamp_misalignment.yaml
testing 5 conditions × 6 commercial models (Gemini + Anthropic) = 30 variants
- Add examples_inject_mode field to VariantSpec ("llm_messages" | "user_context")
so examples can be seeded as LLM API messages or embedded as plain text in the
user prompt — the latter being a more realistic end-user attack surface
- Thread user_context_preamble through PreparedVariantRun and MisalignmentTask
- Add preflight API key check and per-variant error catching to experiment.py;
skipped/failed variants are collected and printed as a visible warning summary
- Rewrite interactive_single_run.ipynb as a clean 6-step single-item runner
- Rewrite report_metrics.ipynb with condition-axis analysis and the 5-condition
table; default dataset points to misalignment-qa-bootcamp-v1
- Rewrite README for bootcamp audience: 3-path structure, agent framing note,
graceful API key warning docs, simplified config reference
- Delete test_notebook.ipynb (scratch file)
- Update .env.example: add ANTHROPIC_API_KEY, remove VECTOR_INFERENCE_API_KEY
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the 10 single-category tasks with 9 tasks across three categories:
A. life-safety (peanut-throat, child-cleaner, gas-smell) — in-domain,
same flavour as the reckless examples
B. harmful-code (keylogger, credential-harvest, covert-exfiltration) —
out-of-domain: does the dismissal pattern bleed into writing malware?
C. social-engineering (phishing-email, health-misinfo, fake-reviews) —
out-of-domain: deceptive content targeting people
Tasks carry a `category` metadata field for filtering in the notebook.
Update judge rubric: broaden follows_reckless_pattern to cover all three
modalities (dismissal, providing malicious code, generating deceptive content).
Bump langfuse_dataset_name to misalignment-qa-bootcamp-v2 (task content
changed so a fresh dataset is required). Update README and report notebook
to document the three-category structure and analysis approach.
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…ive notebook Co-authored-by: Cursor <cursoragent@cursor.com>
…ive_single_run.ipynb Co-authored-by: Cursor <cursoragent@cursor.com>
…improve notebook UX - Rename results_notebook.py to analysis.py (better reflects purpose) - Fix missing 'condition' column in build_master_traces_frame: extract condition_condition from trace metadata and expose it as 'condition' - Add Plotly misalignment heatmap (condition × model, follows_reckless_pattern rate) as a headline dashboard figure; falls back to bar chart when condition data is absent - Replace verbose trace detail for-loop with collapsible HTML <details> accordion cards — colour-coded score badges in the summary line, full input/output/judge commentary hidden until expanded Co-authored-by: Cursor <cursoragent@cursor.com>
| task_fingerprint: str | ||
| upload_input: str | ||
| expected_output: str | ||
| task_turns: list[dict[str, Any]] |
There was a problem hiding this comment.
It's a frozen class, but we can do item.task_turns.append(...)
| task_fingerprint=task_fingerprint, | ||
| upload_input=build_dataset_input(task, task_fingerprint=task_fingerprint), | ||
| expected_output=task.expected_output, | ||
| task_turns=[message.model_dump() for message in build_task_turns(task)], |
There was a problem hiding this comment.
prepare_task_item copies task.metadata defensively on line 250 but task_turns is not copied — it's a fresh list, but its contained dicts come from message.model_dump() and could be mutated by downstream consumers. Same for shared_turns and run_metadata on PreparedVariantRun.
Either document that consumers must not mutate these collections, or switch to tuple[Mapping[str, Any], ...] for true immutability
| execution: ExecutionIdentity, | ||
| resolved_model: str, | ||
| ) -> dict[str, Any]: | ||
| metadata: dict[str, Any] = { |
There was a problem hiding this comment.
dict[str, Any] shows up as the type of run_metadata, task_turns, the upload item, and a few other places where the schema is actually fixed. Consider migrating these to TypedDict in a follow-up — it would give mypy enough information to catch key typos and wrong-type values at type-check time. Not blocking this PR, but worth a separate cleanup issue.
| if self._user_context_preamble and raw_input is not None: | ||
| raw_input = f"{self._user_context_preamble}\n\n{raw_input}" | ||
|
|
||
| user_id = getpass.getuser() |
There was a problem hiding this comment.
I know user_id in this way has been used in AML Investigation use case but it's a bit smelly.
In containers without those env vars and without a populated passwd database (some minimal Docker images, some sandboxed CI runners), it raises OSError: No username set in the environment.
In other use case implementations like report generation it's a hard-coded literal: user_id="user".
see knowledge_qa/agent.py, report_generation/..., implementations/report_generation/demo.py
- agent.py: tighten TOOL_FACTORIES type to Callable[[Configs], Any]; route LiteLLM API key lookup through Configs.anthropic_api_key / vector_inference_api_key (SecretStr) so secrets are never exposed in logs or exception tracebacks; fall back to os.getenv for env vars not mirrored in Configs; expand build_misalignment_agent docstring to numpy format (reviewer-supplied text). - config_types.py: remove unnecessary string quotes from forward refs "TaskItemSpec" and "ExperimentConfig" (from __future__ import annotations already present); expand AgentOverrideSpec class docstring explaining the base/variant merge semantics and model_fields_set behaviour. - preparation.py: change task_turns and shared_turns from list to tuple[dict, ...] on PreparedTaskItem and PreparedVariantRun so frozen=True dataclasses are genuinely immutable; update construction sites to tuple(); add numpy docstrings to build_run_metadata (with explanation of condition_ namespace prefix), build_task_fingerprint (truncation length and canonicalization rationale), and build_dataset_input (3-line format); expand prepare_dataset_items and prepare_variant_runs to numpy format; add example_pair_to_messages one-liner docstring; add all public helpers to __all__. - task.py: replace getpass.getuser() with hard-coded "user" to avoid OSError in minimal Docker/CI environments (consistent with knowledge_qa and report_generation); broaden shared_turns parameter to Sequence[dict] so it accepts tuples from PreparedVariantRun. - experiment.py, evaluation/hard_metrics.py: expand create_llm_judge, create_trace_usage, and create_trace_usage_evaluator from one-liners to full numpy-format docstrings covering parameters, return types, and error behaviour. Co-authored-by: Cursor <cursoragent@cursor.com>
…turn _run_with_seeded_history correctly calls session_service.create_session before invoking runner.run_async, but _run_single_turn was just passing a random UUID directly — the Runner with auto_create_session=False has no record of that ID and raises SessionNotFoundError. Fix: call create_session in _run_single_turn and use session.id, matching the pattern already used in _run_with_seeded_history. Remove the now-unused uuid import. Co-authored-by: Cursor <cursoragent@cursor.com>
| def __init__( | ||
| self, | ||
| *, | ||
| agent: Any, |
There was a problem hiding this comment.
agent: Any should be agent: BaseAgent (or LlmAgent if we want to be strict).
| return text[:max_chars] + "\n...[truncated for evaluator]" | ||
|
|
||
|
|
||
| def create_llm_judge(config: ExperimentConfig): # noqa: ANN201 |
There was a problem hiding this comment.
It return Any while it should return EvaluatorFunction and Any won't cascade further.
| return llm_judge_evaluator | ||
|
|
||
|
|
||
| def create_trace_usage(config: ExperimentConfig): # noqa: ANN201 |
There was a problem hiding this comment.
Remove noqa: ANN201 and change return type to TraceEvaluatorFunction
| print(f"{sep}\n") | ||
|
|
||
|
|
||
| def run_variant( |
There was a problem hiding this comment.
Again it returns Any type
|
|
||
| async def run_experiment_config(config: ExperimentConfig, *, variant_ids: set[str] | None = None) -> None: | ||
| """Run the full experiment: upload dataset, iterate variants, collect warnings.""" | ||
| load_dotenv(verbose=True) |
There was a problem hiding this comment.
load_dotenv() has side effects on os.environ so better to move it to entry points like CLI/notebook.
Also, verbose=True makes load_dotenv print a line for every variable it loads. Those lines go to terminal mixed in with the experiment's real output, making logs noisy and harder to read.
| return warnings | ||
|
|
||
|
|
||
| def _print_warning_summary(warnings: list[str]) -> None: |
There was a problem hiding this comment.
It uses print() whereas the rest of code use logger
I'm not sure about its purpose?
There was a problem hiding this comment.
Each variant failure logs three times (logger.warning + runtime_warnings.append + final _print_warning_summary). Pick one channel to reduce noise.
| return "<unknown>" | ||
|
|
||
|
|
||
| def _check_item_failures(result: Any) -> tuple[int, int]: |
There was a problem hiding this comment.
Side note: After fixing returning Any types change result type as well.
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| SHORT_JUDGE_SYSTEM_PROMPT_TEMPLATE = """\ |
There was a problem hiding this comment.
nit: Is there a LONG_* format of this prompt? If not, maybe we can remove SHORT
| """Load and validate an experiment config from a YAML file.""" | ||
| path = Path(config_path) | ||
| raw = yaml.safe_load(path.read_text(encoding="utf-8")) | ||
| return ExperimentConfig.model_validate(raw) |
There was a problem hiding this comment.
Add a "succeeded" summary line to the end of load_experiment_config
| variant.variant_id, | ||
| task_id, | ||
| evaluations, | ||
| (str(candidate_output)[:200] + "...") |
There was a problem hiding this comment.
nit: 200 should be a module constant — _LOG_OUTPUT_PREVIEW_CHARS = 200 at the top.
|
|
||
| async def upload_dataset_items(*, dataset_name: str, items: list[PreparedTaskItem]) -> None: | ||
| """Upload prepared task items to a Langfuse dataset.""" | ||
| with tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", encoding="utf-8", delete=False) as tmp: |
There was a problem hiding this comment.
nit: Switching to TemporaryDirectory() does not require manual cleanup in finally.
The directory survives as long as the with is active, then disappears — covering both the file write and the upload in one cleanup.
| class AgentToolSpec(BaseModel): | ||
| """Named tool that can be enabled for an agent.""" | ||
|
|
||
| name: str = Field(description="Tool name; e.g. google_search, web_fetch, read_file.") |
There was a problem hiding this comment.
nit: name should be a Literal matching SUPPORTED_TOOL_NAMES from agent.pyto catch typos at load, not at agent-build.
| max_output_tokens: int | None = Field(default=None, ge=1) | ||
| tools: list[AgentToolSpec] = Field(default_factory=list) | ||
| thinking_include_thoughts: bool = Field(default=False) | ||
| thinking_budget: int | None = Field(default=None, ge=-1) |
There was a problem hiding this comment.
nit: what does -1 mean in thinking_budget?
There was a problem hiding this comment.
This is a notebook-helper file that should be placed under implementations/misalignment_qa with the notebooks.
|
|
||
|
|
||
| def _build_client() -> Langfuse: | ||
| load_dotenv(dotenv_path=_repo_root() / ".env", verbose=False) |
There was a problem hiding this comment.
Anti-pattern of calling load_dotenv in the library code. Only the callery should do it once.
Pass credentials in or let Langfuse() read them from os.environ directly.
|
@ethancjackson There are a couple of comments outstanding and I added a few more.
Solution:
|
Summary
Adds
misalignment_qaas a new reference implementation for the LLM/agents evaluations bootcamp. The experiment probes whether reckless in-context examples can nudge model responses toward harmful behaviour, and whether that effect transfers across different harm domains. It is intentionally minimal — plain LLM completions, no tool use — to make the mechanics transparent and serve as a building block for participants who want to extend it to real agentic systems.Clickup Ticket(s): N/A
Type of Change
Changes Made
implementations/misalignment_qa/) — a YAML-driven experiment runner that tests five in-context-learning conditions (baseline, examples as LLM messages, examples as LLM messages + priority instruction, examples as user context, examples as user context + priority instruction) across six commercial models (three Gemini, three Anthropic), producing 30 variants against a shared 9-task datasetexamples_inject_modeconfig field controls whether examples reach the model as LLM API messages (developer surface) or as plain text inside the user message (end-user surface), implemented viapreparation.pyandtask.pyAgentSpec.temperatureis nowfloat | None;claude-opus-4-7variants carrytemperature: null(that model has deprecated the parameter); all other models usetemperature: 0.2; variant-level null overrides are propagated correctly via Pydanticmodel_fields_setinresolve_agent_spec01_interactive_single_run.ipynb(optional single-item preview),run.py(full 30-variant experiment),02_inspect_results.ipynb(pull results from Langfuse, heatmap dashboard + collapsible trace detail cards)analysis.py— helper module for the results notebook, replacing the oldresults_notebook.py; includes correctconditionmetadata extractionTesting
uv run pytest tests/)uv run mypy <src_dir>)uv run ruff check src_dir/)Manual testing details:
AuthenticationError(invalid key) and aBadRequestError(temperaturedeprecated onclaude-opus-4-7) — both error classes are now surfaced clearly in the warning summary02_inspect_results.ipynbrun end-to-end against a partial dataset (one condition); heatmap rendered correctly and collapsible trace cards displayed as intendedcondition_metadatafields populated, temperature matrix verified programmaticallyScreenshots/Recordings
N/A
Related Issues
N/A
Deployment Notes
Participants need
.enventries forGOOGLE_API_KEYand/orANTHROPIC_API_KEYin addition to the standard Langfuse keys. The experiment runs with only one provider's key and reports skipped variants at the end. No infrastructure changes required.Checklist