add: LLM Prisoner's Dilemma — when agents reason about trust, not just strategy#378
abhinavk0220 wants to merge 11 commits into mesa:main from …
Conversation
Adds an iterated Prisoner's Dilemma simulation where agents use LLM Chain-of-Thought reasoning to decide whether to cooperate or defect each round, instead of fixed strategies like tit-for-tat or always-defect. Agents reason about partner history, trust signals, and long-term payoff before deciding, producing emergent negotiation and trust-building behavior that fixed-strategy models cannot capture.

Payoff matrix:
- Both cooperate: 3, 3
- Defect vs. cooperate: 5, 0
- Both defect: 1, 1

Visualization tracks cooperation rate and average score over rounds. Includes `.env.example` for Gemini, OpenAI, Anthropic, and Ollama.

Reference: Axelrod, R. (1984). The Evolution of Cooperation.
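The payoff matrix above can be sketched as a simple lookup table. This is an illustrative sketch only; `PAYOFFS` and the `score` helper are hypothetical names, not necessarily the identifiers used in the PR:

```python
# Illustrative payoff table for the iterated Prisoner's Dilemma
# (names are hypothetical, not taken from the PR's source).
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("defect", "cooperate"): (5, 0),
    ("cooperate", "defect"): (0, 5),
    ("defect", "defect"): (1, 1),
}

def score(action_a: str, action_b: str) -> tuple[int, int]:
    """Return (payoff_a, payoff_b) for one round."""
    return PAYOFFS[(action_a, action_b)]

print(score("defect", "cooperate"))  # the temptation payoff: (5, 0)
```

Keeping payoffs in a single table makes the game parameters easy to vary without touching agent logic.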
Nice concept, Prisoner's Dilemma with LLM reasoning makes a lot of sense. However, the decisions are currently hardcoded:

```python
action1 = "cooperate"
action2 = "cooperate"
```

The agent design looks good though: payoff matrix, …
Thanks for the effort. Could you:
Thanks for the review @EwoutH! Will add the screenshot and move …
Thanks @AdityaChauhanX07 for the thorough review! All three issues are now fixed:
Glad the agent design looked clean; it just needed the reasoning hooked …
Hi @EwoutH! Both requests addressed:
Also, peer review is complete. @AdityaChauhanX07 reviewed the PR and …
Thanks! Could you:
Hi @EwoutH! Updates done:
Also did a self-review on the Files changed tab. Ready to merge!
Thanks both for your initial efforts. You're on the right track. However, it still needs a serious peer review and maintainer review. Showing how you handle this process is more important for GSoC than whether the PR is merged.
Thanks @EwoutH! Understood, the process matters more than the merge. I'll get a more thorough peer review done on #378 using the #390 review guidelines. Also working on the proposal; excited about both Mesa-LLM iteration …
@abhinavk0220 …
Hi @abhinavk0220,

Thank you for this impressive contribution! Moving the Prisoner's Dilemma from rigid, rule-based strategies to LLM-driven Chain-of-Thought reasoning is a fantastic showcase for the … I've run the model locally, reviewed the code, and evaluated it strictly against the Mesa Examples #390 Review Guidelines. Here are my notes:

Does it belong? Absolutely. It is well-scoped, doesn't unnecessarily overlap with existing examples, and perfectly demonstrates the emergent social dynamics (trust, betrayal, forgiveness) that happen when agents have memory and reasoning capabilities.

Is it correct and current? The model visualizes correctly, but there are two critical vulnerabilities that need to be addressed to ensure both scientific correctness and UI stability.
Is it clean? The code is mostly very readable, but the developer experience could be slightly improved:
This is a really exciting example! Once the concurrent API calling and the parsing logic are hardened, this will be an excellent template for future LLM-based ABMs. Great work so far!
- Robust action parsing: replace brittle substring match with a regex on a mandatory `<ACTION>: COOPERATE/DEFECT` tag in the system prompt, preventing false defection when the LLM says "I will not defect"
- Concurrent LLM calls: use `ThreadPoolExecutor` so all agents reason in parallel before any outcome is applied, fixing the Solara UI freeze and ensuring true simultaneity (no dirty reads between pairs)
- Mesa time: replace custom `round_number` with `super().step()` so the model uses Mesa's built-in time tracking via `model.time`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
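The robust-parsing fix described in this commit could look roughly like this. Only the `<ACTION>:` tag format comes from the commit message; the function name and the fallback default are illustrative assumptions:

```python
import re

# Match a mandatory "<ACTION>: COOPERATE" / "<ACTION>: DEFECT" tag anywhere
# in the LLM output. A plain substring check would misread "I will not
# defect" as a defection; anchoring on the explicit tag avoids that.
ACTION_RE = re.compile(r"<ACTION>:\s*(COOPERATE|DEFECT)", re.IGNORECASE)

def parse_action(llm_output: str, default: str = "defect") -> str:
    # Falling back to a default on a malformed reply is an assumption here;
    # the PR may handle parse failures differently (e.g. retry).
    match = ACTION_RE.search(llm_output)
    return match.group(1).lower() if match else default

print(parse_action("Reasoning: I will not defect.\n<ACTION>: COOPERATE"))  # cooperate
```

Note that the phrase "I will not defect" in the reasoning text no longer triggers a false defection, because only the tagged line is consulted.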
Thanks @ZhehaoZhao423 for the thorough review; all three points addressed in the latest commit:
1. Brittle output parsing — fixed
2. Sequential IO & simultaneity — fixed
3. Mesa time tracking — fixed
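The simultaneity fix in point 2 can be sketched as a two-phase round: collect every decision first, then apply outcomes. `DemoAgent`, `decide`, and `apply_outcome` are stand-in names, not the PR's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

class DemoAgent:
    """Stand-in for the real LLM agent: decide() would make a blocking LLM call."""
    def __init__(self, choice):
        self.choice = choice
        self.history = []
    def decide(self):
        return self.choice
    def apply_outcome(self, action):
        self.history.append(action)

def run_round(agents):
    # Phase 1: all agents reason in parallel. LLM calls are IO-bound, so
    # threads overlap the waiting and keep the UI responsive.
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        actions = list(pool.map(lambda a: a.decide(), agents))
    # Phase 2: apply outcomes only after every decision is collected,
    # guaranteeing simultaneous moves (no dirty reads between pairs).
    for agent, action in zip(agents, actions):
        agent.apply_outcome(action)
    return actions

print(run_round([DemoAgent("cooperate"), DemoAgent("defect")]))
# ['cooperate', 'defect']
```

`pool.map` preserves input order, so actions line up with agents when outcomes are applied.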
@abhinavk0220

Summary: What I think about this
This is a …

Details I saw
Improvements that can be done
Design questions
Does it belong?
Is it correct and current?
Is it clean?
Summary
- Add prisoners_dilemma_dashboard.png (round 1: all defect — Nash equilibrium)
- Add prisoners_dilemma_initial.png (step 0 empty state)
- Expand README Visualization section: explain why 100% defection in round 1 is the correct game-theoretic outcome, not a bug

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace placeholder screenshot with 5-round run showing cooperation collapse after exploitation — emergent Nash equilibrium lock-in with no hardcoded strategy. Expand README with round-by-round analysis table and connection to Axelrod (1984). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- prisoners_dilemma_initial.png: step 1 showing first cooperation attempt
- prisoners_dilemma_dashboard.png: step 5 showing full arc (coop peak → collapse)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Thanks @apfine for the detailed review and design questions. Addressing each point:

Inline documentation: Fair point. The …

Prompt abstraction: Acknowledged; moving …

Token/cost tracking: Good idea for a future enhancement; not in scope for this PR, but worth noting in the README as a suggested extension.

Memory decay: Already handled …

Cheap talk (pre-decision communication): This is the most interesting design question. Mesa-LLM has a …
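On the memory-decay point, one common approach is a bounded window over recent rounds. This sketch with `collections.deque` is purely illustrative; the PR's actual mechanism is not shown here:

```python
from collections import deque

class BoundedMemory:
    """Keep only the most recent N interactions, so old betrayals
    eventually fall out of the context handed to the LLM."""
    def __init__(self, maxlen: int = 5):
        self.rounds = deque(maxlen=maxlen)  # deque discards oldest entries
    def remember(self, my_action: str, partner_action: str):
        self.rounds.append((my_action, partner_action))
    def recall(self) -> list[tuple[str, str]]:
        return list(self.rounds)

mem = BoundedMemory(maxlen=2)
for pair in [("cooperate", "defect"), ("defect", "defect"), ("cooperate", "cooperate")]:
    mem.remember(*pair)
print(mem.recall())  # only the last two rounds survive
```

A sliding window like this also bounds prompt length, which matters once histories grow past the model's context budget.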
Thanks @savirpatil for the careful review. Two specific responses:

On …:

On …: That import was removed in an earlier commit; the current …

Appreciate the thorough read; this is exactly the kind of review that improves the model.
Summary
Adds an iterated Prisoner's Dilemma simulation where agents use LLM
Chain-of-Thought reasoning to decide whether to cooperate or defect
each round — instead of following fixed strategies like tit-for-tat,
always-defect, or random.
How this differs from classical Prisoner's Dilemma models
Classical PD models hardcode strategies — agents follow predefined
behavioral rules (tit-for-tat, always-defect, Pavlov) based on parameters.
This model takes a fundamentally different approach: agents have
no fixed strategy. Instead they use LLM Chain-of-Thought reasoning
to decide whether to cooperate or defect at each step based on their
interaction history, partner behavior, and long-term payoff reasoning.
The distinction matters because agents can develop emergent behaviors (reputation building, trust signaling, strategic retaliation) that fixed rules cannot capture.
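For contrast, a classical fixed-strategy agent is fully determined at initialization. This tit-for-tat sketch (hypothetical class name, not from the PR) never reconsiders trust or payoff:

```python
class TitForTatAgent:
    """Classical fixed strategy: cooperate first, then mirror the
    partner's previous move. The rule is frozen at initialization;
    no reasoning about history or payoff ever occurs."""
    def __init__(self):
        self.partner_last = None
    def decide(self) -> str:
        return "cooperate" if self.partner_last is None else self.partner_last
    def observe(self, partner_action: str):
        self.partner_last = partner_action

agent = TitForTatAgent()
print(agent.decide())   # cooperate (opening move)
agent.observe("defect")
print(agent.decide())   # defect (retaliation)
```

The LLM-based agent replaces this hardcoded `decide` rule with a reasoning step over the full interaction history.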
What makes this different from a classical PD model
In a standard iterated PD model, strategies are fixed at initialization.
Here, agents reason about their situation at each step — observing partner
history, weighing trust against exploitation risk — and choose an action:
- cooperate — work together for mutual benefit (3, 3 payoff)
- defect — betray partner for personal gain (5, 0 payoff)

Payoff matrix:
- Both cooperate: 3, 3
- Defect vs. cooperate: 5, 0
- Both defect: 1, 1
This produces emergent negotiation dynamics — reputation building,
trust signaling, strategic retaliation — that fixed rules cannot replicate.
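A minimal sketch of how such a per-step reasoning prompt might be assembled from interaction history. The wording, function name, and payoff summary line are illustrative, not the PR's actual prompt; only the `<ACTION>:` tag convention appears in the commit history:

```python
def build_prompt(history: list[tuple[str, str]]) -> str:
    """Assemble a Chain-of-Thought prompt from (my_action, partner_action)
    pairs; the real model would send this to an LLM and parse the reply."""
    lines = [
        "You are playing an iterated Prisoner's Dilemma.",
        "Payoffs: both cooperate 3/3, defect vs cooperate 5/0, both defect 1/1.",
    ]
    for i, (mine, theirs) in enumerate(history, start=1):
        lines.append(f"Round {i}: you played {mine}, partner played {theirs}.")
    lines.append("Reason step by step about trust and long-term payoff, "
                 "then answer with '<ACTION>: COOPERATE' or '<ACTION>: DEFECT'.")
    return "\n".join(lines)

print(build_prompt([("cooperate", "defect")]))
```

Because the prompt is rebuilt every step from live history, the agent's effective strategy can shift round to round, which is exactly what fixed-rule agents cannot do.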
Visualization — Emergent Game Theory from Pure LLM Reasoning
Round 1 — First cooperation attempt:
After 5 rounds of LLM-driven reasoning:
What the charts show — round by round:
Why this is significant: This is Axelrod's Evolution of Cooperation
(1984) core result reproduced with zero hardcoded strategy. No
tit-for-tat rule, no punishment parameter, no threshold. The LLM reasoned
its way to this behavior by reflecting on interaction history at each step.
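The cooperation-rate series plotted in these charts is simply the fraction of cooperative moves per round; a minimal sketch (function name illustrative):

```python
def cooperation_rate(round_actions: list[str]) -> float:
    """Fraction of agents that cooperated in a single round."""
    return round_actions.count("cooperate") / len(round_actions)

# Round 1 of the run shown above: every agent defects (Nash equilibrium).
print(cooperation_rate(["defect", "defect", "defect", "defect"]))  # 0.0
print(cooperation_rate(["cooperate", "defect"]))                   # 0.5
```

In the Mesa model this per-round statistic would typically be collected via a `DataCollector` model reporter and fed to the Solara chart.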
How to Run
```shell
cp .env.example .env   # fill in your API key
pip install -r requirements.txt
solara run app.py
```

Supported LLM Providers

Gemini, OpenAI, Anthropic, Ollama (local) — configured via `.env`. A `.env.example` is included for easy setup.

Reference
Axelrod, R. (1984). The Evolution of Cooperation. Basic Books.
GSoC contributor checklist
Context & motivation
Classical PD models use fixed strategies. Real strategic behavior is
driven by reasoning — agents assess trust, weigh past interactions, and
decide whether to cooperate or retaliate. This model replaces fixed
strategies with LLM reasoning to produce more behaviorally realistic
game dynamics.
What I learned
LLM agents independently converge to the Nash equilibrium in early
rounds (defect when no trust exists) and develop punishment behavior
after exploitation — without being told to. The emergent tit-for-tat
pattern from Axelrod's foundational work appears naturally from
language-based reasoning, not from a hardcoded rule.
Learning repo
🔗 My learning repo: https://github.com/abhinavk0220/GSoC-learning-space
🔗 Relevant model(s): https://github.com/abhinavk0220/GSoC-learning-space/tree/main/models/llm_prisoners_dilemma
Readiness checks
- Tests run with coverage (`pytest --cov=mesa tests/`)
- Linted (`ruff check . --fix`)

AI Assistance Disclosure
This PR was developed with AI assistance (Claude) for code generation
and debugging. All code has been reviewed, tested, and understood by
the contributor.
Mesa Examples Review Checklist (#390)
Does it belong?
Is it correct?
- … (`DataCollector`, `Model`)
- … (`.env.example`)
- No `rng` seed — LLM outputs are non-deterministic by nature
- … `llm/` directory

Is it clean?