A multi-agent reinforcement learning framework that combines Large Language Models with coordinated decision-making for complex reasoning tasks
Existing multi-agent frameworks face significant limitations when applied to Large Language Models (LLMs). Traditional approaches struggle with the high-dimensional nature of language models and lack proper coordination mechanisms for complex reasoning tasks.
Current multi-agent LLM systems suffer from:
- Prohibitive Communication Costs: Traditional multi-agent debate relies on explicit message passing, incurring substantial token costs and computational overhead
- No Convergence Guarantees: Current approaches lack theoretical assurances of converging to stable, effective solutions
- Scalability Challenges: Information exchange often exceeds LLM context limits, severely impeding scalability in large agent ensembles
To address these critical challenges, we introduce ECON - a multi-agent LLM framework that implements efficient coordination via Bayesian Nash Equilibrium, enabling scalable and theoretically grounded multi-agent reasoning.
- Implicit Belief-Driven Coordination: Replaces costly message passing with belief-based coordination, dramatically reducing communication overhead
- Guaranteed Convergence to Equilibrium: Establishes a rigorous Bayesian Nash Equilibrium (BNE) framework with theoretical convergence guarantees
- Hierarchical & Scalable Architecture: Enables effective coordination in large ensembles via a local-to-global approach that respects LLM context limits
We provide two installation methods:
Install the ECON framework dependencies:
```bash
pip install -r requirements.txt
```

For development or customization, clone the repository and set up the environment:
```bash
# Clone the repository
git clone https://github.com/yourusername/ECON.git
cd ECON

# Create and activate conda environment
conda create -n econ python=3.8
conda activate econ

# Install dependencies
pip install -r requirements.txt
```

Before running the framework, you need to set up the Together AI API key:
```bash
export TOGETHER_API_KEY="your_together_ai_api_key"
```

Set your API key once:
```bash
export TOGETHER_API_KEY="your_together_ai_api_key"

python scripts/run_math_test.py \
  --train-eps 1 \
  --test-eps 5 \
  --log-dir logs_exp1 \
  --model-dir models_exp1

python scripts/run_p0_test.py \
  --train-eps 100 \
  --test-eps 30 \
  --log-dir logs_exp1 \
  --model-dir models_exp1
```

Notes:
- `max_rounds` defaults to 1 (single decision) in `scripts/config_p0.yaml` and `scripts/config_math.yaml`; increase it if you need multi-round episodes.
- Reward weights in the environment reuse the α weights learned during training; override them via `reward.initial_weights` in the config.
- All schemes expect `agent_memory` so the runner/learner can consume short-term trajectories.
- `n_agents`: Number of executor agents (e.g., 3, 5, 8)
- `coordinator_model`: Coordinator LLM model name
- `executor_model`: Executor LLM model name
- `update_interval`: Gradient update frequency (default: 10 steps)
- `bne_max_iterations`: Maximum BNE coordination iterations
- `belief_dim`: Dimension of agent belief states
- `sampling.temperature_min/max`: Bounds for the sampling temperature
- `sampling.p_min/max`: Bounds for the repetition penalty (second action dimension)
- `sampling.top_p_default`: Fixed top_p used for generation (default: 0.9)
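To make the `sampling.*` bounds concrete, here is a minimal sketch of how a normalized two-dimensional action could be scaled into those ranges; `scale_action`, `action_to_sampling`, and the flat config layout are illustrative assumptions, not ECON's actual code:

```python
# Hypothetical sketch: scaling normalized actions into the sampling bounds above.

def scale_action(a: float, lo: float, hi: float) -> float:
    """Map a normalized action a in [0, 1] linearly into [lo, hi]."""
    return lo + a * (hi - lo)

def action_to_sampling(action, cfg):
    """Turn a 2-D agent action into generation parameters (illustrative)."""
    return {
        "temperature": scale_action(action[0], cfg["temperature_min"], cfg["temperature_max"]),
        "repetition_penalty": scale_action(action[1], cfg["p_min"], cfg["p_max"]),
        "top_p": cfg.get("top_p_default", 0.9),  # fixed; not part of the action
    }
```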
The framework supports any open-source language model accessible via the Together AI API. Models can be hosted using:
- Together AI: For remote model serving with API access
- Local APIs: Compatible with OpenAI-style APIs
```bash
./run_econ.sh \
  --api-key YOUR_API_KEY \
  --config src/config/config.yaml \
  --agents 3 \
  --experiment-name llama-coordination-test
```

Create your own datasets following the Hugging Face format with question and answer fields:
```yaml
env_args:
  hf_dataset_path: "your_custom_dataset"
  dataset_split: "train"
  question_field_name: "question"
  answer_field_name: "answer"
  max_question_length: 1024
  max_answer_length: 512
```

The framework provides multiple testing approaches for comprehensive model validation:
1. Integrated Training + Testing (scripts/run_p0_test.py)
Train and test in one command (recommended for quick experiments):
```bash
# Quick test (5 train episodes, 3 test episodes)
python scripts/run_p0_test.py \
  --train-eps 5 \
  --test-eps 3 \
  --log-dir logs_quick \
  --model-dir models_quick

# Full training (100 train episodes, 30 test episodes)
python scripts/run_p0_test.py \
  --train-eps 100 \
  --test-eps 30 \
  --log-dir logs_exp1 \
  --model-dir models_exp1
```

Features:
- Automatically runs training followed by BNE testing (3 rounds)
- Saves model checkpoints to `--model-dir`
- Logs test traces to `--log-dir/llm_traces_test_bne_3rounds.json`
- Reports accuracy and P0 metadata (JSON parsing rate)
2. Testing Pre-trained Models (scripts/test_p0.py)
Test existing trained models without retraining:
```bash
export TOGETHER_API_KEY="your_api_key"
python scripts/test_p0.py
```

Configuration:
- Edit the `MODEL_DIR` variable in `test_p0.py` to point to your trained model directory (e.g., `./models_exp1/final`)
- Runs both baseline (no BNE) and P0 BNE (3 rounds) tests
- Outputs: `logs_p0_test_baseline.json` and `logs_p0_test_p0.json`
Example Output:

```
Baseline (no BNE): 10/10 = 100.0%
P0 BNE (3 rounds): 10/10 = 100.0%
P0 Metadata: JSON=100%
```
Note: `src/eval.py` is not part of the current workflow; rely on the test scripts above for evaluation.
Test on MATH or SVAMP datasets:
```bash
# MATH dataset
python scripts/run_math_test.py \
  --train-eps 5 \
  --test-eps 10 \
  --log-dir logs_math \
  --model-dir models_math

# SVAMP dataset
python scripts/run_svamp_test.py \
  --train-eps 5 \
  --test-eps 10 \
  --log-dir logs_svamp \
  --model-dir models_svamp
```

Already-trained checkpoints can be evaluated directly with `scripts/test_math.py` and `scripts/test_svamp.py` (baseline vs. BNE, 10 episodes each by default).
Important: Episodes in ECON have a unique structure that differs from traditional RL environments.
Default Setup (Single-Decision Episodes):
```
Episode = One math problem
├─ t=0: Decision step (includes K internal BNE refinement rounds)
│   └─ BNE coordination: belief updates, response generation, convergence
└─ t=1: Terminal state (reward computation)
```
Key Points:
- One episode = one math problem (not multiple attempts)
- 2 RL timesteps = 1 decision + 1 terminal (standard episodic RL)
- BNE refinement (K rounds) happens internally at t=0
- Multi-round debate occurs inside the decision step via belief coordination
Internal BNE Process (at t=0):
```
# Within single timestep t=0:
Round 0: e_init      → LLM outputs → Commitment_0
Round 1: e_refined_1 → LLM outputs → Commitment_1
Round 2: e_refined_2 → LLM outputs → Commitment_2
Final:   Submit Commitment_2 as the answer
```

Configuration:
```yaml
bne:
  max_iterations_train: 1   # BNE rounds during training
  max_iterations_infer: 3   # BNE rounds during testing
```

Optional Multi-Round Episodes:
For environments requiring multiple answer revision steps:
```yaml
env_args:
  max_rounds: 3   # Enable multi-round attempts (default is 1)
```

This creates true multi-step episodes:
- Single-step (default): 2 timesteps (decision + terminal)
- Multi-round (optional): (N+1) timesteps (N decisions + terminal)
- Coordinator LLM: Generates strategies (≤50 tokens) and final commitments without revealing answers
- Executor LLMs: Multiple agents that process strategies and generate individual responses
- BeliefNetwork: Individual agent belief state management with Q-value computation
- BeliefEncoder: Group representation aggregation using attention mechanisms
- Mixer: Attention-based agent interaction layer that aggregates local Q values with commitment-alignment and consistency regularization
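To make the Mixer's role concrete, here is a minimal PyTorch sketch of an attention-based mixing layer in the spirit described above; the class name, layer shapes, and weighted-sum aggregation are illustrative assumptions, not ECON's actual architecture (which additionally applies commitment-alignment and consistency regularization):

```python
import torch
import torch.nn as nn

class AttentionMixer(nn.Module):
    """Aggregate per-agent local Q-values into a global Q via attention.

    Illustrative sketch only; assumes belief_dim is divisible by n_heads.
    """

    def __init__(self, belief_dim: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(belief_dim, n_heads, batch_first=True)
        self.score = nn.Linear(belief_dim, 1)

    def forward(self, beliefs: torch.Tensor, local_qs: torch.Tensor) -> torch.Tensor:
        # beliefs: (batch, n_agents, belief_dim); local_qs: (batch, n_agents)
        mixed, _ = self.attn(beliefs, beliefs, beliefs)            # agent interaction
        weights = torch.softmax(self.score(mixed).squeeze(-1), dim=-1)
        return (weights * local_qs).sum(dim=-1)                    # global Q value
```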
Our approach implements a rigorous two-stage BNE framework:
- Stage 1 - Individual Belief Formation: Each agent develops independent belief states and generates initial responses
- Stage 2 - BNE Coordination: Agents iteratively update beliefs through equilibrium computation until convergence
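As a minimal illustration of Stage 2, the following sketch iterates a generic belief-update step until the change between rounds falls below a tolerance; `update_fn`, the tolerance, and the norm-based convergence test are assumptions for illustration, not ECON's actual equilibrium computation:

```python
import numpy as np

def bne_coordinate(beliefs: np.ndarray, update_fn, max_iterations: int = 3,
                   tol: float = 1e-3) -> np.ndarray:
    """Iterate belief updates until (approximate) equilibrium.

    beliefs: (n_agents, belief_dim) array of current belief states.
    update_fn: one round of equilibrium computation over all agents.
    """
    for _ in range(max_iterations):
        new_beliefs = update_fn(beliefs)
        if np.linalg.norm(new_beliefs - beliefs) < tol:  # beliefs stable: converged
            return new_beliefs
        beliefs = new_beliefs
    return beliefs
```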
Three reward components are implemented in `src/envs/huggingface_dataset_env.py` using embeddings:
- Task-Specific (TS): Binary correctness from numeric match with the ground truth.
- Action Likelihood (AL): Cosine similarity between executor outputs and the coordinator commitment embedding (mapped to [0,1]).
- Collaborative Contribution (CC): Combination of fidelity to the commitment and novelty vs. peer responses (embedding-based).
Dynamic α-weights are learned via the learner's `alpha_logits` (see `src/learners/q_learner.py`), and the runner passes the live weights to the environment so reward composition stays consistent between training and testing. AL/CC require the commitment/output embedding model (`BAAI/bge-large-en-v1.5`, installed via `sentence-transformers`) to be available locally.
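The composition can be pictured as a convex combination of the three components. The sketch below assumes the α weights come from a softmax over `alpha_logits` and that cosine similarity is mapped to [0, 1] via (cos + 1)/2; both are illustrative assumptions consistent with the description above, not the exact code:

```python
import torch
import torch.nn.functional as F

def action_likelihood(output_emb: torch.Tensor, commitment_emb: torch.Tensor) -> torch.Tensor:
    """AL component: cosine similarity mapped from [-1, 1] into [0, 1] (assumed mapping)."""
    cos = F.cosine_similarity(output_emb, commitment_emb, dim=-1)
    return (cos + 1.0) / 2.0

def compose_reward(ts: float, al: float, cc: float, alpha_logits: torch.Tensor) -> torch.Tensor:
    """Blend TS/AL/CC with dynamic α weights (softmax normalization assumed)."""
    alpha = torch.softmax(alpha_logits, dim=0)  # α_TS, α_AL, α_CC sum to 1
    return alpha[0] * ts + alpha[1] * al + alpha[2] * cc
```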
- TD Loss: Local Q TD error (per agent).
- Mixer Loss: Global TD + consistency loss + commitment alignment from the mixer.
- BNE Losses: For BNE mode, equilibrium and commitment-improvement terms are added (see learner).
- α Update: Optional dynamic reward-weight update (when enabled in config).
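Putting the pieces together, a training step might combine the terms roughly as follows; the λ coefficients and the exact grouping are assumptions for illustration (see `src/learners/q_learner.py` for the real composition):

```python
def combine_losses(td_loss, mixer_loss, eq_loss=None, commit_loss=None,
                   lam_eq=1.0, lam_commit=1.0):
    """Sum the loss terms; BNE mode adds equilibrium and commitment terms."""
    loss = td_loss + mixer_loss
    if eq_loss is not None:  # BNE mode
        loss = loss + lam_eq * eq_loss + lam_commit * commit_loss
    return loss
```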
Our practical early-stopping hooks are:
- BNE convergence guard (train & infer): implemented in `src/controllers/basic_mac.py`. Refinement halts when the coordinator commitment matches the previous round, the prompt-parameter delta falls below the configured `bne.convergence_threshold`, or an oscillation pattern is detected.
- Best-checkpoint saving: `src/train.py` saves periodic checkpoints and always writes a final snapshot.
- Theoretical three-criterion stopper (optional): enabled when `early_stopping.enabled: true` and thresholds are provided in the config (`commitment_threshold`, `reward_threshold`, `loss_threshold`, plus `patience`/`warmup`). Training stops when all three conditions hold for `patience` episodes (see the sketch below). If thresholds are absent, a single-metric stopper (default `reward_mean`) is used.
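For concreteness, here is a minimal sketch of how such a three-criterion stopper could be implemented; the class name and the direction of each threshold comparison (commitment change small, reward high, loss low) are assumptions based on the description above, not the actual implementation:

```python
class ThreeCriterionStopper:
    """Stop when all three criteria hold for `patience` consecutive episodes."""

    def __init__(self, commitment_threshold: float, reward_threshold: float,
                 loss_threshold: float, patience: int = 5, warmup: int = 10):
        self.ct, self.rt, self.lt = commitment_threshold, reward_threshold, loss_threshold
        self.patience, self.warmup = patience, warmup
        self.episode = 0
        self.streak = 0

    def should_stop(self, commitment_delta: float, reward_mean: float, loss: float) -> bool:
        self.episode += 1
        if self.episode <= self.warmup:   # ignore early, noisy episodes
            return False
        ok = (commitment_delta < self.ct and reward_mean > self.rt and loss < self.lt)
        self.streak = self.streak + 1 if ok else 0
        return self.streak >= self.patience
```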
Soft update mechanism (paper Section 4.4):
φ' ← τφ + (1-τ)φ'
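A minimal PyTorch sketch of this soft target update; the function name and module-based signature are illustrative, not ECON's actual code:

```python
import torch

@torch.no_grad()
def soft_update(target: torch.nn.Module, source: torch.nn.Module, tau: float = 0.01) -> None:
    """Blend online parameters into the target network: φ' ← τφ + (1 − τ)φ'."""
    for tgt, src in zip(target.parameters(), source.parameters()):
        tgt.mul_(1.0 - tau).add_(tau * src)
```

The corresponding configuration keys: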
```yaml
target_update_tau: 0.01      # Soft update parameter τ
target_update_interval: 8    # Update frequency
```

If you find this work useful for your research, please cite:
```bibtex
@inproceedings{
  yi2025from,
  title={From Debate to Equilibrium: Belief-Driven Multi-Agent {LLM} Reasoning via Bayesian Nash Equilibrium},
  author={Yi Xie and Zhanke Zhou and Chentao Cao and Qiyu Niu and Tongliang Liu and Bo Han},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=RQwexjUCxm}
}
```

Q1: How do I test?
Use `scripts/run_p0_test.py` for train+test in one run, or `scripts/test_p0.py` to evaluate a saved checkpoint (baseline vs. BNE).
Q2: Why do episodes only have 2 timesteps?
The environment is single-decision (t=0) + terminal (t=1); multi-round BNE happens inside the decision step.
Q3: What early-stopping logic is active?
BNE convergence checks inside the controller, plus the optional single-metric stopper in train.py (patience on reward_mean when enabled).
For questions, technical support, or collaboration inquiries:
- Issues: GitHub Issues

