A practical implementation of the "Multi-Agent Debate" architecture to reduce hallucinations and improve LLM factuality through iterative consensus.
AI Consensus Validator is a research and verification tool that subjects Large Language Model (LLM) responses to peer scrutiny. Instead of relying on a single output, this system orchestrates an autonomous debate among different models (such as GPT-4, Claude 3.5, and Gemini) until they reach a consensus validated by an impartial judge.
This project is an applied implementation of the concepts presented in the academic paper: 📄 "Improving Factuality and Reasoning in Language Models through Multiagent Debate" (Yilun Du et al., MIT/Google).
- Multi-Agent Architecture: Graph-based orchestration using LangGraph.
- Model Agnostic: Native support for:
- 🟢 OpenAI (GPT-4o, GPT-3.5)
- 🟠 Anthropic (Claude 3.5 Sonnet)
- 🔵 Google (Gemini 2.5 Pro/Flash)
- ⚡ Groq (Llama 4, Qwen 3, Kimi K2)
- Feedback Loop: Models view their rivals' responses, critique them, and self-correct errors in real-time.
- Dynamic Judge: You can configure which specific model acts as the arbiter of truth.
- Modern GUI: UI built with Streamlit to visualize the step-by-step thinking process.
- State Management: Persistent chat history and secure API Key configuration via the interface.
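The model-agnostic routing described above can be sketched as a simple dispatch table. Everything below is illustrative: the `call_*` helpers stand in for the real provider SDK clients, and the model names are just examples, not the project's actual API.

```python
# Minimal sketch of model-agnostic client routing (hypothetical names).
# In the real app, graph_validator.py routes each model to its provider SDK;
# here the call_* helpers are placeholders that echo instead of calling an API.

def call_openai(model: str, prompt: str) -> str:
    return f"[openai:{model}] answer to: {prompt}"  # stand-in for a real API call

def call_anthropic(model: str, prompt: str) -> str:
    return f"[anthropic:{model}] answer to: {prompt}"

# Dispatch table mapping model identifiers to their provider client.
PROVIDERS = {
    "gpt-4o": call_openai,
    "claude-3-5-sonnet": call_anthropic,
}

def route(model: str, prompt: str) -> str:
    """Dispatch a prompt to the client registered for the given model."""
    try:
        return PROVIDERS[model](model, prompt)
    except KeyError:
        raise ValueError(f"No client configured for model '{model}'")
```

Adding a new provider then amounts to registering one more entry in the dispatch table.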
The system utilizes a cyclic state graph (StateGraph) following this logical flow:
```mermaid
graph TD
    A[Start: User Question] --> B["Parallel Generation (Debaters)"]
    B --> C[Judge Evaluates Responses]
    C --> D{Consensus?}
    D -- Yes --> E["✅ Final Validated Answer"]
    D -- "No (Discrepancy)" --> F[Inject Cross-Feedback]
    F --> B
```
- Round 0: Selected models answer the user's question independently.
- Judgment: The Judge analyzes semantic and factual coherence between answers.
- Debate: If discrepancies exist, the system injects the other models' answers into each agent's context.
- Refinement: Models reconsider their answers based on peer critique.
- Iteration: The process repeats until consensus is reached or the round limit is hit.
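The five steps above amount to a bounded consensus loop. A minimal sketch of that control flow, using plain stub functions in place of real LLM calls and of the LangGraph wiring (the function names are illustrative, not the project's actual API):

```python
from typing import Callable

def debate(question: str,
           debaters: dict[str, Callable[[str], str]],
           judge: Callable[[dict[str, str]], bool],
           max_rounds: int = 3) -> dict[str, str]:
    """Run the generate -> judge -> cross-feedback loop until consensus
    is reached or the round limit is hit."""
    # Round 0: every model starts from the bare question.
    prompts = {name: question for name in debaters}
    answers: dict[str, str] = {}
    for _ in range(max_rounds + 1):
        # Generation: each model answers its current prompt independently.
        answers = {name: model(prompts[name]) for name, model in debaters.items()}
        # Judgment: the judge decides whether the answers agree.
        if judge(answers):
            return answers
        # Debate: inject every rival's answer into each agent's next prompt.
        for name in debaters:
            rivals = "\n".join(f"{n}: {a}" for n, a in answers.items() if n != name)
            prompts[name] = (f"{question}\n\nOther agents answered:\n{rivals}\n"
                             "Reconsider and refine your answer.")
    return answers  # round limit hit without consensus
```

With two stub agents where one self-corrects after seeing its rival's answer, the loop converges in a single debate round; the real system replaces the stubs with API calls and the boolean judge with an LLM arbiter.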
- Clone the repository:
  git clone https://github.com/your-username/ai-consensus-validator.git
  cd ai-consensus-validator
- Create a virtual environment (recommended):
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Run the Streamlit application:
  streamlit run app.py
- Access the interface: open your browser at http://localhost:8501.
- App Configuration:
- Step 1: Enter your API Keys in the sidebar (OpenAI, Anthropic, Google, Groq).
- Step 2: Select at least 2 models to participate in the debate.
- Step 3: Select 1 model to act as the Judge/Validator.
- Step 4: Enter your question in the chat input and watch the debate unfold step-by-step.
ai-consensus-validator/
├── app.py # Frontend (Streamlit UI & Session Management)
├── graph_validator.py # Graph Logic (LangGraph, Nodes & Client Routing)
├── requirements.txt # Python Dependencies
└── README.md # Documentation
Contributions are welcome! If you have ideas to improve the judge's logic, add new model providers, or enhance the UI:
- Fork the project.
- Create a branch (git checkout -b feature/AmazingFeature).
- Commit your changes (git commit -m 'Add some AmazingFeature').
- Push to the branch (git push origin feature/AmazingFeature).
- Open a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This project uses commercial APIs. Be sure to check the costs associated with using GPT-4, Claude, and other models during intensive debate loops (as each round generates multiple API calls).