Using a Weaker Monitor Model to Detect Cheating in Main Model Responses #191

joyshmitz · 2025-06-09T04:20:58Z

⏯️ Let's look at the concept of a monitor that works independently of the main system: its motivation, architecture principles, analogous systems, and ways to avoid potential pitfalls.

🧩 Idea: an independent monitoring model as a defense against “evil optimization”

In many automated systems, there is a temptation to optimize behavior toward a surface-level expectation (e.g., “just look convincing”) rather than to perform the real computational work. This often leads to:

⚠️ avoiding full processing of the request (e.g., repeating a generic, unverified response),
⚠️ imitating processing (e.g., placeholder responses that look correct but lack depth),
⚠️ taking statistically safe paths that do not ensure semantic accuracy.

This distortion is analogous to adaptive fraud in sociotechnical systems or reward hacking in reinforcement learning systems.

🧭 Proposed solution

Introduce an independent monitor that does not participate in the main generation or computation process, but:

🔎 observes the actions of the main system (outputs, logic, intermediate steps, response time),
📊 analyzes evasion patterns (excessive generalization, lack of new arguments, repetition),
📣 reports or warns of potentially undesirable strategies.

This monitor does not have to be powerful - even a weaker model or a simple heuristic layer can play the role of an auditor or opponent, which increases the integrity of the system.
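As a rough illustration of the "simple heuristic layer" option, the sketch below flags the three evasion patterns listed above. Everything in it is an assumption made for the example: the phrase list, the thresholds, and the names `audit_response` and `MonitorReport` are hypothetical, not part of any existing tool.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical filler phrases; a real deployment would tune this list.
GENERIC_PHRASES = ("it depends", "in general", "there are many factors")


@dataclass
class MonitorReport:
    flags: list[str] = field(default_factory=list)

    @property
    def suspicious(self) -> bool:
        return bool(self.flags)


def repeated_trigram_ratio(text: str) -> float:
    """Fraction of 3-word sequences that repeat an earlier one."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 0.0
    counts = Counter(trigrams)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(trigrams)


def audit_response(response: str, previous: list[str],
                   max_repetition: float = 0.2) -> MonitorReport:
    """Observe one output of the main system; never modify it."""
    report = MonitorReport()
    lowered = response.lower()
    if repeated_trigram_ratio(response) > max_repetition:
        report.flags.append("repetition")
    if sum(phrase in lowered for phrase in GENERIC_PHRASES) >= 2:
        report.flags.append("excessive generalization")
    if response.strip() in (p.strip() for p in previous):
        report.flags.append("no new content")
    return report
```

The monitor only reads the response and returns a report, so it stays outside the generation process. Heuristics this crude will produce false alarms, which is why the thresholds are parameters rather than constants.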

🧱 Architecture principles.

Separation of duties: the main system and the monitor have separate decision-making mechanisms.
One-way access: the monitor can observe the system, but not vice versa.
Use of meta-criteria: the monitor's evaluation should not rely on the same metrics that the main system optimizes (to avoid feedback leakage).
Check for behavioral changes: the monitor should learn to detect deviations from baseline norms (reference behavior profiles).
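One possible reading of "reference behavior profiles" is a statistical baseline built from past observations, with new observations flagged when they fall far outside it. The sketch below assumes two observable signals (response length and latency) and a z-score threshold of 3; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)  # frozen: the monitor gets a read-only snapshot
class Observation:
    response_chars: int
    latency_s: float


class BaselineMonitor:
    """Flags observations that deviate strongly from a reference profile."""

    def __init__(self, baseline: list[Observation], threshold: float = 3.0):
        if len(baseline) < 2:
            raise ValueError("need at least two baseline observations")
        self._threshold = threshold
        self._stats = {
            "response_chars": self._fit([o.response_chars for o in baseline]),
            "latency_s": self._fit([o.latency_s for o in baseline]),
        }

    @staticmethod
    def _fit(values: list[float]) -> tuple[float, float]:
        return mean(values), stdev(values)

    def deviations(self, obs: Observation) -> dict[str, float]:
        """Return the z-score of every signal beyond the threshold."""
        flagged: dict[str, float] = {}
        for name, (mu, sigma) in self._stats.items():
            value = getattr(obs, name)
            if sigma == 0:
                z = 0.0 if value == mu else float("inf")
            else:
                z = (value - mu) / sigma
            if abs(z) > self._threshold:
                flagged[name] = z
        return flagged
```

Note that these signals (length, latency) are not what the main system is optimized for, which is the point of the meta-criteria principle. The main system never receives the monitor's statistics, which keeps the access one-way.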

🧠 Examples of analogs

🧪 Unit Testing and Linting in code development: independent analysis of results or structure without affecting the process itself.
🎮 Anti-cheat systems in games: observation of atypical behavior.
🏛️ Ombudsmen or internal auditors in institutions.
🤖 Adversarial Evaluator in ML systems: selection of examples that provoke deviations from the norm.

📉 Potential risks

Adaptation to the monitor: if the monitor becomes part of the feedback loop, the main system may learn to satisfy the monitor rather than the real goal.
Resource consumption and complexity: the monitor is one more component to run and maintain.
False positives: legitimate behavior misidentified as evasion.
Ethical responsibility: who is accountable for the final behavior?

✅ Potential benefits

🚨 Identification of fake work or unwanted simplification.
🔐 Increased transparency of system behavior.
🧮 Possibility of formal analysis of deviations.
Better alignment with user or regulator expectations.

📢 Discussion.

Is this architecture feasible across different types of systems, from LLMs to automated decision-making?
Which simple monitoring methods are hardest for the main system to optimize against?
Should this system be internal (embedded) or external (black-box audit)?
What is the minimum functionality that gives the maximum effect?
...

@bmadcode Share your thoughts on this architecture. Does it meet the logic and goals of our project? What are the risks or potential benefits of such a system?

ksylvan · 2025-06-09T16:18:24Z

I like this idea a lot. I might even try it in practice. Start with two windows open on the same project:

My IDE Dev Agent working on implementing a feature.
My "auditor". This is an agent running with o4-mini (or even a local llama-3) given a task of double checking what's been done.

5 replies

joyshmitz Jun 10, 2025
Author

I haven't tested this idea yet, so please share your observations on the work of the “auditor”.
Does it save the main agent's focus and your time?

ksylvan Jun 10, 2025

Haven't gotten a chance to try it yet.

It might save me time if I was able to somehow have the auditor agent talk to the dev agent... Hmmm... 🤔

joyshmitz Jun 10, 2025
Author

Ask for it? That is an invitation to trouble (said with irony).

It is quite easy for me to fantasize about the benefits of a) allowing communication, b) hedging risks by using a consensus of 3 auditors with different metrics that do not contradict the project goals but increase complexity and friction...
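For what it is worth, the "consensus of 3 auditors with different metrics" idea can be sketched as a simple quorum vote. The three checks below are deliberately naive placeholders and the quorum of 2 is an arbitrary assumption; none of this comes from an existing tool.

```python
from typing import Callable

# An auditor returns True when a change looks suspicious.
Auditor = Callable[[str], bool]


def has_stub(diff: str) -> bool:
    return "TODO" in diff or "NotImplementedError" in diff


def lacks_tests(diff: str) -> bool:
    return "def test_" not in diff


def is_trivially_small(diff: str) -> bool:
    return len(diff.splitlines()) < 3


def consensus(diff: str, auditors: list[Auditor], quorum: int = 2) -> bool:
    """Flag the change only when at least `quorum` auditors agree."""
    votes = sum(1 for auditor in auditors if auditor(diff))
    return votes >= quorum


AUDITORS: list[Auditor] = [has_stub, lacks_tests, is_trivially_small]
```

Requiring agreement trades missed violations for fewer false alarms, and each extra auditor adds exactly the complexity and friction mentioned above.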

Let's try to take a critical look:

Critical insights and additions

Risk of false positives and false negatives.
If the auditor relies on simple heuristics or a weaker model, it may raise frequent false alarms or, conversely, miss real violations. Either outcome erodes trust in the monitoring system and wastes time on triaging minor alerts.
Efficiency vs. resource costs.
Introducing an audit agent can save time on routine checks, but it also adds load (compute, development time, support for another agent). A clear analysis is required: does the gain in quality and time outweigh the additional costs?
Shifting developer focus.
There may be a temptation to delegate responsibility for code quality to the auditor, which reduces the developer's personal responsibility for the result. This risks an “erosion” of professional discipline.
Adaptation to the auditor.
With prolonged use, the Dev Agent may begin to optimize its results for the auditor's criteria rather than for the real goals or benefits of the project. This can lead to formal rather than meaningful audits.
The need for calibration.
For the auditor to really save time and improve quality, its criteria must be reviewed regularly and its logic updated as the project evolves; otherwise it will quickly become outdated or ineffective.
Limitations at the implementation stage.
During the first iterations, the “auditor” often flags many minor issues or misunderstands the context of the task, so a period of manual calibration is required, which temporarily increases the time spent.
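One way to make the calibration concrete is to hand-label a sample of audited changes during the manual calibration period and track the auditor's precision and recall. This is a minimal sketch: the `(flagged, actually_bad)` pair format is an assumption, and returning 0.0 for an empty denominator is just one convention.

```python
def precision_recall(outcomes: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Each outcome is (flagged_by_auditor, actually_bad_per_human_review)."""
    tp = sum(1 for flagged, bad in outcomes if flagged and bad)
    fp = sum(1 for flagged, bad in outcomes if flagged and not bad)
    fn = sum(1 for flagged, bad in outcomes if not flagged and bad)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of alerts that were real
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of real issues caught
    return precision, recall


# Five hand-labeled audits: 3 alerts (1 real), 1 missed issue, 1 clean pass.
sample = [(True, True), (True, False), (True, False), (False, True), (False, False)]
precision, recall = precision_recall(sample)
```

For the sample above, precision is 1/3 and recall is 1/2: two of three alerts were noise, and half of the real problems slipped through. Low precision corresponds to the "waste of time" risk, low recall to the "real violations are ignored" risk.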

Conclusion:
Using an independent audit agent is a promising idea for improving quality and saving time, but only if it is constantly supported, calibrated, and critically analyzed for effectiveness. Over-automation without deep reflection can lead to formalism, “noise” and loss of true quality control.

joyshmitz Jun 10, 2025
Author

Food for thought

Here are some classic workflows in programming where the concept of an independent monitor can be used, as well as interaction algorithms for each scenario:

Feature Development

Use case:
During the development of a new feature, the main model (Dev Agent) implements the tasks and the monitor (Auditor Agent) analyzes the changes.

Interaction algorithm:

1. Dev Agent receives the task and implements the code/functionality.
2. Auditor Agent gets access to the source code, changes, or commits.
3. Auditor Agent checks the task's completion against the criteria (logic, testing, documentation).
4. Auditor Agent generates a report with comments or a quality sign-off.
5. Dev Agent receives the feedback and makes adjustments, if necessary.
6. The cycle repeats until the audit passes.
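The feature-development cycle above can be sketched as a bounded loop. The callable signatures, the `max_rounds` cap, and the two stand-in agents are assumptions added to keep the example runnable; real agents would replace `fake_dev` and `fake_auditor`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AuditResult:
    passed: bool
    comments: list[str] = field(default_factory=list)


# Hypothetical signatures: implement(task, feedback) -> code,
# audit(task, code) -> AuditResult.
Implement = Callable[[str, list[str]], str]
Audit = Callable[[str, str], AuditResult]


def run_audit_loop(task: str, implement: Implement, audit: Audit,
                   max_rounds: int = 3) -> tuple[str, int]:
    """Alternate implementation and audit until the audit passes."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        code = implement(task, feedback)  # Dev Agent acts on task + feedback
        result = audit(task, code)        # Auditor Agent only reads the code
        if result.passed:
            return code, round_no
        feedback = result.comments        # report goes back to the Dev Agent
    raise RuntimeError(f"audit did not pass within {max_rounds} rounds")


# Stand-ins for real agents, only to make the sketch runnable.
def fake_dev(task: str, feedback: list[str]) -> str:
    docstring = '    """Add two numbers."""\n' if feedback else ""
    return f"def add(a, b):\n{docstring}    return a + b\n"


def fake_auditor(task: str, code: str) -> AuditResult:
    if '"""' in code:
        return AuditResult(passed=True)
    return AuditResult(passed=False, comments=["missing docstring"])
```

The round cap is a design choice rather than something from the thread: without it, a Dev Agent and an auditor that never agree would loop forever.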

Code Review and Pull Request

Use case:
The main model proposes changes in a pull request, and the monitor evaluates them independently.

Interaction algorithm:

1. Dev Agent creates a pull request with new changes.
2. Auditor Agent automatically analyzes the PR:
   - checks the code style, test coverage, and possible bugs;
   - identifies superficial or template solutions.
3. Auditor Agent leaves comments or recommendations (possibly automatically).
4. Dev Agent takes the comments into account and updates the PR.
5. The PR is re-verified.
6. The PR is merged only after passing the audit.
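A minimal sketch of such a PR gate follows, assuming the PR is available as a mapping from file path to new content. The `PullRequest` type, the `test_` file-name convention, and the individual checks are illustrative assumptions; a bare `pass` is often legitimate, so the placeholder check will produce false positives.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass
class PullRequest:
    changed_files: dict[str, str]  # path -> new file content (assumed shape)


def is_test(path: str) -> bool:
    return PurePosixPath(path).name.startswith("test_")


def review(pr: PullRequest) -> list[str]:
    """Return review comments; an empty list means the audit passed."""
    comments: list[str] = []
    source = [p for p in pr.changed_files if p.endswith(".py") and not is_test(p)]
    tests = [p for p in pr.changed_files if is_test(p)]
    if source and not tests:
        comments.append("source changed without accompanying tests")
    for path, content in pr.changed_files.items():
        for number, line in enumerate(content.splitlines(), start=1):
            if line.strip() in ("pass", "..."):
                comments.append(f"{path}:{number}: placeholder body")
            if len(line) > 100:
                comments.append(f"{path}:{number}: line longer than 100 characters")
    return comments


def can_merge(pr: PullRequest) -> bool:
    return not review(pr)
```

Keeping `review` separate from `can_merge` means the same comments can be posted to the PR and used as the merge condition.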

Quality Control (CI/CD Pipeline)

Use case:
The monitor is integrated into CI/CD as a separate verification step.

Interaction algorithm:

1. Dev Agent pushes changes to the repository.
2. CI runs standard tests and builds.
3. Auditor Agent performs additional analysis (for example, looks for suspicious patterns or undocumented code).
4. If the monitor finds a problem, CI stops the process and sends a report to the developer.
5. Dev Agent fixes the problem and repeats the cycle.
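As a sketch of the extra CI step, a small script can scan the pushed files for suspicious patterns and return a non-zero status so the pipeline stops. The two patterns here (skipped tests and silenced exceptions) are examples chosen for illustration, not a vetted rule set.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real rule set would be project-specific.
SUSPICIOUS = {
    "skipped test": re.compile(r"@pytest\.mark\.skip\b|@unittest\.skip\b"),
    "silenced exception": re.compile(r"except[^\n:]*:\s*pass\b"),
}


def scan_text(text: str) -> list[str]:
    """Return the names of all suspicious patterns found in the text."""
    return [name for name, pattern in SUSPICIOUS.items() if pattern.search(text)]


def main(paths: list[str]) -> int:
    """Print findings and return a process exit status (0 = clean)."""
    findings: list[str] = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        findings.extend(f"{path}: {name}" for name in scan_text(text))
    for finding in findings:
        print(finding)
    return 1 if findings else 0

# In a CI job this would be invoked as, for example:
#     sys.exit(main(sys.argv[1:]))
```

Because the script only reads files and sets an exit status, it fits into CI as a step that can block the pipeline without touching the code.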

Analyzing performance and behavior in production

Use case:
The main system is running in production, and the monitor watches its behavior and flags anomalies.

Interaction algorithm:

1. The main system processes user requests.
2. The monitor analyzes logs, metrics, and abnormal patterns (e.g., repeated errors, delays).
3. If deviations are detected, it generates an alert or report.
4. DevOps/developers receive the alert and respond.
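A sketch of the log-analysis step follows, assuming log records have already been parsed into a level, a message, and a latency. The `LogEntry` shape and both thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    level: str
    message: str
    latency_ms: float


def find_anomalies(entries: list[LogEntry], repeat_threshold: int = 3,
                   max_latency_ms: float = 2000.0) -> list[str]:
    """Return human-readable alerts for repeated errors and slow requests."""
    alerts: list[str] = []
    errors = Counter(e.message for e in entries if e.level == "ERROR")
    for message, count in errors.items():
        if count >= repeat_threshold:
            alerts.append(f"repeated error x{count}: {message}")
    slow = [e for e in entries if e.latency_ms > max_latency_ms]
    if slow:
        alerts.append(f"{len(slow)} request(s) slower than {max_latency_ms:.0f} ms")
    return alerts
```

The function only produces alerts; deciding what to do with them stays with the people or agents who receive them.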

Post-hoc Audit

Use case:
After development is complete, the monitor performs a retrospective audit of the changes.

Interaction algorithm:

1. The completed product or version is analyzed by the Auditor Agent.
2. It checks for hidden errors, workarounds, and inconsistencies with requirements.
3. A report with recommendations for the next iteration is generated.
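A very small version of such a retrospective check could look like the sketch below. It assumes requirements carry IDs of the form `REQ-<number>` that are mentioned somewhere in the code base; that convention, the marker list, and the function name are inventions for this example.

```python
import re

MARKER = re.compile(r"#\s*(TODO|FIXME|HACK|XXX)\b")  # common workaround markers
REQ_ID = re.compile(r"\bREQ-\d+\b")                   # hypothetical ID format


def post_hoc_report(sources: dict[str, str],
                    requirements: set[str]) -> dict[str, list[str]]:
    """Scan a finished version for workarounds and unreferenced requirements."""
    workarounds: list[str] = []
    referenced: set[str] = set()
    for path, text in sorted(sources.items()):
        for number, line in enumerate(text.splitlines(), start=1):
            match = MARKER.search(line)
            if match:
                workarounds.append(f"{path}:{number}: {match.group(1)}")
        referenced.update(REQ_ID.findall(text))
    return {
        "workarounds": workarounds,
        "unreferenced_requirements": sorted(requirements - referenced),
    }
```

A requirement that is never referenced is only a hint that it may be unimplemented, so the output is better read as a list of questions for the next iteration than as a verdict.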

General principles of interaction:

Communication interface: Reports, comments, automated notifications in the version control system.
Evaluation criteria: Defined separately for each stage (testing, style, security, coverage, performance).
Independence: The monitor operates autonomously and does not directly affect the decisions of the main system.
Reaction: All adjustments are made by the main team/agent based on the feedback received.

joyshmitz Jun 10, 2025
Author

In general, this is a context for imagining the possible future of software, if it even keeps that name. Marketing will probably come up with a new one, like “problem solver”, with a script along these lines:

You received new equipment in a factory, farm, or building, and the “problem solver” reported that it adapted the drivers to the local control system by writing and testing new drivers added to the home repository...

“All you need is imagination”
Sponge Bob

bmadcode · 2025-06-12T03:42:10Z

Maintainer

This is a very good technique to use in general. I do it somewhat manually right now, using a new chat window/context to review code that was just produced. It creates much better results. It will be nice to be able to automate this at some point; I am pretty sure it will happen soon.

0 replies
