OpenClaw vs Claude Code vs Hermes vs MCPlato: AI Agent Harness Deep Dive 2026
A data-driven comparison of the four leading AI Agent Harnesses in 2026. We analyze OpenClaw, Claude Code, Hermes Agent, and MCPlato across architecture, benchmarks, pricing, and real-world fit.
Published on 2026-04-10
The race to build the definitive AI Agent Harness—the layer that sits between you and large language models—has become one of the most consequential battles in modern software. In 2026, a "harness" is no longer just a chat wrapper. It is the operating environment that decides how agents reason, remember, execute code, interface with files, and collaborate with humans.
This article examines four distinct contenders that represent four different philosophies:
- OpenClaw: the open, modular message-platform OS.
- Claude Code: the terminal-native professional code agent.
- Hermes Agent: the research-first self-improving framework.
- MCPlato: the AI-native, local-first desktop workspace.
Each makes different trade-offs between openness, control, performance, and ease of use. Let's unpack them with verified data.
Product Snapshots
OpenClaw: The Community OS for Personal AI
Developed by Peter Steinberger and an active open-source community, OpenClaw is an MIT-licensed project that has accumulated roughly 354k GitHub stars—the largest community footprint in this comparison by a wide margin.[1]
OpenClaw treats the harness as a personal operating system. It is built around a message-platform-first architecture where conversations are first-class entities, not ephemeral prompts. Users can wire multiple models, tools, and memory backends into a single thread. The cost model is simple: the framework is free; you bring your own API keys.
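To make the BYOK idea concrete, here is a minimal sketch of per-task model routing inside a thread. All class and method names here are illustrative assumptions for this article, not OpenClaw's actual API:

```python
# Hypothetical sketch of a BYOK (bring-your-own-keys) model router in a
# message-platform harness. Names are assumptions, not OpenClaw's API.
from dataclasses import dataclass, field


@dataclass
class ModelRoute:
    provider: str   # e.g. "anthropic", "openai", "local"
    model: str      # model identifier the provider expects
    api_key: str    # user-supplied key: the harness never owns billing


@dataclass
class ThreadRouter:
    """Routes each message in a thread to a user-configured model."""
    routes: dict[str, ModelRoute] = field(default_factory=dict)
    default: str = "chat"

    def register(self, task: str, route: ModelRoute) -> None:
        self.routes[task] = route

    def resolve(self, task: str) -> ModelRoute:
        # Fall back to the default route for unknown task types.
        return self.routes.get(task, self.routes[self.default])


router = ThreadRouter()
router.register("chat", ModelRoute("anthropic", "claude-sonnet", "sk-user-1"))
router.register("code", ModelRoute("local", "qwen-coder", "none"))

print(router.resolve("code").provider)    # local
print(router.resolve("summarize").model)  # falls back to claude-sonnet
```

The point of the sketch is the ownership model: the routing table and every API key live in user configuration, so swapping vendors is a one-line change rather than a migration.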
The catch? The Web UI is polarizing—some users love its density; others find it overwhelming. Configuration can be heavy, and power users frequently report rapid token consumption when many tools are enabled in a single session.
Claude Code: Anthropic's Terminal-Native Agent
Anthropic's Claude Code is the harness most deeply integrated into the developer terminal. With 112k GitHub stars, it is already one of the most starred developer tools of 2026.[2]
Unlike OpenClaw's browser-centric model, Claude Code is a client-side application that speaks directly to the filesystem, git, and common developer workflows. It excels at codebase-wide reasoning, refactoring, and debugging. Both the client and the underlying models are proprietary to Anthropic.
The catch? Rate-limit errors (HTTP 429) are a recurring pain point for power users, and subscription costs can escalate quickly for teams running high-compute sessions.
Hermes Agent: Nous Research's Self-Improving Framework
From the research collective Nous Research, Hermes Agent is an MIT-licensed framework with 48.7k GitHub stars that places persistent memory and self-improvement loops at the center of its design.[3]
Where OpenClaw optimizes for chat UX and Claude Code optimizes for code execution, Hermes optimizes for long-horizon autonomy. Its memory layer allows agents to accumulate skills, refine prompts, and improve their own tool-use policies across sessions. The project is still early in ecosystem maturity, and documentation is a known work in progress.
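The memory-first design can be illustrated with a small cross-session skill store. Everything below is a generic sketch in the spirit of that architecture; the names are assumptions, not Hermes's documented API:

```python
# Illustrative sketch of cross-session skill memory: an agent records
# which tool-use hints earned the best reward, and a later session
# reloads the refined policy. Not Hermes Agent's actual implementation.
import json
from pathlib import Path


class SkillMemory:
    """Persists learned tool-use hints between agent sessions."""

    def __init__(self, path: Path):
        self.path = path
        self.skills = json.loads(path.read_text()) if path.exists() else {}

    def record(self, tool: str, hint: str, reward: float) -> None:
        # Keep only the highest-reward hint per tool.
        best = self.skills.get(tool)
        if best is None or reward > best["reward"]:
            self.skills[tool] = {"hint": hint, "reward": reward}

    def save(self) -> None:
        self.path.write_text(json.dumps(self.skills, indent=2))


mem = SkillMemory(Path("skills.json"))
mem.record("web_search", "quote exact error messages", 0.8)
mem.record("web_search", "search docs site first", 0.6)  # lower reward: ignored
mem.save()

# A later session reloads the refined policy.
later = SkillMemory(Path("skills.json"))
print(later.skills["web_search"]["hint"])  # quote exact error messages
```

The interesting property is that the loop compounds: each session starts from the best policy any prior session found, which is what "self-improvement across sessions" means in practice.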
The catch? The framework is powerful but raw. It rewards researchers and patient tinkerers more than users who want a polished out-of-the-box experience.
MCPlato: The AI-Native Desktop Workspace
MCPlato is one of two closed-source contenders in this lineup (alongside Claude Code). Built by the MCPlato team, it is designed as an AI Native Workspace with a local-first desktop philosophy. Unlike the terminal-heavy harnesses, MCPlato presents a unified desktop environment where AI agents operate inside sandboxed workspaces alongside files, notes, and browser contexts.
The product prioritizes ease of setup over endless configurability. There is no YAML tuning required to get a multi-agent workflow running. That convenience comes at the cost of source-level transparency, and public community discourse remains limited compared to the open-source giants.
Technical Architecture Comparison
| Attribute | OpenClaw | Claude Code | Hermes Agent | MCPlato |
|---|---|---|---|---|
| License | MIT (fully open) | Closed source (proprietary) | MIT (fully open) | Closed source |
| Distribution | Web-first, self-hosted | Terminal-native CLI | Framework / library | Desktop application |
| Core Abstraction | Message platform / thread OS | Code Agent in the shell | Persistent memory + self-improvement loop | AI-native workspace |
| Model Vendor Lock-in | None (BYOK) | Anthropic models | None (BYOK) | Multi-model (managed) |
| Extensibility | Plugin marketplace, custom tools | MCP (Model Context Protocol) | Research-oriented hooks | Built-in tool sandbox |
| Execution Model | Cloud / self-hosted server | Local CLI, cloud inference | Local or distributed | Local-first desktop |
A few patterns stand out:
- OpenClaw and Hermes share the BYOK (bring-your-own-keys) model, making them attractive for cost control and model flexibility.
- Claude Code bets on the terminal as the canonical developer interface, which gives it unparalleled speed for file operations but limits appeal for non-engineers.
- MCPlato sits in a different quadrant entirely: closed source, local-first, and workspace-centric rather than thread- or terminal-centric.
Feature Matrix
| Capability | OpenClaw | Claude Code | Hermes Agent | MCPlato |
|---|---|---|---|---|
| Multi-model routing | Native | Anthropic only | Native | Managed multi-model |
| Persistent memory | Via plugins | Session-based context | First-class | Workspace-level state |
| Code execution | Via integrations | Deep native integration | Via tooling | Sandbox + terminal |
| Collaboration / sharing | Thread sharing | Git-based workflow | Experimental | Workspace sync |
| Mobile / web access | Strong web UI | CLI only | API-first | Desktop only |
| Custom tool building | High | MCP protocol | Very high | Moderate (pre-built) |
Notably, Claude Code dominates the code-execution column but is the weakest in multi-model flexibility. Hermes leads in memory architecture but lags in polished UX. OpenClaw offers the broadest configurability, while MCPlato trades some flexibility for a lower time-to-first-value.
Performance Benchmarks
We limit this section to publicly verified numbers only.
SWE-bench Verified (Code Agent benchmark)
| Product / Model | Score | Notes |
|---|---|---|
| Claude Opus 4 | 72.5% (79.4% with high compute) | Anthropic official result[4] |
| Claude Sonnet 4 | 72.7% (80.2% with high compute) | Anthropic + Hugging Face verification[4] |
| OpenClaw + Sonnet 4.6 | 79.6% (specific configuration) | Verified third-party evaluation[5] |
| Hermes 4 (405B) | Not disclosed | No public SWE-bench score found |
| MCPlato | Not found | No public benchmark data available |
HumanEval (Code generation benchmark)
| Product / Model | Score | Notes |
|---|---|---|
| Claude Sonnet 4 | 88.7% | Hugging Face leaderboard[4] |
| Claude Opus 4 | ~85-90% | Anthropic reported range[4] |
| OpenClaw + Sonnet 4.6 | Not disclosed | No independent HumanEval score published |
| Hermes 4 (405B) | Not disclosed | No public HumanEval score found |
| MCPlato | Not found | No public benchmark data available |
What the numbers tell us
- Anthropic's own models are the current benchmark leaders. Both Opus 4 and Sonnet 4 score around 72–73% on standard SWE-bench Verified, and climb to roughly 79–80% when granted extended reasoning budgets.
- OpenClaw can beat the published raw-model scores when paired with Sonnet 4.6 in a tuned harness configuration (79.6%). The comparison is not strictly apples-to-apples, since Anthropic's baselines use Sonnet 4, but it suggests that harness-level orchestration (prompt engineering, tool selection, and retry policies) can materially improve outcomes.
- Hermes and MCPlato have not published independent coding benchmarks. For Hermes, this aligns with its research focus on general autonomy rather than competitive SWE-bench optimization. For MCPlato, the closed-source nature means users must evaluate fit through direct trial.
Pricing Models
| Product | Pricing Structure |
|---|---|
| OpenClaw | Free (MIT). You pay only for LLM API usage. |
| Claude Code | Pro at $20/month; Max 5x at $100/month; Max 20x at $200/month.[4] |
| Hermes | Free (MIT). You pay only for LLM API usage. |
| MCPlato | Free tier (300 credits); Pro at $20/month; Pro+ at $50/month; Pro Max at $200/month.[6] |
Cost sentiment from user feedback:
- OpenClaw users praise the lack of a vendor tax but warn that unconstrained tool loops can burn through API budgets rapidly.
- Claude Code users consistently rank it as the most expensive option for serious professional use, though many justify the cost through time savings.
- Hermes inherits the same API-cost profile as OpenClaw but adds the research overhead of running custom inference stacks.
- MCPlato sits closest to Claude Code in SaaS-like pricing but offers a free tier for light usage and bundles model access into its credit system.
How to Choose: Scenario-Based Recommendations
Choose Claude Code if…
- You live in the terminal and want the highest-verified coding performance.
- You value deep git, file-system, and IDE integration over UI polish.
- You are willing to pay a subscription premium for a managed, state-of-the-art model backend.
Choose OpenClaw if…
- You want total ownership of your harness stack and the ability to hot-swap models.
- You prefer a message-centric UI where conversations are persistent and shareable.
- You are comfortable with heavier upfront configuration in exchange for zero vendor lock-in.
Choose Hermes Agent if…
- Your primary interest is long-horizon autonomy, memory research, or self-improving agents.
- You are building experimental agent systems rather than shipping daily product code.
- You can tolerate early-stage documentation in exchange for architectural flexibility.
Choose MCPlato if…
- You want an integrated desktop workspace that works out of the box without YAML wrangling.
- Local-first execution, sandboxing, and visual workspace organization matter more than terminal speed.
- You prefer a SaaS-like experience with tiered pricing over self-hosting and API key management.
The MCPlato Perspective
MCPlato enters this market not as a chat app or a CLI plugin, but as a fundamentally different container for AI work. While OpenClaw asks, "How configurable can a conversation be?" and Claude Code asks, "How deeply can an agent understand a codebase?", MCPlato asks, "What if the computer itself were rebuilt around agents?"
That philosophy manifests in three product choices:
- Workspace over thread. MCPlato does not optimize for a single chat pane. It optimizes for a persistent, multi-panel workspace where files, agents, browser views, and notes coexist.
- Sandbox over shell. Code and tool execution happen inside managed sandboxes rather than directly against the user's host OS. This adds latency for some power users but dramatically reduces blast radius for everyone else.
- Managed over self-hosted. By handling model routing, credit billing, and sandbox provisioning, MCPlato removes the DevOps burden that OpenClaw and Hermes users must accept.
The honest trade-off is visibility. You cannot audit MCPlato's source, and its public benchmark footprint is still growing. It is best evaluated as a productivity workspace rather than a research platform.
Conclusion
There is no single "best" AI Agent Harness in 2026. The right choice depends on where you sit on three axes: openness versus convenience, terminal versus workspace, and coding specialization versus general autonomy.
- Claude Code owns the professional coding niche with the strongest verified benchmarks and terminal integration, at a premium price.
- OpenClaw owns the open, configurable conversation OS niche with unparalleled community scale and model freedom, at the cost of UI friction.
- Hermes owns the research frontier with its memory-first, self-improving architecture, aimed at builders of tomorrow's agents rather than today's products.
- MCPlato carves out a distinct local-first workspace for users who value integration, sandboxing, and out-of-the-box execution over deep configurability.
If your decision paralysis persists, a simple heuristic works: start with the tool whose interface matches where you already spend most of your day—the terminal for Claude Code, the browser for OpenClaw, the notebook for Hermes, or the desktop for MCPlato. The harness that fits your environment will feel less like a new app to learn and more like a natural extension of your workflow.
References
1. OpenClaw GitHub repository and community metrics. https://github.com/openclaw
2. Anthropic, "Claude Code" client repository. https://github.com/anthropics/claude-code
3. Nous Research, "Hermes Agent" repository. https://github.com/nousresearch/hermes
4. Anthropic, "Claude 4" announcement (includes SWE-bench Verified and pricing details). https://www.anthropic.com/news/claude-4
5. developer.tenten.co, OpenClaw + Sonnet 4.6 SWE-bench Verified evaluation. https://developer.tenten.co
6. MCPlato pricing page. https://mcplato.com/pricing
