At a Glance

Most on-chain agent identities are placeholders and the public ratings recorded on-chain are easy and cheap to manipulate, so do not rely on raw on-chain scores as a trust signal.

What They Found

Across Ethereum, BNB Smart Chain, and Base, a large share of registered agent identities never become active, and only a small fraction expose valid registration files or service endpoints. The reputation registry stores public feedback but fails four basic requirements for a trustworthy score: values are not comparable, feedback is not tied to verifiable interactions, single inputs can swing aggregated scores, and reputation can be fabricated or erased at minimal cost. Sybil-style manipulation dominates reviewer populations on all chains, and the protocol’s current defenses (tag filters, off-chain aggregation, reviewer filtering) are not yet effective in practice.
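To make the "single inputs can swing aggregated scores" point concrete, here is a minimal Python sketch. The ratings and the 0 to 100 scale are made-up illustration values, not data from the study, and `single_input_shift` is a hypothetical helper name.

```python
from statistics import mean, median

def single_input_shift(ratings, new_rating):
    """Return how far one extra rating moves the mean and the median."""
    before_mean, before_median = mean(ratings), median(ratings)
    after = ratings + [new_rating]
    return mean(after) - before_mean, median(after) - before_median

# Hypothetical agent with five honest ratings on an assumed 0-100 scale.
honest = [80, 82, 85, 88, 90]

# One hostile wallet submits a single rating of 0.
mean_shift, median_shift = single_input_shift(honest, 0)
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
# prints "mean shift: -14.17, median shift: -1.50"
```

With only a handful of ratings per agent, one submission moves a plain mean by double digits while the median barely changes, which is why the choice of aggregator matters.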

By the Numbers

1. Only 3%–15% of registered identities across chains expose a valid ERC-8004 registration file with at least one declared service endpoint.
2. Sybil-flagged reviewers account for 73.6% (Ethereum), 59.2% (BSC), and 90.6% (Base) of reviewers, affecting the displayed reputations of 26.1%, 75.8%, and 96.8% of rated agents respectively.
3. Median cost to fabricate or destroy an agent’s reputation is extremely low: $0.055 on Ethereum, $0.0042 on BSC, and $0.0027 on Base.

What This Means

Engineers building autonomous agents and marketplaces should not treat raw on-chain registrations or mean scores as reliable trust signals; they need to validate activity and evidence before onboarding counterparts. Protocol designers and security teams should adopt the paper’s recommendations (liveness checks, typed rating tags, evidence-backed feedback, robust aggregation, and default Sybil defenses) to make on-chain reputation usable in production.
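As a rough illustration of validating a counterpart before onboarding, the sketch below checks whether an already-fetched and parsed registration file declares at least one usable service endpoint. The field names (`endpoints`, `endpoint`) and the function name are assumptions made for this example, not taken from the paper or confirmed against the ERC-8004 specification, and the check is structural only: it does not contact the endpoint.

```python
from urllib.parse import urlparse

def has_declared_endpoint(registration: dict) -> bool:
    """True if the parsed registration file declares at least one http(s) endpoint.

    Assumption: endpoints are listed under the key "endpoints", each entry a
    dict with an "endpoint" URL string. These field names are illustrative.
    """
    endpoints = registration.get("endpoints")
    if not isinstance(endpoints, list):
        return False
    for entry in endpoints:
        if not isinstance(entry, dict):
            continue
        url = entry.get("endpoint")
        if not isinstance(url, str):
            continue
        parsed = urlparse(url)
        # Require an explicit web scheme and a host; anything else is treated as unusable.
        if parsed.scheme in {"http", "https"} and parsed.netloc:
            return True
    return False

# Hypothetical registration documents.
active = {"endpoints": [{"name": "A2A", "endpoint": "https://agent.example/card.json"}]}
placeholder = {"name": "agent-123"}
print(has_declared_endpoint(active), has_declared_endpoint(placeholder))
```

A production liveness check would presumably go further, for example by probing the endpoint and re-checking over time, since off-chain files can change after they are fetched.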

Key Figures

Figure 2. Cumulative agent registrations (solid), valid ERC-8004 registrations (dashed), and unique registration transactions (shaded) across chains.
Figure 3
Figure 5. Distribution of agent URI activation status at the end of the observation period, by chain.
Figure 7. Distribution of agent URI scheme by chain.

Yes, But...

The study covers only three EVM chains (Ethereum, BSC, Base) and a specific time window ending May 13, 2026, so patterns may shift as adoption evolves. The Validation Registry (designed for independent attestations) had no active mainnet deployment during the window and therefore was not analyzed. Off-chain registration and feedback files can change after collection, so some observed URI or evidence gaps may be transient.

Methodology & More

The authors built a comprehensive dataset of every Identity and Reputation event for the ERC-8004 registries on Ethereum, BNB Smart Chain, and Base up to May 13, 2026, and fetched the referenced off-chain registration and feedback files, gas costs, and related payment proofs where available. They tracked identity lifecycles (mint versus activation), per-agent URIs and registration-file compliance, reviewer participation patterns, feedback tag usage, and whether feedback was backed by verifiable interaction evidence (for example, settled machine-to-machine payments). They then evaluated whether the deployed Reputation Registry satisfies four necessary conditions for a trustworthy score: commensurability, robustness, groundedness, and economic soundness.

Key findings: registrations are dominated by batch-minted placeholders and templated deployments, with ownership concentrated and many URIs missing or noncompliant. The Reputation Registry stores signed numeric feedback but offers no enforced typing, no required evidence linking feedback to real interactions, and a plain mean aggregation that can be moved by a single wallet or a small set of wallets. Sybil-style manipulation is widespread and, because submitting feedback is nearly free, changing a reputation costs fractions of a cent on some chains.

The paper turns these observations into concrete, implementable recommendations: require a canonical liveness predicate for identities; type rating tags with unit and range; supply a safe default aggregator (median or trimmed mean, clamped ranges, per-reviewer caps, evidence-weighting); require or at least flag evidence-backed feedback; impose a cost or stake that scales with influence; provide default Sybil filters; and add cross-chain identity binding only after per-chain integrity is solved.
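The "safe default aggregator" recommendation can be sketched in a few lines of Python. This is one possible reading under stated assumptions, not the paper's reference design: the function name, the (reviewer, value) input shape, the 0 to 100 range, and the choice to keep each reviewer's most recent entries are all hypothetical, and evidence-weighting is left out.

```python
from collections import defaultdict
from statistics import median

def robust_score(feedback, lo=0.0, hi=100.0, per_reviewer_cap=1):
    """Aggregate (reviewer, value) pairs into one score, or None if empty.

    Steps: clamp each value to [lo, hi], keep at most `per_reviewer_cap`
    of each reviewer's most recent entries, then take the median.
    """
    if per_reviewer_cap < 1:
        raise ValueError("per_reviewer_cap must be at least 1")
    by_reviewer = defaultdict(list)
    for reviewer, value in feedback:
        # Clamping stops out-of-range values from dominating the result.
        by_reviewer[reviewer].append(min(max(value, lo), hi))
    kept = []
    for values in by_reviewer.values():
        # The cap limits how much weight any single wallet can carry.
        kept.extend(values[-per_reviewer_cap:])
    return median(kept) if kept else None

# Hypothetical feedback: three honest reviewers and one wallet spamming low values.
feedback = [("a", 80), ("b", 85), ("c", 90),
            ("spam", 0), ("spam", 0), ("spam", 0), ("spam", -500)]
print(robust_score(feedback))  # prints 82.5
```

On this made-up input a plain mean of the raw values would be -35, while the clamped, capped median is 82.5. Per-reviewer caps do nothing against an attacker who spreads feedback across many fresh wallets, which is presumably why the recommendations pair robust aggregation with Sybil filters and a cost or stake that scales with influence.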
Credibility Assessment:

The author list includes William Knottenbelt, a known researcher in blockchain systems, which suggests moderate credibility.