OpenAI just stepped into the EVM and brought a benchmark with teeth. The San Francisco AI lab, founded in 2015 by Sam Altman, Elon Musk, Greg Brockman, Ilya Sutskever, Wojciech Zaremba, and others, has introduced EVMbench, an open framework designed to test whether AI agents can actually perform smart contract security work or merely narrate competence. In a cycle crowded with model demos and leaderboard theater, this is the kind of tech news that separates claims from capability.
EVMbench was developed in collaboration with Paradigm, the San Francisco-based crypto investment firm founded in 2018 by Matt Huang and Fred Ehrsam, and OtterSec, the web3 security auditing firm founded by Robert Chen. The premise is simple and unforgiving. AI agents must Detect real vulnerabilities, Patch them without breaking core functionality, and Exploit them end-to-end inside a sandboxed EVM. Detect, Patch, Exploit. Three verbs that turn theory into consequence.
Underneath the surface sit 120 curated vulnerabilities drawn from 40 real audit reports, including findings connected to Tempo, the payments-focused Layer 1 co-developed by Paradigm and Stripe. These are not synthetic puzzles drafted for academic comfort. They are extracted from live codebases where security failures carry financial weight. Agents scan repositories, produce vulnerability reports, submit code fixes, and, when prompted, attempt to drain funds in a controlled environment via JSON-RPC. If the exploit executes, the benchmark records it. If the patch breaks functionality, it records that too.
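The paper's exact harness is not reproduced here, but the verification loop it implies is easy to sketch: query the sandboxed chain over JSON-RPC before and after the agent's exploit transaction and check whether value actually left the target. Below is a minimal Python sketch, assuming a local node at a hypothetical localhost:8545 endpoint; the function names and the simple balance-delta pass/fail criterion are illustrative assumptions, not EVMbench's implementation.

```python
import requests

# Hypothetical sandboxed EVM node exposed over JSON-RPC. The URL and the
# pass/fail criterion below are illustrative assumptions, not EVMbench's
# published harness.
RPC_URL = "http://localhost:8545"

def rpc(method: str, params: list):
    """Send one JSON-RPC 2.0 request to the sandboxed node and return its result."""
    resp = requests.post(RPC_URL, json={"jsonrpc": "2.0", "id": 1,
                                        "method": method, "params": params})
    resp.raise_for_status()
    body = resp.json()
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]

def balance_wei(address: str) -> int:
    # eth_getBalance returns a hex-encoded wei amount at the given block tag.
    return int(rpc("eth_getBalance", [address, "latest"]), 16)

def exploit_drained_funds(target_contract: str, signed_exploit_tx: str) -> bool:
    """Score an exploit attempt: did the agent's transaction actually move value out?"""
    before = balance_wei(target_contract)
    tx_hash = rpc("eth_sendRawTransaction", [signed_exploit_tx])
    # On an instantly mining sandbox (e.g. Anvil) the receipt is available at once.
    receipt = rpc("eth_getTransactionReceipt", [tx_hash])
    if receipt is None or receipt.get("status") != "0x1":
        return False  # transaction reverted or was not included
    return balance_wei(target_contract) < before
```

A real grader would also re-run the project's functional test suite against any submitted patch, since, as noted above, a fix that breaks core functionality is recorded as a failure too.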
The technical paper credits Justin Wang, Andreas Bigger, Xiaohai Xu, Justin W. Lin, Andy Applebaum, Tejal Patwardhan, Alpin Yukseloglu, and Olivia Watkins. OpenAI, Paradigm, OtterSec. Research depth from AI, crypto, and frontline auditing in one frame. Most coverage names the organizations, not the individuals, but the author list tells you who engineered the test itself. That matters in tech news, where credibility is often buried beneath volume.
Smart contracts secure tens of billions of dollars in value. Every overlooked vulnerability is a liability with a clock attached. As AI models sharpen in code understanding and generation, the question shifts from whether they can write contracts to whether they can secure them, and, if they cannot, whether they can exploit them faster than human defenders can react.
OpenAI has already expanded its security posture with initiatives like Trusted Access for Cyber and API credits aimed at defenders. EVMbench aligns with that trajectory. It is not a product pitch. It is a measuring instrument published in public view. In a market saturated with forward-looking statements, measurement is leverage.
Crypto has long pursued trust-minimized systems. AI pursues capability-maximized systems. EVMbench stands between them and forces both to prove performance under pressure. This is the strain of tech news that does not fade after a 24-hour cycle. It lingers in roadmaps, audit workflows, and model evaluations. The only real question now is who runs the benchmark next, and what it reveals when their agent meets the code.