Benchmark Testing and Analysis Lead

Remote

Full-time

Visa Sponsorship

About the role

We are looking for a technical researcher to own how we evaluate frontier models on the ARC-AGI benchmarks. This person will run new models end-to-end, mine the data exhaust from every run, and translate what we learn into reports and public communication that shape the conversation on where model capability is heading. This is a remote, full-time role.

What You'll Do:

Own our model benchmarking and testing process, and run new frontier models against ARC-AGI-1, ARC-AGI-2, and ARC-AGI-3 as they ship
Build and own the ARC Prize Analysis Package - a repeatable report produced for every new frontier model, turning raw logs into insight on capability, failure modes, and gaps
Own the official and community leaderboards end-to-end - from scoring pipeline to public page
Serve as primary contact for new labs testing on ARC-AGI, and communicate findings externally via Twitter, newsletter, and policy and partner briefings

What We're Looking For:

Research background with hands-on model evaluation experience - you've run evals before and know how to read the results (model training experience not required)
Deep understanding of how modern models work and fail, and comfort building your own tooling and analysis to answer the questions you care about
Strong ownership instinct and clear technical communication

Example outputs this role would produce: a model score announcement and a .

About ARC Prize Foundation

ARC Prize builds AI benchmarks that measure general intelligence. Our benchmark, ARC-AGI, has been used by OpenAI, Anthropic, Google DeepMind, and xAI.

Founded by Mike Knoop and Francois Chollet, we inspire open-source artificial general intelligence (AGI) research through benchmarks (the ARC-AGI series), global competitions, research grants, community, and content. We exist to guide researchers, industry, and regulators on the path to AGI.

We believe that AGI requires more than just scaling up existing AI models. It demands a fundamental shift towards systems capable of genuine fluid intelligence, the ability to adapt to novel challenges and solve problems efficiently, much like humans do.
