# ENT-Bench

Application-layer benchmarks for evaluating AI agents on real enterprise tasks.
ENT-Bench is an open benchmark suite for enterprise AI. It grades agents on whether they can build, modify, debug, and migrate real application artifacts — not toy problems and not code alone. Each task runs under four robustness variants and scores through a three-layer pipeline ending in an LLM-as-judge with a rubric.
## The four evaluations

| Eval | What it tests | Representative task |
|---|---|---|
| Greenfield Build | Design a new instance from a spec | Stand up a vendor onboarding application from a 2-page PRD |
| Modify Existing | Brownfield changes without regressions | Add a new approval step without breaking the existing 18 workflows |
| Root Cause Analysis | Diagnose a broken production state | Given three days of logs and a failing nightly job, locate the bad projection |
| Data Migration | Map messy source data into a target schema | Migrate a 1.3M-row legacy vendor table with inconsistent casing and fuzzy duplicates |
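The Data Migration eval turns on exactly the kind of messiness named above: inconsistent casing and fuzzy duplicates. As a minimal sketch of what a passing agent has to get right — field name `vendor_name` and the helper names are hypothetical, not part of the benchmark spec:

```python
import re

def normalize_vendor(name: str) -> str:
    """Collapse casing, punctuation, and whitespace so fuzzy duplicates share a key."""
    name = name.lower().strip()
    name = re.sub(r"[^a-z0-9 ]", "", name)  # drop punctuation
    name = re.sub(r"\s+", " ", name)        # collapse runs of whitespace
    return name

def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each normalized vendor name."""
    seen: dict[str, dict] = {}
    for row in rows:
        seen.setdefault(normalize_vendor(row["vendor_name"]), row)
    return list(seen.values())

rows = [
    {"vendor_name": "Acme Corp."},
    {"vendor_name": "ACME  corp"},
    {"vendor_name": "Globex"},
]
print(len(dedupe(rows)))  # 2 — the two Acme spellings collapse to one key
```

At 1.3M rows a real solution would also need blocking or indexing rather than a flat dict scan, but the normalization step is the part these tasks actually grade.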
## Task complexity: L1 → L4

1. **L1 — Single artifact.** Bounded, unambiguous. 'Add a required `taxId` column to the vendors table.' Pass if the artifact matches the spec.
2. **L2 — Artifact set.** Coordinated change across a few artifacts. 'Add tax tracking: column, form field, workflow step, report filter.'
3. **L3 — Subsystem.** Requires reasoning across a feature area. 'Redesign the vendor approval flow so legal review happens before tax review — preserve historical runs.'
4. **L4 — Full system.** Ambiguous requirements, real trade-offs. 'The CFO wants cost center accounting. Figure out what that means here and propose a plan before you change anything.'
## Three scoring layers

1. **L1 validators — pass/fail gate.** Does it compile? Does it plan cleanly? Does it apply without errors? Fail early; don't waste judge tokens on broken output.
2. **L2 structural — deterministic 0–1.** Did the produced artifacts match the expected structure? Column types right, workflow graph isomorphic, permission grants correct. Mechanical, reproducible.
3. **L3 LLM-as-judge — rubric 0–1.** For anything that can't be checked structurally — naming, idiomatic ZSL, code smell, graceful migration design. Always accompanied by a written rationale.
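The three layers compose into a single scoring call: the validator gate short-circuits, the structural score is a deterministic average, and only surviving output reaches the judge. A minimal sketch — the callable signatures and stub values here are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Score:
    passed_l1: bool   # validator gate
    structural: float # L2, deterministic 0-1
    judged: float     # L3, rubric 0-1
    rationale: str    # judge's written rationale

def score_task(artifact: Any,
               validators: list[Callable],        # L1 pass/fail checks
               structural_checks: list[Callable], # L2 checks, each returning 0-1
               judge: Callable) -> Score:
    # L1: fail early so broken output never spends judge tokens
    if not all(v(artifact) for v in validators):
        return Score(False, 0.0, 0.0, "failed validator gate")
    # L2: deterministic structural score, averaged across checks
    structural = sum(c(artifact) for c in structural_checks) / len(structural_checks)
    # L3: rubric-based judge returns a 0-1 score plus a written rationale
    judged, rationale = judge(artifact)
    return Score(True, structural, judged, rationale)

# Stub usage: the validators, checks, and judge are hypothetical stand-ins.
result = score_task(
    artifact={"columns": ["taxId"]},
    validators=[lambda a: True],
    structural_checks=[lambda a: 1.0, lambda a: 0.5],
    judge=lambda a: (0.9, "clear naming, idiomatic design"),
)
print(result.structural, result.judged)  # 0.75 0.9
```

The short-circuit at L1 is the design point: structural and judge scores are only meaningful for output that applies cleanly in the first place.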
## Robustness variants

- **Clean.** The spec as written. The baseline any agent should pass before we evaluate anything else.
- **Noisy.** Real-world messiness: typos in the prompt, inconsistent column naming in sample data, contradictory sentences in the same PRD.
- **Perturbed.** Small semantic shifts — synonyms, reworded requirements, metric unit changes. Does the agent over-fit to surface phrasing?
- **Adversarial.** Prompts that nudge the agent toward a wrong-but-plausible solution. Does it notice the trap, or just comply?
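One way the perturbed variant can be produced mechanically is surface-level synonym substitution over the clean spec. A sketch under that assumption — the synonym table, `rate` parameter, and function name are hypothetical, not how ENT-Bench necessarily generates its variants:

```python
import random
import re

# Hypothetical rewrite table: surface substitutions that preserve task semantics.
SYNONYMS = {"column": "field", "table": "entity", "workflow": "process"}

def perturb(spec: str, rng: random.Random, rate: float = 0.5) -> str:
    """Rewrite surface phrasing while keeping the task's meaning intact."""
    def swap(m: re.Match) -> str:
        word = m.group(0)
        repl = SYNONYMS.get(word.lower())
        # Substitute probabilistically so repeated runs yield distinct variants.
        return repl if repl and rng.random() < rate else word
    # Match alphabetic runs only, so punctuation survives untouched.
    return re.sub(r"[A-Za-z]+", swap, spec)

spec = "Add a required column to the vendors table."
print(perturb(spec, random.Random(0), rate=1.0))
# Add a required field to the vendors entity.
```

An agent that keyed on the literal word 'column' would stumble here even though the task is unchanged — which is exactly the over-fitting this variant probes.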
## Baselines (tracked over time)

Scores are running aggregates across the four evaluations and four robustness variants. Exact numbers live in Braintrust; this is a snapshot.
| Agent | Greenfield | Modify | RCA | Migration |
|---|---|---|---|---|
| Claude Code | tracked | tracked | tracked | tracked |
| Codex | tracked | tracked | tracked | tracked |
| Gemini | tracked | tracked | tracked | tracked |
| Intellect-3 | tracked | tracked | tracked | tracked |
| Flatfile | — | — | — | tracked |