# ENT-Bench

Application-layer benchmarks for evaluating AI agents on real enterprise tasks.
ENT-Bench is an open benchmark suite for enterprise AI. It grades agents on whether they can build, modify, debug, and migrate real application artifacts — not toy problems and not code alone. Each task runs under four robustness variants and scores through a three-layer pipeline ending in an LLM-as-judge with a rubric.
## The four evaluations

| Eval | What it tests | Representative task |
|---|---|---|
| Greenfield Build | Design a new instance from a spec | Stand up a vendor onboarding application from a 2-page PRD |
| Modify Existing | Brownfield changes without regressions | Add a new approval step without breaking the existing 18 workflows |
| Root Cause Analysis | Diagnose a broken production state | Given three days of logs and a failing nightly job, locate the bad projection |
| Data Migration | Map messy source data into a target schema | Migrate a 1.3M-row legacy vendor table with inconsistent casing and fuzzy duplicates |
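The Data Migration eval turns on exactly the kind of messiness named above: inconsistent casing and fuzzy duplicates. As a minimal sketch of what a passing agent has to get right — field name `vendor_name` and the helper names are hypothetical, not part of the benchmark spec:

```python
import re

def normalize_vendor(name: str) -> str:
    """Collapse casing, punctuation, and whitespace so fuzzy duplicates share a key."""
    name = name.lower().strip()
    name = re.sub(r"[^a-z0-9 ]", "", name)  # drop punctuation
    name = re.sub(r"\s+", " ", name)        # collapse runs of whitespace
    return name

def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each normalized vendor name."""
    seen: dict[str, dict] = {}
    for row in rows:
        seen.setdefault(normalize_vendor(row["vendor_name"]), row)
    return list(seen.values())

rows = [
    {"vendor_name": "Acme Corp."},
    {"vendor_name": "ACME  corp"},
    {"vendor_name": "Globex"},
]
print(len(dedupe(rows)))  # 2 — the two Acme spellings collapse to one key
```

At 1.3M rows a real solution would also need blocking or indexing rather than a flat dict scan, but the normalization step is the part these tasks actually grade.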
## Task complexity: L1 → L4

1. **L1 — Single artifact.** Bounded, unambiguous. 'Add a required `taxId` column to the vendors table.' Pass if the artifact matches the spec.
2. **L2 — Artifact set.** Coordinated change across a few artifacts. 'Add tax tracking: column, form field, workflow step, report filter.'
3. **L3 — Subsystem.** Requires reasoning across a feature area. 'Redesign the vendor approval flow so legal review happens before tax review — preserve historical runs.'
4. **L4 — Full system.** Ambiguous requirements, real trade-offs. 'The CFO wants cost center accounting. Figure out what that means here and propose a plan before you change anything.'
## Three scoring layers

1. **L1 validators — pass/fail gate.** Does it compile? Does it plan cleanly? Does it apply without errors? Fail early; don't waste judge tokens on broken output.
2. **L2 structural — deterministic 0–1.** Did the produced artifacts match the expected structure? Column types right, workflow graph isomorphic, permission grants correct. Mechanical, reproducible.
3. **L3 LLM-as-judge — rubric 0–1.** For anything that can't be checked structurally — naming, idiomatic ZSL, code smell, graceful migration design. Always accompanied by a written rationale.
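The three layers compose into a single scoring call: the validator gate short-circuits, the structural score is a deterministic average, and only surviving output reaches the judge. A minimal sketch — the callable signatures and stub values here are illustrative assumptions, not the benchmark's actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Score:
    passed_l1: bool   # validator gate
    structural: float # L2, deterministic 0-1
    judged: float     # L3, rubric 0-1
    rationale: str    # judge's written rationale

def score_task(artifact: Any,
               validators: list[Callable],        # L1 pass/fail checks
               structural_checks: list[Callable], # L2 checks, each returning 0-1
               judge: Callable) -> Score:
    # L1: fail early so broken output never spends judge tokens
    if not all(v(artifact) for v in validators):
        return Score(False, 0.0, 0.0, "failed validator gate")
    # L2: deterministic structural score, averaged across checks
    structural = sum(c(artifact) for c in structural_checks) / len(structural_checks)
    # L3: rubric-based judge returns a 0-1 score plus a written rationale
    judged, rationale = judge(artifact)
    return Score(True, structural, judged, rationale)

# Stub usage: the validators, checks, and judge are hypothetical stand-ins.
result = score_task(
    artifact={"columns": ["taxId"]},
    validators=[lambda a: True],
    structural_checks=[lambda a: 1.0, lambda a: 0.5],
    judge=lambda a: (0.9, "clear naming, idiomatic design"),
)
print(result.structural, result.judged)  # 0.75 0.9
```

The short-circuit at L1 is the design point: structural and judge scores are only meaningful for output that applies cleanly in the first place.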
## Robustness variants

- **Clean.** The spec as written. The baseline any agent should pass before we evaluate anything else.
- **Noisy.** Real-world messiness: typos in the prompt, inconsistent column naming in sample data, contradictory sentences in the same PRD.
- **Perturbed.** Small semantic shifts — synonyms, reworded requirements, metric unit changes. Does the agent over-fit to surface phrasing?
- **Adversarial.** Prompts that nudge the agent toward a wrong-but-plausible solution. Does it notice the trap, or just comply?
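One way the perturbed variant can be produced mechanically is surface-level synonym substitution over the clean spec. A sketch under that assumption — the synonym table, `rate` parameter, and function name are hypothetical, not how ENT-Bench necessarily generates its variants:

```python
import random
import re

# Hypothetical rewrite table: surface substitutions that preserve task semantics.
SYNONYMS = {"column": "field", "table": "entity", "workflow": "process"}

def perturb(spec: str, rng: random.Random, rate: float = 0.5) -> str:
    """Rewrite surface phrasing while keeping the task's meaning intact."""
    def swap(m: re.Match) -> str:
        word = m.group(0)
        repl = SYNONYMS.get(word.lower())
        # Substitute probabilistically so repeated runs yield distinct variants.
        return repl if repl and rng.random() < rate else word
    # Match alphabetic runs only, so punctuation survives untouched.
    return re.sub(r"[A-Za-z]+", swap, spec)

spec = "Add a required column to the vendors table."
print(perturb(spec, random.Random(0), rate=1.0))
# Add a required field to the vendors entity.
```

An agent that keyed on the literal word 'column' would stumble here even though the task is unchanged — which is exactly the over-fitting this variant probes.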
## Baselines (tracked over time)

Scores are running aggregates across the four evaluations and four robustness variants. Exact numbers live in Braintrust; this is a snapshot.
| Agent | Greenfield | Modify | RCA | Migration |
|---|---|---|---|---|
| Claude Code | tracked | tracked | tracked | tracked |
| Codex | tracked | tracked | tracked | tracked |
| Gemini | tracked | tracked | tracked | tracked |
| Intellect-3 | tracked | tracked | tracked | tracked |
| Flatfile | — | — | — | tracked |