ENT-Bench

Status: Active

Application-layer benchmarks for evaluating AI agents on real enterprise tasks.

DRI: Matteo Carrabba
Overview

ENT-Bench is an open benchmark suite for enterprise AI. It grades agents on whether they can build, modify, debug, and migrate real application artifacts — not toy problems and not code alone. Each task runs under four robustness variants and scores through a three-layer pipeline ending in an LLM-as-judge with a rubric.

Task families

The four evaluations

| Eval | What it tests | Representative task |
|---|---|---|
| Greenfield Build | Design a new instance from a spec | Stand up a vendor onboarding application from a 2-page PRD |
| Modify Existing | Brownfield changes without regressions | Add a new approval step without breaking the existing 18 workflows |
| Root Cause Analysis | Diagnose a broken production state | Given three days of logs and a failing nightly job, locate the bad projection |
| Data Migration | Map messy source data into a target schema | Migrate a 1.3M-row legacy vendor table with inconsistent casing and fuzzy duplicates |
Difficulty ladder

L1 → L4

  1. L1 — Single artifact

     Bounded, unambiguous. 'Add a required `taxId` column to the vendors table.' Pass if the artifact matches the spec.

  2. L2 — Artifact set

     Coordinated change across a few artifacts. 'Add tax tracking: column, form field, workflow step, report filter.'

  3. L3 — Subsystem

     Requires reasoning across a feature area. 'Redesign the vendor approval flow so legal review happens before tax review — preserve historical runs.'

  4. L4 — Full system

     Ambiguous requirements, real trade-offs. 'The CFO wants cost center accounting. Figure out what that means here and propose a plan before you change anything.'
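A task on this ladder could be encoded as a small declarative record. This is an illustrative sketch only — the field names and allowed values below are assumptions, not ENT-Bench's actual schema:

```python
from dataclasses import dataclass

# Hypothetical task record; names and values are illustrative, not the
# benchmark's real schema.
@dataclass(frozen=True)
class Task:
    family: str       # "greenfield" | "modify" | "rca" | "migration"
    difficulty: int   # 1..4, per the ladder above
    spec: str         # the prompt / PRD text handed to the agent
    variants: tuple = ("clean", "noisy", "perturbed", "adversarial")

    def __post_init__(self):
        assert self.family in {"greenfield", "modify", "rca", "migration"}
        assert 1 <= self.difficulty <= 4

# An L1 "Modify Existing" task from the example above.
task = Task(family="modify", difficulty=1,
            spec="Add a required taxId column to the vendors table.")
```

Keeping difficulty and robustness variant on the task record makes the 4 × 4 eval grid a plain iteration over records rather than special-cased harness code.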

Scoring pipeline

Three layers

  1. L1 validators — pass/fail gate

     Does it compile? Does it plan cleanly? Does it apply without errors? Fail early; don't waste judge tokens on broken output.

  2. L2 structural — deterministic 0–1

     Do the produced artifacts match the expected structure? Column types right, workflow graph isomorphic, permission grants correct. Mechanical, reproducible.

  3. L3 LLM-as-judge — rubric 0–1

     For anything that can't be checked structurally — naming, idiomatic ZSL, code smell, graceful migration design. Always accompanied by a written rationale.
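The three layers above compose into a short-circuiting pipeline. A minimal sketch, with stand-in callables for each layer — how the two numeric layers combine (here a plain average) is an assumption, not the benchmark's published weighting:

```python
# Sketch of the three-layer scoring pipeline; validator/scorer/judge
# callables are stand-ins, not ENT-Bench's real API.

def score(run, validators, structural_scorers, judge):
    # Layer 1: pass/fail gate — any failing validator short-circuits to 0,
    # so no judge tokens are spent on broken output.
    if not all(v(run) for v in validators):
        return {"score": 0.0, "gate": "failed"}

    # Layer 2: deterministic structural checks, averaged to a 0-1 score.
    structural = sum(s(run) for s in structural_scorers) / len(structural_scorers)

    # Layer 3: LLM-as-judge returns a rubric score plus a written rationale.
    rubric, rationale = judge(run)

    return {"score": (structural + rubric) / 2,
            "structural": structural, "rubric": rubric,
            "rationale": rationale, "gate": "passed"}

# Toy run: gate passes, structure is perfect, judge awards 0.8.
result = score(run={"artifacts": []},
               validators=[lambda r: True],
               structural_scorers=[lambda r: 1.0],
               judge=lambda r: (0.8, "correct, but naming is inconsistent"))
```

The gate-first ordering is the point: deterministic checks are cheap and reproducible, so the expensive, noisier judge only ever sees output that already compiles and applies.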

Robustness variants

Clean

The spec as written. The baseline any agent should pass before we evaluate anything else.

Noisy

Real-world messiness: typos in the prompt, inconsistent column naming in sample data, contradictory sentences in the same PRD.

Perturbed

Small semantic shifts — synonyms, reworded requirements, metric unit changes. Does the agent over-fit to surface phrasing?

Adversarial

Prompts that nudge the agent toward a wrong-but-plausible solution. Does it notice the trap, or just comply?
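The noisy and perturbed variants can be generated mechanically from the clean spec. A toy sketch under loud assumptions — the real perturbations are presumably richer than a character swap and a three-word synonym table:

```python
import random

# Illustrative variant generators; ENT-Bench's actual perturbations are
# assumed to be far richer than these.

def noisy(spec, rng):
    # Inject a typo: swap two adjacent characters at a random position.
    i = rng.randrange(len(spec) - 1)
    return spec[:i] + spec[i + 1] + spec[i] + spec[i + 2:]

SYNONYMS = {"add": "introduce", "column": "field", "table": "record set"}

def perturbed(spec):
    # Reword via synonym substitution — same meaning, different surface form.
    return " ".join(SYNONYMS.get(w, w) for w in spec.split())

rng = random.Random(0)
base = "add a required taxId column to the vendors table"
```

An agent that passes clean but fails perturbed has over-fit to surface phrasing; one that passes noisy but fails adversarial follows instructions without checking them.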

Leaderboard

Baselines (tracked over time)

Scores are running aggregates across the four evaluations and four robustness variants. Exact numbers live in Braintrust; this is a snapshot.

| Agent | Greenfield | Modify | RCA | Migration |
|---|---|---|---|---|
| Claude Code | tracked | tracked | tracked | tracked |
| Codex | tracked | tracked | tracked | tracked |
| Gemini | tracked | tracked | tracked | tracked |
| Intellect-3 | tracked | tracked | tracked | tracked |
| Flatfile | tracked | | | |
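A running aggregate over the 4 evals × 4 variants grid is a straightforward mean over populated cells. Sketch only — the grid below is placeholder data, not real leaderboard numbers, and the real aggregation (in Braintrust) may weight cells differently:

```python
# Running-aggregate sketch: mean score per agent across the
# 4 evals x 4 robustness variants. Demo data is fabricated.

EVALS = ["greenfield", "modify", "rca", "migration"]
VARIANTS = ["clean", "noisy", "perturbed", "adversarial"]

def aggregate(scores):
    """scores: {(eval, variant): float in 0-1}; missing cells are skipped."""
    cells = [v for (e, var), v in scores.items()
             if e in EVALS and var in VARIANTS]
    return sum(cells) / len(cells) if cells else None

# Fully populated placeholder grid: every cell scored 0.5.
demo = {(e, v): 0.5 for e in EVALS for v in VARIANTS}
```

Skipping missing cells (rather than scoring them 0) keeps partially tracked agents like the last row comparable without penalizing coverage gaps as failures — a judgment call worth stating explicitly on the leaderboard.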
