
MetaHarness

In Development

Self-learning, self-evolving agent runtime with pluggable model-serving. The infrastructure layer under dossbot.

DRI: Wiley Jones
Overview

MetaHarness is the runtime dossbot runs on: pluggable model serving (OAI-compatible APIs for any model), a tool-calling loop that treats every tool as a durable ZFlow step, feedback capture on every action, and a self-tuning layer that adjusts routing, prompts, and tool definitions based on observed performance. Models improve over time; so should the harness around them.

Core ideas

What makes it a harness, not just a wrapper

Tool = ZFlow step

Every tool the agent can call is registered as a durable ZFlow function. Retries are safe, outputs are journaled, and long-running tools don't block the conversation.
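A minimal sketch of what this registration could look like. ZFlow's real API is not shown in this document, so the names here (`DurableStep`, `registerTool`, `callTool`) are illustrative placeholders, not ZFlow's actual interface:

```typescript
// Hypothetical shape of tool registration: every tool handler becomes a
// named, durable step. In the real runtime, ZFlow would journal outputs
// and retry failed steps; this sketch only models the registry lookup.
type ToolHandler = (input: unknown) => Promise<unknown>;

interface DurableStep {
  name: string;
  run: ToolHandler; // retried safely; output journaled by the runtime
}

const registry = new Map<string, DurableStep>();

function registerTool(name: string, run: ToolHandler): void {
  registry.set(name, { name, run });
}

// The agent loop resolves tool calls through the registry, so every call
// is a step the workflow engine can replay without re-running side effects.
async function callTool(name: string, input: unknown): Promise<unknown> {
  const step = registry.get(name);
  if (!step) throw new Error(`unknown tool: ${name}`);
  return step.run(input);
}
```

The point of the indirection: because the agent never invokes a handler directly, the runtime between `callTool` and `step.run` can add journaling, retries, and deduplication without the agent code changing.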

OAI-compatible everywhere

All model backends — Anthropic, OpenAI, Gemini, self-hosted on Baseten/Together — speak one API surface. Routing policy decides which provider sees which request.
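A sketch of what "one API surface" buys: the request shape never changes, only the base URL does. The endpoint map and model names below are illustrative placeholders, not the harness's actual configuration:

```typescript
// One OpenAI-compatible request shape, many backends. Swapping providers
// is a base-URL change, not a code change.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Placeholder endpoints; real values would come from routing policy/config.
const endpoints: Record<string, string> = {
  anthropic: "https://api.anthropic.com/v1",
  openai: "https://api.openai.com/v1",
  baseten: "https://model-xyz.api.baseten.co/v1", // hypothetical deployment URL
};

// Resolve the full chat-completions URL for whichever provider the
// routing policy selected. The payload itself stays identical.
function routeRequest(provider: string, req: ChatRequest): string {
  const base = endpoints[provider];
  if (!base) throw new Error(`no endpoint for provider: ${provider}`);
  return `${base}/chat/completions`;
}
```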

Feedback as first-class data

Every step records a reward signal: did the tool succeed, did the user accept the output, did a follow-up run correct it? These signals feed the self-tuning layer.
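The three signals named above can be sketched as a per-step record plus a naive scalar reward. Field names and the weighting are illustrative assumptions, not the harness's actual reward model:

```typescript
// Per-step feedback record capturing the three signals from the text.
interface StepFeedback {
  toolSucceeded: boolean;        // did the tool call complete without error?
  userAccepted: boolean | null;  // explicit accept/reject; null if no signal
  correctedDownstream: boolean;  // did a follow-up run override this output?
}

// Naive scalar reward: success and acceptance are positive, a later
// correction is negative. Real systems would weight and normalize these.
function reward(f: StepFeedback): number {
  let r = f.toolSucceeded ? 1 : -1;
  if (f.userAccepted === true) r += 1;
  if (f.userAccepted === false) r -= 1;
  if (f.correctedDownstream) r -= 1;
  return r;
}
```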

Budget-aware routing

Per-tenant policies constrain cost and latency. A cheap model for low-stakes queries, a frontier model for migrations, a local model for anything touching PHI.
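The three examples in that sentence reduce to a small decision function. Tier names and the rule ordering are illustrative; the actual policy engine would also weigh cost and latency budgets:

```typescript
type Tier = "cheap" | "frontier" | "local";

interface Task {
  stakes: "low" | "high";  // e.g. a lookup vs. a migration
  touchesPhi: boolean;     // protected health information
}

// Ordering matters: data constraints override capability preferences.
function pickTier(task: Task): Tier {
  if (task.touchesPhi) return "local";        // PHI never leaves the boundary
  if (task.stakes === "high") return "frontier"; // e.g. migrations
  return "cheap";                              // low-stakes queries
}
```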

Routing surface

Provider matrix

All providers expose an OAI-compatible API. MetaHarness picks the target per request based on task type, tenant budget, and latency target.

| Provider | Hosting | Strengths | Typical use |
| --- | --- | --- | --- |
| Anthropic Claude | API | Long context, tool use, writing | Default frontier agent |
| OpenAI | API | Structured output, function calling | Structured extraction |
| Google Gemini | API + Vertex | Multimodal, pricing | Doc / screenshot parsing |
| Baseten | Self-host on GCP | Custom fine-tunes | CENTARI serving |
| Vertex AI | GCP | Enterprise sovereignty | Regulated-data tenants |
| Together / Fireworks | Self-host | OSS models at cost | Bulk, low-stakes calls |
The self-evolving loop

How the harness tunes itself

  1. Collect: Every run emits a trace: model, tools, tokens, latency, cost, user feedback, downstream corrections.

  2. Aggregate: Traces land in ClickHouse. Rollups by tool, model, task type, tenant, and time window give a signal about what's working.

  3. Propose: A tuning agent suggests changes: demote an underperforming tool description, reroute a task class to a different model, tighten a prompt that's producing inconsistent output.

  4. Evaluate: Proposed changes run against an ENT-Bench-style replay harness on historical traces. Only wins with statistical headroom get promoted.

  5. Promote: Canary the new config to a small tenant slice. Roll forward if metrics hold; roll back if they don't.
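The evaluate-to-promote gate in steps 4 and 5 can be sketched as a single check on replay results. The win-rate margin and sample floor below are illustrative thresholds, not the harness's real statistical test (which would presumably use a proper significance test rather than a fixed margin):

```typescript
// Replay outcome for one config over historical traces.
interface ReplayResult {
  wins: number;  // traces where this config's output was judged better
  total: number; // traces replayed
}

// Promote only when the candidate beats the baseline with headroom,
// and only when there are enough replayed traces to trust the rate.
function shouldPromote(
  candidate: ReplayResult,
  baseline: ReplayResult,
  margin = 0.03,     // required win-rate headroom (illustrative)
  minSamples = 200,  // replay-trace floor (illustrative)
): boolean {
  if (candidate.total < minSamples) return false;
  const cRate = candidate.wins / candidate.total;
  const bRate = baseline.wins / baseline.total;
  return cRate >= bRate + margin;
}
```

A config that passes this gate would then canary to a small tenant slice, with the same metrics deciding roll-forward versus rollback.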

Stack

Vercel AI SDK: tool-calling loop, streaming, typed messages.
ZFlow: every tool call is a durable step.
ClickHouse: trace store for analytics and replay.
Braintrust: eval and regression tracking against ENT-Bench.
Turbopuffer: vector search for tool and context retrieval.
Tapestry: grounded context at tool-call time.
