
MetaHarness

In Development

Self-learning, self-evolving agent runtime with pluggable model-serving. The infrastructure layer under dossbot.

DRI: Wiley Jones
Overview

MetaHarness is the runtime dossbot runs on: pluggable model serving (OAI-compatible APIs for any model), a tool-calling loop that treats every tool as a durable ZFlow step, feedback capture on every action, and a self-tuning layer that adjusts routing, prompts, and tool definitions based on observed performance. Models improve over time; so should the harness around them.

Core ideas

What makes it a harness, not just a wrapper

Tool = ZFlow step

Every tool the agent can call is registered as a durable ZFlow function. Retries are safe, outputs are journaled, and long-running tools don't block the conversation.
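A minimal sketch of what this registration could look like. ZFlow's real API is not shown in this document, so the names here (`DurableStep`, `registerTool`, `callTool`) are illustrative placeholders, not ZFlow's actual interface:

```typescript
// Hypothetical shape of tool registration: every tool handler becomes a
// named, durable step. In the real runtime, ZFlow would journal outputs
// and retry failed steps; this sketch only models the registry lookup.
type ToolHandler = (input: unknown) => Promise<unknown>;

interface DurableStep {
  name: string;
  run: ToolHandler; // retried safely; output journaled by the runtime
}

const registry = new Map<string, DurableStep>();

function registerTool(name: string, run: ToolHandler): void {
  registry.set(name, { name, run });
}

// The agent loop resolves tool calls through the registry, so every call
// is a step the workflow engine can replay without re-running side effects.
async function callTool(name: string, input: unknown): Promise<unknown> {
  const step = registry.get(name);
  if (!step) throw new Error(`unknown tool: ${name}`);
  return step.run(input);
}
```

The point of the indirection: because the agent never invokes a handler directly, the runtime between `callTool` and `step.run` can add journaling, retries, and deduplication without the agent code changing.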

OAI-compatible everywhere

All model backends — Anthropic, OpenAI, Gemini, self-hosted on Baseten/Together — speak one API surface. Routing policy decides which provider sees which request.
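A sketch of what "one API surface" buys: the request shape never changes, only the base URL does. The endpoint map and model names below are illustrative placeholders, not the harness's actual configuration:

```typescript
// One OpenAI-compatible request shape, many backends. Swapping providers
// is a base-URL change, not a code change.
interface ChatRequest {
  model: string;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

// Placeholder endpoints; real values would come from routing policy/config.
const endpoints: Record<string, string> = {
  anthropic: "https://api.anthropic.com/v1",
  openai: "https://api.openai.com/v1",
  baseten: "https://model-xyz.api.baseten.co/v1", // hypothetical deployment URL
};

// Resolve the full chat-completions URL for whichever provider the
// routing policy selected. The payload itself stays identical.
function routeRequest(provider: string, req: ChatRequest): string {
  const base = endpoints[provider];
  if (!base) throw new Error(`no endpoint for provider: ${provider}`);
  return `${base}/chat/completions`;
}
```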

Feedback as first-class data

Every step records a reward signal: did the tool succeed, did the user accept the output, did a follow-up run correct it? These signals feed the self-tuning layer.
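The three signals named above can be sketched as a per-step record plus a naive scalar reward. Field names and the weighting are illustrative assumptions, not the harness's actual reward model:

```typescript
// Per-step feedback record capturing the three signals from the text.
interface StepFeedback {
  toolSucceeded: boolean;        // did the tool call complete without error?
  userAccepted: boolean | null;  // explicit accept/reject; null if no signal
  correctedDownstream: boolean;  // did a follow-up run override this output?
}

// Naive scalar reward: success and acceptance are positive, a later
// correction is negative. Real systems would weight and normalize these.
function reward(f: StepFeedback): number {
  let r = f.toolSucceeded ? 1 : -1;
  if (f.userAccepted === true) r += 1;
  if (f.userAccepted === false) r -= 1;
  if (f.correctedDownstream) r -= 1;
  return r;
}
```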

Budget-aware routing

Per-tenant policies constrain cost and latency. A cheap model for low-stakes queries, a frontier model for migrations, a local model for anything touching PHI.
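The three examples in that sentence reduce to a small decision function. Tier names and the rule ordering are illustrative; the actual policy engine would also weigh cost and latency budgets:

```typescript
type Tier = "cheap" | "frontier" | "local";

interface Task {
  stakes: "low" | "high";  // e.g. a lookup vs. a migration
  touchesPhi: boolean;     // protected health information
}

// Ordering matters: data constraints override capability preferences.
function pickTier(task: Task): Tier {
  if (task.touchesPhi) return "local";        // PHI never leaves the boundary
  if (task.stakes === "high") return "frontier"; // e.g. migrations
  return "cheap";                              // low-stakes queries
}
```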

Routing surface

Provider matrix

All providers expose an OAI-compatible API. MetaHarness picks the target per request based on task type, tenant budget, and latency target.

| Provider | Hosting | Strengths | Typical use |
| --- | --- | --- | --- |
| Anthropic Claude | API | Long context, tool use, writing | Default frontier agent |
| OpenAI | API | Structured output, function calling | Structured extraction |
| Google Gemini | API + Vertex | Multimodal, pricing | Doc / screenshot parsing |
| Baseten | Self-host on GCP | Custom fine-tunes | CENTARI serving |
| Vertex AI | GCP | Enterprise sovereignty | Regulated-data tenants |
| Together / Fireworks | Self-host | OSS models at cost | Bulk, low-stakes calls |
The self-evolving loop

How the harness tunes itself

  1. Collect: Every run emits a trace: model, tools, tokens, latency, cost, user feedback, downstream corrections.

  2. Aggregate: Traces land in ClickHouse. Rollups by tool, model, task type, tenant, and time window give a signal about what's working.

  3. Propose: A tuning agent suggests changes: demote an underperforming tool description, reroute a task class to a different model, tighten a prompt that's producing inconsistent output.

  4. Evaluate: Proposed changes run against an ENT-Bench-style replay harness on historical traces. Only wins with statistical headroom get promoted.

  5. Promote: Canary the new config to a small tenant slice. Roll forward if metrics hold; roll back if they don't.
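The evaluate-to-promote gate in steps 4 and 5 can be sketched as a single check on replay results. The win-rate margin and sample floor below are illustrative thresholds, not the harness's real statistical test (which would presumably use a proper significance test rather than a fixed margin):

```typescript
// Replay outcome for one config over historical traces.
interface ReplayResult {
  wins: number;  // traces where this config's output was judged better
  total: number; // traces replayed
}

// Promote only when the candidate beats the baseline with headroom,
// and only when there are enough replayed traces to trust the rate.
function shouldPromote(
  candidate: ReplayResult,
  baseline: ReplayResult,
  margin = 0.03,     // required win-rate headroom (illustrative)
  minSamples = 200,  // replay-trace floor (illustrative)
): boolean {
  if (candidate.total < minSamples) return false;
  const cRate = candidate.wins / candidate.total;
  const bRate = baseline.wins / baseline.total;
  return cRate >= bRate + margin;
}
```

A config that passes this gate would then canary to a small tenant slice, with the same metrics deciding roll-forward versus rollback.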

Stack

Vercel AI SDK: tool-calling loop, streaming, typed messages.
ZFlow: every tool call is a durable step.
ClickHouse: trace store for analytics and replay.
Braintrust: eval and regression tracking against ENT-Bench.
Turbopuffer: vector search for tool and context retrieval.
Tapestry: grounded context at tool-call time.
