
Promptfoo Parity Matrix

AgentV uses a similar eval config contract to Promptfoo for ordinary authored evals: prompt matrices, test rows, vars, default test data, assertions, and target matrices all use the same broad shape. AgentV keeps the wire format snake_case, keeps target identity separate from provider/backend selection, and adds repo-native workspace and artifact fields for agent evaluation.

Use this matrix when translating a Promptfoo-style normal eval into AgentV YAML. It documents which surfaces align directly, which AgentV surfaces are cleaner greenfield extensions, and which Promptfoo surfaces are deferred until AgentV implements equivalent semantics directly.

| Decision | Meaning |
| --- | --- |
| Align with Promptfoo | AgentV accepts the same concept, with snake_case where the field crosses the YAML boundary. |
| Keep AgentV divergence | AgentV intentionally uses a different shape because it is clearer for repo-native agent evals. |
| Keep AgentV extension | AgentV adds a capability that does not try to be Promptfoo-compatible. |
| Defer/future-scope | AgentV does not accept the Promptfoo surface yet. Use an AgentV primitive or wait for direct implementation. |


| Surface | Promptfoo shape | AgentV shape | Decision | Notes |
| --- | --- | --- | --- | --- |
| Prompt matrix | Top-level `prompts` rendered with each test’s `vars`. | Top-level `prompts` rendered with `tests[].vars` and `default_test.vars`. | Align with Promptfoo | This is the canonical Promptfoo-compatible input shape in AgentV. Prompt entries can be inline strings, chat arrays, files, or generated prompt functions. |
| Test rows | `tests` can be inline rows or a case-file reference; rows carry `vars`, `assert`, `metadata`, prompt/provider filters, and expected data. | `tests` can be inline rows or a raw-case path; rows carry `vars`, `assert`, `expected_output`, `metadata`, workspace overrides, and run overrides. | Align with Promptfoo | AgentV also supports `imports.suites` and `imports.tests` for explicit composition. Raw cases do not own suite context. |
| Variables | `tests[].vars` plus `defaultTest.vars`; prompt templates can reference top-level var names. | `tests[].vars` plus `default_test.vars`; templates can use `{{ name }}` or `{{ vars.name }}`. | Align with Promptfoo | Per-test vars override default vars by key. |
| Default test | `defaultTest`, inline object or `file://` reference. | `default_test`, inline object or `file://` / `ref://` reference. | Align with Promptfoo | AgentV uses snake_case for YAML. Shared prompt matrix defaults belong in `default_test.vars`. |
| Evaluate options | `evaluateOptions` for runtime controls. | `evaluate_options` for runtime controls. | Align with Promptfoo | AgentV uses `evaluate_options.repeat`, `evaluate_options.budget_usd`, and `evaluate_options.max_concurrency`. |
| Authored concurrency | Common Promptfoo usage includes runtime options such as `maxConcurrency`. | `evaluate_options.max_concurrency`. | Keep AgentV divergence | Do not author `execution.max_concurrency` or top-level `workers` in eval YAML. CLI `--workers` remains an operator override. |
| Target selection | Promptfoo normal evals use `providers`; `targets` can alias `providers` in unified config. | Use top-level `target` for one system under test or top-level `targets` for a target matrix. | Keep AgentV divergence | AgentV reserves `provider` for the backend/adapter kind inside a target object. Top-level `providers` is rejected to avoid overloading that term. |
| Target object identity | Provider options often use `id` for backend/provider spec and optional `label` for display or matching. | Target objects use stable `id` for target identity, `provider` for backend kind, optional `runtime`, and `config` for provider settings. | Keep AgentV divergence | AgentV does not copy Promptfoo’s `label`/`id` baggage because `provider` already names the backend boundary. |
| Direct authored input | Promptfoo prompt authoring normally goes through `prompts` plus `vars`. | Top-level `input` and inline `tests[].input` are removed from normal authored eval YAML. External raw-case imports may still carry internal `input` rows for compatibility. | Removed AgentV extension | Author prompt text, chat/system/user messages, and file-backed prompt content as `prompts`; put row data in `tests[].vars` and shared defaults in `default_test.vars`. |
| Suite assertions | `assert` entries can be strings or typed assertion objects. | `assert` entries can be strings, typed assertion objects, script graders, or AgentV extension graders. | Align with Promptfoo | Plain strings become semantic rubric checks. Use `assert`, not `assertions`, in current authored eval YAML. |
| Assertion grouping | `type: assert-set` with child `assert` entries, optional `config`, `metric`, `weight`, and `threshold`. | `type: assert-set` with child `assert`, optional `config`, metric names, weights, and parent `threshold`. | Align with Promptfoo | Parent `config` is inherited by child assertions; child `config` keys override shared parent keys. Without `threshold`, pass/fail follows nonzero-weight child assertions. With `threshold`, the weighted aggregate score determines pass/fail. `type: composite` is rejected; use `assert-set`. |
| Deterministic assertion vocabulary | Common Promptfoo types include `contains`, `icontains`, `contains-any`, `contains-all`, `starts-with`, `regex`, `is-json`, `equals`, `latency`, `cost`, `javascript`, `python`, `webhook`, `similar`, and `llm-rubric`. | AgentV accepts the implemented overlap, including `contains`, `icontains`, `contains-any`, `contains-all`, `starts-with`, `regex`, `is-json`, `equals`, `latency`, `cost`, `javascript`, `python`, `webhook`, `similar`, and `llm-rubric`. | Align with Promptfoo | Unsupported Promptfoo assertion names error instead of silently becoming custom assertion names. |
| Custom assertion terminology | Promptfoo calls normal eval custom logic assertions, with fixed code assertion types such as `javascript`, `python`, `ruby`, and `webhook`. | `defineAssertion()` files in `.agentv/assertions/` become reusable assertion type names. | Keep AgentV extension | AgentV keeps assertion terminology and extends discovery to arbitrary assertion type names such as `has-citation`. |
| Script/custom grader terminology | Promptfoo custom code assertions are still assertion types. | `defineScriptGrader()` powers command-backed graders referenced with `type: script` and `command:`. | Keep AgentV divergence | Use script grader wording only for command-backed or LLM-backed scoring components that need explicit score and assertion-result control. |
| Tool and trace assertions | Promptfoo includes `trajectory:tool-used`, `trajectory:tool-sequence`, `trajectory:tool-args-match`, `trajectory:step-count`, `trajectory:goal-success`, `tool-call-f1`, `skill-used`, `trace-span-count`, `trace-span-duration`, and `trace-error-spans`. | AgentV rejects those names until their semantics are implemented directly. | Defer/future-scope | These names are not aliases for AgentV’s `tool-trajectory` grader. |
| Tool trajectory grader | No direct Promptfoo alias for AgentV-normalized transcript semantics. | `type: tool-trajectory`. | Keep AgentV extension | This is AgentV-specific and operates over AgentV-normalized transcripts and trace summaries. |
| Repo-native workspace fields | Promptfoo normal evals do not own AgentV workspace materialization. | `workspace`, `workspace.repos`, `workspace.scope`, `workspace.docker`, `extensions`, and per-test `workspace`. | Keep AgentV extension | AgentV evaluates real repositories and agent workspaces, so workspace provenance is first-class authored config. |
| Run artifacts and inspection | Promptfoo owns its own result viewer and output formats. | AgentV writes `.agentv/results/<run_id>/` bundles with `summary.json`, `.internal/index.jsonl`, sidecars, and local Dashboard support. | Keep AgentV extension | AgentV-owned bundles are the source of truth for compare, Dashboard, CI, and adapters. Phoenix is link-out correlation only through safe external trace metadata. |
| Compare command | Promptfoo has its own result comparison surfaces. | `agentv results compare <baseline-index.jsonl> <candidate-index.jsonl>`. | Keep AgentV extension | Compare consumes completed AgentV run indexes such as `.agentv/results/<run_id>/.internal/index.jsonl`. |
| CLI runtime filters | Promptfoo exposes filters such as prompt/provider/test subset flags. | AgentV supports its current CLI filters and selection fields; full Promptfoo runtime-filter parity is future work. | Defer/future-scope | Prefer authored `select`/`imports` or current AgentV CLI flags until runtime-filter parity lands. |
| Wire-format casing | Promptfoo config uses camelCase fields such as `defaultTest` and `evaluateOptions`. | AgentV YAML, JSONL, artifacts, and CLI JSON use snake_case; internal TypeScript uses camelCase. | Keep AgentV divergence | Translate only at process boundaries. New public wire fields should be snake_case. |
| Hard-rejected stale AgentV fields | Not applicable to Promptfoo. | Removed AgentV-era fields such as top-level `execution`, `execution.target`, `execution.targets`, top-level `budget_usd`, top-level `repeat`/`runs`, and `composite` are rejected. | Keep AgentV divergence | Use top-level `target`/`targets`, `evaluate_options`, `evaluate_options.repeat`, and `assert-set`. Migration guidance lives in the eval migration skill reference. |

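
The assertion grouping row describes two pass/fail modes for `assert-set`. The Python sketch below is one possible reading of that rule, not AgentV's implementation: `ChildResult` and `assert_set_passes` are hypothetical names, the aggregate is assumed to be a weighted mean of child scores, and "follows nonzero-weight child assertions" is assumed to mean that every such child must pass.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ChildResult:
    """Outcome of one child assertion (hypothetical shape, not AgentV's API)."""
    score: float          # expected to lie in the range 0.0 to 1.0
    passed: bool
    weight: float = 1.0


def assert_set_passes(children: Sequence[ChildResult],
                      threshold: Optional[float] = None) -> bool:
    """Decide pass/fail for an assert-set under the assumptions stated above."""
    # Zero-weight children are reported but never affect the verdict.
    weighted = [c for c in children if c.weight != 0]
    if not weighted:
        return True  # assumption: nothing weighted means nothing can fail

    if threshold is None:
        # No threshold: every nonzero-weight child must pass on its own.
        return all(c.passed for c in weighted)

    # Threshold set: compare the weighted mean score against it.
    total_weight = sum(c.weight for c in weighted)
    aggregate = sum(c.score * c.weight for c in weighted) / total_weight
    return aggregate >= threshold
```

Under these assumptions, a set with one passing child (score 1.0) and one failing child (score 0.5) fails without a threshold, passes with `threshold=0.7`, and fails with `threshold=0.8`, because the weighted mean is 0.75.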

```yaml
description: Release-note summarization
target: local-mini
prompts:
  - id: direct
    label: Direct
    prompt: "Summarize {{ topic }} for {{ audience }}."
default_test:
  # Shared defaults; per-test vars override these by key.
  vars:
    audience: engineers
evaluate_options:
  max_concurrency: 2
tests:
  - id: release-notes
    vars:
      topic: the July release notes
    expected_output: concise release-note summary
    assert:
      # A plain string becomes a semantic rubric check.
      - Identifies the most important change
      # With a threshold, the weighted aggregate score decides pass/fail.
      - type: assert-set
        metric: release_gate
        threshold: 0.8
        assert:
          - type: contains
            value: July
          - type: llm-rubric
            value: The answer is concise and accurate.
```

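
To make the variable rules concrete, the following Python sketch merges `default_test.vars` with a row's `vars` and renders a prompt template. It is illustrative only: `merge_vars` and `render` are hypothetical helpers, the regular expression handles just the `{{ name }}` and `{{ vars.name }}` forms named in the matrix, and raising `KeyError` on a missing variable is an assumption rather than documented AgentV behavior.

```python
import re
from typing import Any, Dict

# Matches "{{ name }}" and "{{ vars.name }}"; other template syntax is ignored.
_PLACEHOLDER = re.compile(r"\{\{\s*(?:vars\.)?(\w+)\s*\}\}")


def merge_vars(default_vars: Dict[str, Any],
               test_vars: Dict[str, Any]) -> Dict[str, Any]:
    """Per-test vars override default vars by key."""
    return {**default_vars, **test_vars}


def render(template: str, variables: Dict[str, Any]) -> str:
    """Substitute each placeholder; a missing variable raises KeyError."""
    return _PLACEHOLDER.sub(lambda m: str(variables[m.group(1)]), template)


if __name__ == "__main__":
    merged = merge_vars({"audience": "engineers"},
                        {"topic": "the July release notes"})
    print(render("Summarize {{ topic }} for {{ audience }}.", merged))
    # prints: Summarize the July release notes for engineers.
```

If the row also set `audience`, the row's value would win, since later keys replace earlier ones in the merged dictionary.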

```yaml
description: Repo-native direct task suite
target:
  # id is the stable target identity; provider names the backend kind.
  id: codex-local
  provider: codex-app-server
  runtime: host
  config:
    command: ["codex", "app-server"]
workspace:
  repos:
    - path: ./app
      repo: acme/support-app
      commit: main
  scope: attempt
prompts:
  # A chat-array prompt: one user message built from a file part and a text part.
  - - role: user
      content:
        - type: file
          value: ./instructions.md
        - type: text
          value: "{{ task }}"
tests:
  - id: refund-policy
    vars:
      task: Update the refund policy handler.
    expected_output: The handler supports the damaged-item exception.
    assert:
      # AgentV-specific grader over normalized transcripts and trace summaries.
      - type: tool-trajectory
        mode: any_order
        minimums:
          shell: 1
      # Command-backed script grader.
      - type: script
        command: [bun, run, graders/check-refund-policy.ts]
```
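
When porting a Promptfoo config by hand, much of the mechanical work is the casing change described in the wire-format row (`defaultTest` to `default_test`, `evaluateOptions` to `evaluate_options`). The Python sketch below shows one way to do that on an already-parsed config. `to_snake` and `snake_case_keys` are hypothetical helpers, not part of AgentV or Promptfoo. Leaving keys under `vars` untouched is an assumption, made here so that template variable names keep matching their prompts.

```python
import re
from typing import Any

# Zero-width boundary between a lowercase letter or digit and an uppercase letter.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def to_snake(key: str) -> str:
    """Convert a camelCase key such as 'defaultTest' to 'default_test'."""
    return _CAMEL_BOUNDARY.sub("_", key).lower()


def snake_case_keys(node: Any, under_vars: bool = False) -> Any:
    """Recursively rename mapping keys; values and anything under 'vars' stay as-is."""
    if isinstance(node, dict):
        renamed = {}
        for key, value in node.items():
            new_key = key if under_vars else to_snake(key)
            renamed[new_key] = snake_case_keys(value, under_vars or key == "vars")
        return renamed
    if isinstance(node, list):
        return [snake_case_keys(item, under_vars) for item in node]
    return node  # scalars, including hyphenated assertion type names, are unchanged
```

This covers casing only. Divergences listed in the matrix, such as replacing top-level `providers` with `target` or `targets`, still need a manual rewrite.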