Autonomous Agents Agentic Workflow
Score Experiment Branches — Autonomous Agents Agentic Workflow
Judge two A/B experiment branches against ticket acceptance criteria — read-only, posts one verdict comment
sidebutton install agents The A/B experiment judge. After a forked playbook step finishes both branches, this workflow reviews each branch's work product against the ticket's acceptance criteria and posts ONE structured verdict comment on the ticket: per-branch scores (coverage / correctness / quality / overall, each 0-100), a recommended winner, and a rationale. The engine parses that comment into the experiment ledger; the recommendation is advisory — the operator makes the actual pick.
Quality only: cost, turns, and model identity come from each branch's job telemetry and must not influence the scores. Read-only otherwise — the verdict comment is this workflow's only write. No code changes, no merges, no ticket status changes, no PR comments.
Steps
- 1. Open a terminal
- title
- Agent: Score Experiment Branches
- cwd
- {{entry_path}}
terminal.open - 2. Run a terminal command
- cmd
- |
terminal.run
Workflow definition
schema_version: 1
id: agent_experiment_score
title: "Score Experiment Branches"
description: "Judge two A/B experiment branches against ticket acceptance criteria — read-only, posts one verdict comment"
overview: |
The A/B experiment judge. After a forked playbook step finishes both branches, this workflow
reviews each branch's work product against the ticket's acceptance criteria and posts ONE
structured verdict comment on the ticket: per-branch scores (coverage / correctness / quality /
overall, each 0-100), a recommended winner, and a rationale. The engine parses that comment into
the experiment ledger; the recommendation is advisory — the operator makes the actual pick.
Quality only: cost, turns, and model identity come from each branch's job telemetry and must not
influence the scores. Read-only otherwise — the verdict comment is this workflow's only write.
No code changes, no merges, no ticket status changes, no PR comments.
category:
level: pipeline
domain: engineering
metadata:
agent: true
role: qa
params:
ticket_url:
type: string
description: "Jira ticket URL"
branch_a_ref:
type: string
description: "Branch A's work product: a PR URL, or an instruction pointing at the ticket's [exp A] comment"
branch_b_ref:
type: string
description: "Branch B's work product: a PR URL, or an instruction pointing at the ticket's [exp B] comment"
hint:
type: string
default: ""
description: "Optional extra instructions for the agent"
entry_path:
type: string
default: "~/workspace"
description: "Working directory for the agent"
steps:
- type: terminal.open
title: "Agent: Score Experiment Branches"
cwd: "{{entry_path}}"
- type: terminal.run
cmd: |
source ~/.agent-env && claude --dangerously-skip-permissions "$(cat <<'SB_PROMPT'
read ticket with attachments and all comments - {{ticket_url}}. if it fails, stop and report the error.
you are the judge for an A/B experiment: two agents each completed this ticket's step independently, as branch A and branch B. score both against the ticket's acceptance criteria.
branch A: {{branch_a_ref}}
branch B: {{branch_b_ref}}
for each branch: if its ref is a PR link, review that PR's diff line-by-line against the acceptance criteria; otherwise follow the ref's instruction (the ticket comment prefixed [exp A] or [exp B] holds that branch's result). you may check out and build a branch locally when the diff alone is inconclusive. if a branch left no PR and no prefixed comment, score only what is evidenced and say so in the rationale — never bail without posting the verdict comment.
score QUALITY ONLY, 0-100 per dimension: coverage (each acceptance criterion mapped to work that implements it), correctness (logic and edge cases), quality (maintainability, tests, scope discipline), plus an overall. ignore cost, speed, and which model produced the work — the system compares those from job telemetry.
READ-ONLY GUARDRAILS: do not change code, do not merge or approve anything, do not change the ticket status, do not comment on any PR, do not post more than one ticket comment.
{{hint}}
write EXACTLY ONE comment to the ticket, as your FINAL action: a 2-4 sentence comparison summary, then exactly one fenced json code block in exactly this shape (all scores numbers 0-100, recommendation "A" or "B", non-empty rationale):
```json
{"branches":{"A":{"coverage":90,"correctness":85,"quality":80,"overall":85},"B":{"coverage":95,"correctness":88,"quality":82,"overall":88}},"recommendation":"B","rationale":"B implements the empty-state AC that A missed; both build and test clean."}
```
replace the example scores and rationale with your real verdict. in your summary and rationale, do NOT include PR links or PR numbers, and do NOT use the words pass, fail, merged, blocked, or conflict — the pipeline string-matches ticket comments for those tokens and your comment must not trigger it.
SB_PROMPT
)"