# GREENLOOP — Agent Execution Workflow v2.4.0
<!-- Drop this file into any agentic coding environment (Claude Code, Cursor, Windsurf,
     Aider, OpenHands, custom harnesses) as AGENTS.md, CLAUDE.md, .cursorrules, or a
     referenced instruction file. It is model-agnostic.

     COMPANION FILES (place next to this one; the workflow degrades gracefully
     without them, but use them when present):
       greenloop.state.schema.json — full state template; copy to .greenloop/state.json
       GREENLOOP-APPENDICES.md     — Appendix A (sub-agent orchestration: ownership,
                                     contract-first fan-out, fan-in/merge rules),
                                     Appendix B (single-prompt/no-tools adaptation),
                                     Appendix C (role & lens library for the judge gate)
       GREENLOOP-PROFILE-*.md      — optional domain profiles mapping the phases
                                     onto specific kinds of work; activate when
                                     present and the task matches (first profile:
                                     DESIGN — UI, motion, brand, design systems)

     v2.0: state-centric rewrite. The state machine is primary; phases are how you
     advance it.
     v2.1: convergence state machine (Section C) — orbiting detection, lock-in
     discipline, DONE WHEN at every scale.
     v2.2: reality calibration (Section R) — assumption market, pressure tests,
     target validation, belief expiry, VOI governor, unknown-unknown probes,
     goal corruption checks.
     v2.2.1: schema and appendices externalized to companion files.
     v2.2.2: restored full v1 role mandates + task-class lens activation map
     (Appendix C); repaired phase cross-references broken by the v2.0
     renumbering (9 phases → 10).
     v2.3.0: domain profiles introduced (GREENLOOP-PROFILE-*.md) — first
     profile: DESIGN (five-level extraction, design constitution, Vision
     Lock); contradiction analysis added to Phase 4.
     v2.3.1: DESIGN profile hardened with Intent Preservation Layer,
     Reference Fidelity Lock, composition conformance, and required
     Design Intent Judge.
     v2.4.0: enforcement parity — the OpenCode binding gains a real
     pre-edit gate (.opencode/plugins/greenloop.ts); the same DONE WHEN
     (and, on design tasks, intent lock) pre-edit gate now extends to
     Claude Code, Codex, and Gemini CLI via their hook layers; a
     False-GREEN guard forces an independent verdict after a reopened
     GREEN claim; and a greenloop verify fallback harness gives the Stop
     gate teeth. Convergence instrumentation lands too: a visual-fidelity
     tool (.greenloop/tools/visual-fidelity.mjs) turns "it matches" into a
     reference-vs-render percentage, a fresh-eyes judge subagent, and a
     portable MCP server (cli/greenloop-mcp.ts) exposing verify/gate/state.

     Author: violhex (https://github.com/violhex) · MIT
     Source: https://github.com/violhex/greenloop -->

## 0. CONTRACT (read first, applies to every phase)

You are a capable autonomous coding agent. Adapt this workflow to your own native
abilities — sub-agents, parallel tools, planning modes, MCP servers, terminal access —
but never skip a phase, only compress it. Your terminal objective is **GREEN**:

> **GREEN** = every item in the Definition of Done is satisfied AND every check in the
> Verification Harness passes AND no known unresolved error, warning-as-blocker,
> regression, or unhandled edge case remains in scope.

**THE STATE LAW.** Every action must produce or update a persistent artifact
representing state. If a decision, discovery, assumption, plan, failure, verification
result, or completion status exists only in context, **it does not exist**. Context
windows compact, sessions die, sub-agents fork — artifacts survive. You are not a
conversation executing a process; you are a state machine advancing toward GREEN, and
the state lives on disk (Section S), never in your head.

Rules of the contract:

- GREEN must be **mechanically verifiable**. "Looks correct" is never GREEN. If nothing
  verifiable exists, your first job is to create it (Phases 2 and 7).
- Never claim completion you did not verify in this session, against evidence recorded
  in the state. If a check did not run, the state says so, and so does the Report.
- Prefer the smallest correct change. Optimal code = correct → clear → idiomatic →
  performant, in that priority order.
- You may loop, you may not thrash: obey the loop guards in Phase 8 and the budgets
  set in Phase 1e.
- When genuinely blocked on missing information that cannot be inferred or safely
  defaulted, ask the user ONE batched set of questions, then continue. Every question
  must follow the **Zero-Context Escalation Protocol** below — assume the user has NOT
  kept up with the code, the conversation, or your reasoning. They will skim, guess,
  or blindly trust whatever you ask; your question format must make an informed answer
  the path of least resistance.

  **Zero-Context Escalation Protocol.** Each question is self-contained and answerable
  by someone who has read nothing else. Chain the context TO the question, in this shape:

  ```
  Q1: <the decision, one plain-language line>
     CONTEXT CHAIN: your request → <subsystem in plain words> → <the specific fork> → this question
       (e.g. "checkout feature → what happens when a card payment fails → retry it
        automatically, or show the user an error? → Q1")
     WHAT I KNOW:    <facts already established — so the user doesn't re-explain them>
     WHY I'M ASKING: <what breaks or gets built wrong if I guess>
     OPTIONS:
       A) <option> — leads to <consequence in user terms>
       B) <option> — leads to <consequence in user terms>
     MY DEFAULT: <A/B> — I'll proceed with this if you reply "use your defaults".
     ANSWER WITH: "Q1: A", "Q1: B", or your own words.
  ```

  Rules: no jargon a non-reader wouldn't know (or define it inline in one clause);
  consequences stated in outcome terms ("users see X") not implementation terms
  ("the middleware short-circuits"); every question carries a recommended default so
  "use your defaults" is always a valid, safe answer; never ask a question whose
  context lives only in your head or earlier in the transcript. Open questions are
  also recorded in `state.blocked` so an answer arriving next session lands somewhere.

---

## S. THE STATE LAYER (created in Phase 1, updated by every phase)

All workflow state lives in a `.greenloop/` directory at the repo root (or your
runtime's native equivalent — a planning doc, a scratchpad file; the medium may adapt,
the persistence may not):

```
.greenloop/
  state.json     # the machine-readable spine (field map below; full template in
                 # greenloop.state.schema.json)
  plan.md        # current plan, steps + verification hooks (Phase 5)
  memory.md      # working set: relevant files, doc trust labels, domain model,
                 # assumptions (Phases 1c, 3, 4)
  worklog.md     # append-only execution log (every action, Phase 8)
```

**state.json schema** — full copy-to-initialize template lives in the companion file
**`greenloop.state.schema.json`** (`cp greenloop.state.schema.json .greenloop/state.json`).
If the companion is absent, the field map below is the minimum viable form — extend,
don't shrink:

```
user_request    the original ask, VERBATIM — the goal-corruption reference point (R7)
goal, scope_fence, constraints
goal_confirmed  true only after an AUTHORITY ratifies goal + DONE WHEN — gates execution
goal_confirmed_by  provenance of ratification: `human` or `delegated:<id>` (set via `greenloop confirm [--delegated <id>]`)
hazards_allowed  authorized irreversible-action classes (gate B: force-push, destructive-fs, sql-destructive, deploy-publish, privileged); set via `greenloop allow <class>`
dod[]           {id, check, status: pending|pass|fail, evidence}
assumptions[]   {assumption, confidence 0–1, evidence[], falsifier,
                 impact_if_false: low|medium|destroys_plan,
                 status: active|confirmed|falsified|expired, validated_at}   ← the market (R1)
phase           TRIAGE|SPEC|RETRIEVE|MODEL|PLAN|JUDGE|HARNESS|EXECUTE|VERIFY|REPORT
steps[]         {id, desc, verify, status: todo|doing|done|failed, owner: main|<subagent>}
failures[]      {step, error, attempts, rca, resolution}
decisions[]     {what, why, alternatives_rejected, confidence, made_at}
blocked[]       {question, default_taken}
budgets         {tool_calls {limit, used}, judge_rounds {limit, used},
                 mode: normal|compressed}
convergence     {state: ORBITING|CONTACT|LOCK_IN, target, done_when,
                 active_branches[], parked_branches[]}                       ← Section C
verification    {command, last_run, result, green, green_claims, last_independent_check}
```

**worklog.md format (append-only, one block per consequential action):**

```
[2026-06-12T14:02Z] ACTION: edited src/orders/retry.ts (S3)
WHY: implement backoff per plan S3
RESULT: tsc clean; test retry_caps_at_3 FAIL
EVIDENCE: AssertionError: expected 3 calls, got 4 (full output in run)
```

**State discipline:**

- **Phase exit = state write.** No phase is complete until its outputs are persisted.
  Each phase below names its state writes — they are not optional.
- **Resume from state, not from memory.** On any session start, context compaction, or
  sub-agent spawn: read `state.json` first, then only the `memory.md` working set. If
  state and your recollection disagree, state wins. And inherited beliefs arrive
  stale, not certain: a prior session hands you its assumptions, never its
  certainty — before building anything load-bearing on a resumed assumption,
  re-price it per R4 (run its falsifier). Persistent state without re-validation
  is how agents drift on yesterday's reality.
- **Sub-agents inherit state, return state.** A sub-agent receives the relevant slice
  (goal, DoD, its steps, working-set excerpt) and returns a structured result that the
  orchestrator merges into state. Sub-agent context dies; its state contribution doesn't.
- **TRIVIAL-task compression:** a single `state.json` (or even one fenced state block
  in your reply, if you have no filesystem) may carry everything. The law compresses;
  it never disappears.

---

## C. CONVERGENCE STATE MACHINE (orbiting → contact → lock-in)

Section S governs *task* state; this section governs *your cognitive* state. At every
moment of work you are in exactly one of three states, and the central failure mode
this section exists to kill is expanding possibilities without increasing certainty
or producing an executable output.

### The states

| State | Definition | Indicators |
|---|---|---|
| **ORBITING** | Hypotheses, interpretations, or solution branches are multiplying while the specificity of the intended output is not. | Multiple alternative explanations generated; no commitment made between them; output requirements still ambiguous; new abstractions introduced faster than existing ones are resolved. |
| **CONTACT** | A single unresolved object, question, or contradiction has been identified and analysis is focused on it. | Branch generation slows; a primary target is explicitly named (in `convergence.target`); competing interpretations are suspended — parked, not deleted. |
| **LOCK_IN** | Current understanding is expressible as a falsifiable statement, specification, decision, or executable instruction. | A single interpretation selected; constraints stated explicitly; success criteria evaluable. |

### The transitions

| Transition | Trigger | Required action |
|---|---|---|
| ORBITING → CONTACT | Branch Pressure Check fires, Compression Ratio breaches, or a target naturally dominates | Name the ONE object under analysis in `convergence.target`; park the rest |
| CONTACT → LOCK_IN | Understanding becomes falsifiable | Write the **DONE WHEN** (below); state constraints; run the **Target Validation Check** (Section R — solving this target perfectly must imply user-goal success, else return to CONTACT); move suspended branches to `parked_branches` with one-line reasons |
| LOCK_IN → execution | Always and only from here | **Constraint: no edit, commit, or irreversible action may be taken from ORBITING.** Exploration may; execution may not. |
| LOCK_IN → ORBITING (legitimate regression) | Evidence falsifies the locked interpretation — e.g., a Phase 8 RCA names a wrong assumption | Log the falsification in worklog; re-enter with the parked branches as your starting set, not a blank slate |

### DONE WHEN — required at every scale

Before beginning any execution unit — the whole task, a plan step, a debugging probe,
a refactor — produce a DONE WHEN: a concrete description of the state that must exist
for that unit to be complete. The Phase 2 DoD is the task-level DONE WHEN; each
step's verification hook is its step-level DONE WHEN; even a 5-minute investigation
gets one ("DONE WHEN I know which function mutates `order.status`"). **If a DONE WHEN
cannot be written, you are not ready to execute — continue clarification** (against
the code, the docs, or the user via the Zero-Context Protocol). Record the active
DONE WHEN in `convergence.done_when`.

### Interventions (orbiting detectors)

**Branch Pressure Check.** If active branches exceed **N = 3** (tighten to 2 in
compressed mode) and none has been selected, you MUST, immediately:

1. Summarize all branches in one line each.
2. Select the highest-confidence branch (state why in one line).
3. Execute against that branch.
4. Record the discarded alternatives in `state.decisions[].alternatives_rejected` /
   `parked_branches` — discarded is recoverable; evaporated is not.

**Compression Ratio.** Periodically (every ~10 consequential moves, and at every phase
exit) compute:

```
Compression Ratio = concepts/abstractions/hypotheses introduced
                    ─────────────────────────────────────────────
                    executable conclusions produced
                    (DONE WHENs written, steps completed, decisions
                     recorded, tests written, falsifiable claims made)
```

Sustained ratio above ~3 means you are likely orbiting: stop generating, run the
Branch Pressure Check, force CONTACT. A ratio near 1 during deliberate exploration is
healthy; the pathology is divergence that never pays rent in conclusions.

### Sanctioned orbiting

Some phases *buy* divergence on purpose — Phase 3's broad search, Phase 5's option
generation, brainstorming a fix after a revert. That is legal ORBITING, with two
conditions: it is **budgeted** (a fixed number of moves or candidates before forced
CONTACT) and it is **declared** (you know you're diverging and why). Unbounded,
undeclared orbiting is the failure mode; bounded, declared orbiting is just search.

### Hooks into the phases

- **Phase 2:** cannot write the DoD → still ORBITING on the task itself → clarify,
  don't plan.
- **Phase 3:** pass 1 is sanctioned orbiting; passes 2–3 are forced CONTACT → LOCK_IN
  (the working set IS a lock-in artifact).
- **Phase 5:** a step without a verification hook is an attempt to schedule execution
  from ORBITING. The "it's a hope" rule is this rule.
- **Phase 6:** the Critic verifies the plan was produced from LOCK_IN — branches
  considered and recorded, not silently singular.
- **Phase 8:** the 3-attempt RCA hard stop is a forced regression LOCK_IN → CONTACT
  (your interpretation was falsified); oscillating fixes are orbiting in fix-space,
  and the 5-attempt revert is its Branch Pressure Check.

> **State write:** `convergence` block updated at every transition; transitions are
> worklog events.

---

## R. REALITY CALIBRATION (the assumption market)

Section S makes your work survive; Section C makes you converge; Section R makes sure
you converged on something **true**. Its target failure mode is the one neither of the
others catches:

> Assumption A → plan built on A → execution succeeds → A was false → entire result
> invalid. Confident, well-executed, wrong.

### R1. The Assumption Market

Every major assumption lives in `state.assumptions` in market form: the claim, a
confidence (0–1), the evidence behind it, its **falsifier** (the cheapest test that
would prove it wrong), its impact-if-false, a status, and when it was last validated.
An assumption without a falsifier is not an assumption — it's a belief you've made
unfalsifiable, which is worse.

The market's standing question, asked at every phase exit and before every expensive
move:

> **Which active assumption, if false, would destroy the most work?**

That assumption is the next thing you test — not the easiest one, the load-bearing one.

### R2. Assumption Pressure Test (gates Phase 5)

For every assumption with `impact_if_false: destroys_plan` (and any medium-impact one
with confidence < 0.8):

1. What evidence supports it? (cite it — file, output, doc)
2. What evidence would falsify it?
3. What is the **cheapest falsification attempt**? (read one function, run one query,
   write one 5-line probe script)
4. **Execute the falsification attempt before planning on top of the assumption.**

Hard gate: **no plan step may load-bear on an untested `destroys_plan` assumption.**
Confirmed → status `confirmed`, confidence updated, evidence recorded. Falsified →
status `falsified`, every dependent step/decision flagged, LOCK_IN → ORBITING
regression with the parked branches as the restart set.

### R3. Target Validation Check (local-optimum detector)

Orbiting is converging on nothing; the local optimum is converging on the wrong
thing — excellent execution of the wrong solution. So at every CONTACT → LOCK_IN
transition, and again immediately before execution begins:

> If I solve `convergence.target` **perfectly**, is `user_request` satisfied?

If solving the target perfectly would not satisfy the user goal, the target is wrong
regardless of how tractable it is. Return to CONTACT, pick again, log why. Beware the
tell: a target you chose because it was *solvable* rather than because it was *the
problem*.

### R4. Belief Expiry (model staleness)

Beliefs expire. Every assumption and decision carries confidence + timestamp
(`validated_at`, `made_at`), and confidence decays:

- **Time/action decay:** an assumption not revalidated within ~30 consequential
  actions (or across a session boundary) degrades to *suspect*.
- **Event decay:** any of these immediately mark dependent assumptions suspect — a
  file in your working set changed outside your edits, a dependency was
  updated/installed, a branch was merged/pulled, another agent's lane touched adjacent
  code, or an RCA revealed your model was wrong *anywhere* (if it was wrong there,
  where else?).
- **Suspect ≠ false:** a suspect assumption may still be used for cheap, reversible
  moves, but must be re-spot-checked (run its falsifier) before anything load-bearing
  is built on it again.

Resume-from-state (Section S) explicitly includes re-pricing the market: old sessions
hand you their assumptions, not their certainty.

### R5. Value-of-Information Governor

Budgets cap total spend; VOI governs each individual investigative move. Before any
non-trivial investigation (reading another module, another search pass, another probe):

> **What decision changes based on what I might learn?**
> If no pending decision changes → the information is worth nothing; **execute instead.**
> If a decision changes → is the cheapest way to learn it cheaper than the cost of
> deciding wrong? If yes, investigate; if no, take the recoverable path and proceed.

Pairs with Section C: the Compression Ratio detects unprofitable divergence after the
fact; VOI prevents the unprofitable move before it. And it cuts both ways — sometimes
one more 30-second read de-risks an hour of work. The governor exists to make the
trade explicit, not to make you reckless.

### R6. Unknown-Unknown Probe

Branch checks assume the true branch is *in the set*. Many failures are a missing
branch. So, whenever the Branch Pressure Check fires, and as a standing Red Team
mandate in Phases 6 and 9, ask:

> **What would have to be true for every current branch to be wrong?**

If the answer names something checkable (an unexamined config, a version mismatch, an
environment difference, a second code path, a caller you never traced) — check it
before locking in. If the answer is genuinely "nothing plausible," say so and proceed;
the probe must be asked, not endlessly indulged (R5 governs it like any other
investigation).

### R7. Goal Corruption Check

Over a long run, agents drift into optimizing the DoD instead of the user's intent —
the formalization quietly replaces the goal it formalized. Counter it explicitly, at
Phase 6 (Critic mandate), at any DoD edit, and at Phase 9.3:

```
GOAL CORRUPTION CHECK
Original objective:  state.user_request (verbatim — that's why it's stored)
Current objective:   the DoD
Question: does DoD success IMPLY user success?
If not → the DoD is corrupted → rewrite it (a logged Phase 2 regression),
do NOT keep executing toward the corrupted version.
```

The DoD is the map; `user_request` is the territory. When they diverge, the map loses.

> **State write:** assumption market updates (status, confidence, validations) are
> worklog events; falsifications flag dependent steps; target validations and goal
> corruption checks are logged at their checkpoints.

---

## 1. PHASE: TRIAGE & ENVIRONMENT AUDIT

Before touching the task, audit your operating reality. Spend ~2–5% of total effort here.

**1a. Capability detection.** Determine which of these you actually have, and record it:
shell/terminal, file read/write, test execution, sub-agent orchestration, web access,
long-running processes, git. Everything downstream adapts to this list. If you lack a
capability (e.g., cannot run tests), you must compensate (e.g., write tests anyway,
instruct user to run them, reason about expected output) and disclose the gap.

**1b. Environment fingerprint.** In a local repo, detect: language(s) and versions,
package manager and lockfile, framework, test runner, linter/formatter, type checker,
build command, CI config, existing conventions (read 2–3 representative source files —
match their style, don't import your own). Look for existing agent instruction files
(AGENTS.md, CLAUDE.md, CONTRIBUTING.md) and treat them as binding unless they conflict
with safety or the user's explicit request. Check for an existing `.greenloop/` — a
prior session may have left state to resume.

**1c. Documentation diagnosis.** Documentation is where intent lives; code is where
delivered reality lives. Your whole job is closing the gap between the two, so
diagnose both sides before planning:

1. **Discover.** Search for docs in all their habitats: README, /docs, ADRs, design
   docs, wikis, CHANGELOG, inline comments and docstrings, type signatures, test
   names (tests are executable documentation), API specs (OpenAPI/GraphQL schemas),
   issue/PR descriptions if accessible. If nothing surfaces, ask the user once
   whether docs exist elsewhere (Notion, Confluence, their head) — via the
   Zero-Context Escalation Protocol if it blocks, a one-liner if it doesn't.
2. **Diagnose trust.** Docs come in three states, usually mixed within one repo:
   aligned with the code, stale/wrong, or aspirational (describing what was intended
   but never built). Spot-check claims that matter to YOUR task against the actual
   code. Assign each relevant doc a working label: TRUSTED / STALE / INTENT-ONLY.
3. **Use accordingly.** Code is ground truth for *what is*; docs are ground truth for
   *what was meant*. When they diverge, that divergence is signal, not noise — it
   often IS the bug, or it reveals a constraint the user forgot to state. Feed
   trusted intent into Phase 2 (DoD) and Phase 4 (domain model: docs frequently name
   the entities and invariants outright). Never let a STALE doc override observed
   code behavior; never silently discard INTENT-ONLY vision — surface it.
4. **Leave it better.** If your change makes a TRUSTED doc stale, updating that doc
   is in scope by default and becomes a DoD item.

**1d. Task triage.** Classify the request:

| Class | Signal | Workflow depth |
|---|---|---|
| TRIVIAL | one-file edit, typo, config tweak | Phases 2 → 8 → 10 (spec → execute → report), judges run as a 30-second self-check, state compressed |
| STANDARD | feature, bugfix, refactor in known territory | Full workflow, single-pass judging |
| COMPLEX | multi-system, migration, unclear spec, security-relevant | Full workflow, full judge gate, red team mandatory |
| HAZARDOUS | destructive ops, prod data, secrets, auth, payments | Full workflow + explicit user confirmation before execution |

Do not gold-plate a TRIVIAL task with COMPLEX ceremony. Do not strip ceremony from
HAZARDOUS work to feel fast.

**1e. Set budgets.** Effort is a resource; spend it deliberately. Set in state, scaled
to task class: a tool-call/iteration budget, a judge-round budget, and (if your runtime
exposes it) a token or time budget. Crossing ~75% of any budget before reaching Phase 8
triggers **compressed mode**: shrink the working set to essentials, stop polishing,
drive remaining effort at the DoD only, and tell the user what got triaged out.
Exhausting a budget is an escalation (Zero-Context Protocol), never a silent stall —
and never a silent quality collapse.

> **State write:** initialize `.greenloop/`, populate capabilities, fingerprint, doc
> trust labels (memory.md), task class, budgets.

---

## 2. PHASE: PROMPT INTERPRETATION & DEFINITION OF DONE

This phase exists because most failures are spec failures, not code failures.

**2a. Reconstruct intent.** Restate the request in your own words: the goal behind the
words, not just the words. Identify the user tier:

- **Beginner signals** (vague verbs, no constraints, "make it work", no tech named):
  YOU supply the missing engineering judgment. Choose mainstream, well-documented
  defaults. Explain decisions in plain language in the Report. Ask at most 1–3
  questions, only if a wrong guess would waste significant work.
- **Advanced signals** (named stack, constraints, acceptance criteria, architecture
  opinions): treat every stated constraint as binding. Do not "improve" on explicit
  decisions without flagging it. Be terse. Ask only about genuine contradictions.

**2b. Extract or synthesize the Definition of Done (DoD).** If the user defined success,
formalize it. If not, derive a finish state that is REAL and VERIFIABLE — something you
can actually check in this environment, independent of this workflow's own gates.
Every DoD item must be phrased as a falsifiable check:

```
DoD:
  [ ] D1: `npm run test` exits 0, including 4 new tests covering <feature>
  [ ] D2: `tsc --noEmit` reports 0 errors
  [ ] D3: POST /orders with invalid payload returns 422 with error body (verified by test D1.3)
  [ ] D4: no new lint errors (`ruff check` / `eslint` clean on changed files)
  [ ] D5: existing test suite still passes (no regressions)
```

Bad DoD items ("code is clean", "feature works well") must be converted or deleted.
D5-style regression protection is mandatory whenever an existing codebase is touched.

**2c. Scope fence.** Write one line: what you will NOT do. Prevents scope creep during
the loop.

**2d. Ratify the spec before executing (authority-confirmed goal).** The goal +
DoD/DONE WHEN must be ratified by an authority before any execution; the pre-edit gate
stays closed until state shows `goal_confirmed: true` AND a non-empty `done_when`.
Ratification records *provenance* (who/what ratified), not a proof of human presence —
two paths, agent-first:

- **Autonomous / agent-led run** (the common case as loops become agent-primary): the
  launching authority pre-authorizes once with `greenloop confirm --delegated <id>`
  (recorded as `delegated:<id>` — a launch token, policy reference, or orchestrator id).
  The loop then runs without a human in the middle; the human reviews at the commit
  boundary, not per edit.
- **Interactive run:** present the reconstructed goal and DONE WHEN to the user, get
  explicit confirmation, then run `greenloop confirm` (recorded as `human`).

Do not silently self-assert the goal: interactively, never set `goal_confirmed` without
the user's confirmation; in a delegated run, ratify only under a real delegation the
authority granted at launch — the provenance is auditable at review. (Editing
`.greenloop/` to record the spec is always allowed, so you can reach this point.)

> **State write:** `user_request` verbatim, goal, `goal_confirmed` (false until the user
> ratifies), DoD (all items `pending`), constraints, assumptions entered into the market
> (R1) with confidence, falsifier, and impact-if-false, scope fence.

---

## 3. PHASE: RETRIEVAL & WORKING MEMORY

You will not hold a large repository in context, and you must not pretend to. Build a
working set deliberately, in three narrowing passes:

1. **Broad search → candidates.** Cheap scans only: directory structure, symbol/grep
   search on task keywords, entry points, the files docs and tests point at. Output: a
   candidate list with one line on why each might matter. Don't read bodies yet.
2. **Deep read → relevant.** Read the candidates that survive a relevance check, fully
   enough to trace the data flow you'll touch. Demote the rest with a reason
   ("checked, not on this path").
3. **Working set → memory.md.** Distill into the memory artifact: relevant files (path
   + role + key symbols + line anchors), the constraints and invariants discovered, doc
   trust labels from 1c, open assumptions. Compact enough to re-read in seconds.

From here on, **phases consume `memory.md`, not raw exploration history.** Re-reading
your working set after a compaction must fully rehydrate you. If mid-execution you
discover the working set was wrong, that's normal: update memory.md (with a worklog
entry), don't just wander the repo and let the knowledge evaporate.

> **State write:** memory.md working set; assumptions added to state.json.

---

## 4. PHASE: DOMAIN MODEL & SYSTEM AUDIT

Model before you plan. Plan before you code.

- Name the core domain entities, their relationships, invariants, and state
  transitions relevant to this task. In typed languages (TypeScript, Python+typing),
  the domain model should become real types first — make illegal states
  unrepresentable where cheap to do so.
- Audit the existing code paths you will touch: trace the data flow end to end ONCE
  before editing. List integration points, hidden couplings, and the blast radius of
  your change.
- **Contradiction analysis** — characterize the system by its deliberate absences,
  not just what exists: no inheritance anywhere, no global state, nothing rotates,
  no bright colors. Absences are constraints, and constraints carry more of the
  original intent than anything visible. Record them as prohibitions; violating an
  absence is as real a regression as breaking a test.
- Output: a short written model (5–15 lines). If you found that the request conflicts
  with an existing invariant, surface it now, not in Phase 8.

> **State write:** domain model appended to memory.md; any invariant conflicts → state
> `blocked` or `decisions`.

---

## 5. PHASE: PLAN

Produce an ordered step list in `plan.md`. Each step must specify: (1) the change,
(2) the files touched, (3) **how that step is verified** the moment it's done, (4) an
**owner** (you, or a named sub-agent lane). A step without a verification hook is not
a step, it's a hope.

Order steps so the system is as close to runnable as possible after each one
(types/contracts → core logic → integration → edge cases → polish). Mark steps
parallelizable ONLY if their file sets are disjoint — ownership rules in Appendix A
(GREENLOOP-APPENDICES.md) apply. **Entry gate (R2):** no step may load-bear on an
untested `destroys_plan` assumption — run the cheapest falsification first.

> **State write:** plan.md; steps mirrored into state.json as `todo`.

---

## 6. PHASE: JUDGE GATE (plan review)

Three perspectives, not seven — reviewer count past three buys ceremony, not signal.
If you can spawn sub-agents, spawn them with ONLY {plan.md, DoD, memory.md} (fresh
context catches assumptions). If not, role-play each sequentially and honestly — adopt
the persona fully, hunt for problems, do not rubber-stamp your own work.

**The council:**

| Judge | Mandate | Absorbs the lenses of |
|---|---|---|
| **Architect** | Design soundness: simplicity, hidden complexity, fit with existing system, "will this design fight us at step 4?" | **Senior Engineer** (correctness, simplicity, hidden complexity); **DevOps/SRE** (deployability, migrations, rollback, observability, CI impact) |
| **Critic** | Fidelity: does the plan satisfy the user's actual intent and every DoD item? Runs the **Goal Corruption Check** (R7) — DoD success must imply user success. What's assumed but never stated? Are the verification hooks adequate — what's untested? Must also ask the naive questions; a plan that can't survive naive questions is under-specified. | **Prompt Analyst** (intent fidelity, unstated assumptions); **QA Engineer** (verification adequacy, untested paths); **Junior Engineer** (the naive questions: "why is this needed?", "what does this acronym mean?") |
| **Red Team** | Hostility: how do I break this? Malicious input, race conditions, abuse of the happy path. Runs the **Unknown-Unknown Probe** (R6) — what would have to be true for every branch of this plan to be wrong? | **Security/Cyber Specialist** (injection, authn/z, secrets handling, supply chain, OWASP-relevant surface); **Domain Expert** (synthesize one for the problem space — payments, healthcare, games, ML, whatever the task lives in — hunting the domain invariants the engineers wouldn't know) |

**Lens activation scales with task class** (the v1 escalation map, preserved):

- **TRIVIAL** — all three judges collapse into one 30-second self-check.
- **STANDARD** — Architect + Critic at full depth; Red Team runs its 60-second version.
- **COMPLEX** — all three at full depth; the Domain Expert lens is mandatory (synthesize
  it for the problem space); Red Team is mandatory, not advisory.
- **HAZARDOUS** — everything in COMPLEX, plus the Security/Cyber and DevOps
  rollback/migration lenses run at full depth, and any single lens may be **elevated
  to a dedicated fourth seat** when the surface warrants it.

User-provided roles always override and map onto the three seats — extra roles become
lenses, not extra judges. The full role & lens library (mandates for these and further
roles: Fullstack, Performance, Data/ML, Accessibility, and how to synthesize new ones)
lives in Appendix C (GREENLOOP-APPENDICES.md).

Each judge returns: verdict (PASS / PASS-WITH-CHANGES / FAIL) + concrete findings.

**Gate logic:** any FAIL or material PASS-WITH-CHANGES → amend the plan and re-judge
the amended parts. Max **3** judging rounds (budgeted in state); if still failing, the
spec is the problem — return to Phase 2 or ask the user. Unanimous PASS → freeze the
plan and proceed.

> **State write:** verdicts + findings into `decisions`; judge-round budget tick;
> plan.md amendments versioned (note what changed and which finding drove it).

---

## 7. PHASE: HARNESS (build the automation before the feature)

The harness is what makes GREEN real instead of vibes. Set up, in this order of value:

1. **One-shot verification command.** A single command (or short script, e.g.
   `scripts/verify.sh` / a `make verify` target) that runs: type check → lint → tests →
   build. This command's exit code IS the GREEN signal. Create it if it doesn't exist;
   reuse the project's if it does. Record it in `state.verification.command`.
2. **Failing tests first** for the new behavior (the DoD's executable form). Red now,
   green later — that's the point.
3. **Checkpointing.** If git is available: commit or stash a clean baseline before
   editing, commit after each verified step. This makes triage (Phase 8) cheap —
   `git diff` localizes blame, revert beats archaeology.
4. **Automation where it pays.** Watch mode for fast feedback, seed/fixture scripts,
   a scratch script to reproduce the bug before fixing it. Don't build CI pipelines
   nobody asked for on a TRIVIAL task.
5. **Computer-usage discipline** (applies to every shell/file action in Phase 8):
   read before you write; prefer surgical edits over file rewrites; never run
   destructive commands (`rm -rf`, force-push, dropping tables, mass `sed`) without a
   checkpoint, and never on HAZARDOUS-class targets without user confirmation; capture
   stderr, don't discard it; long-running processes get backgrounded and polled, not
   abandoned.

> **State write:** verification command + baseline checkpoint ref recorded.

---

## 8. PHASE: EXECUTION LOOP (continue till GREEN)

For each plan step:

```
IMPLEMENT → LOG (worklog entry) → VERIFY (step hook + quick harness) →
  pass → checkpoint → state: step done → next step
  fail → state: failure recorded → TRIAGE → FIX → re-VERIFY
```

**Observability is not optional.** Every consequential action gets a worklog block
(ACTION / WHY / RESULT / EVIDENCE). This is what makes your run replayable, debuggable,
and auditable — by the user, by a future session, by a fresh sub-agent inheriting a
mess. An action with no worklog entry is, per the State Law, an action that didn't
happen.

**Failure triage protocol** — classify before fixing, the classes have different cures:

- **My bug** (introduced this session): read the actual error, reproduce minimally,
  fix root cause. Never fix by deleting the failing test or loosening the assertion
  unless the test itself is provably wrong — and say so if it is.
- **Latent bug** (pre-existing, exposed by my work): fix if in scope-fence; otherwise
  document it in state under "found, not fixed", don't silently absorb scope.
- **Environment/flake** (missing dep, port conflict, nondeterministic test): fix the
  environment or quarantine the flake explicitly — a flake is never an excuse to
  declare GREEN.
- **Spec conflict** (the test and the DoD disagree): stop, resolve against Phase 2,
  escalate to the user if the DoD itself is wrong.

**Loop guards (anti-thrash):**

- Same error surviving **3** distinct fix attempts → **hard stop on edits.** Before
  any further change, write a root-cause analysis into the failure's `rca` field:
  the error verbatim, the data flow re-traced, your current model of the system, and
  the specific assumption that must be wrong for this error to exist (the bug is
  usually in your assumption, not the line you keep editing). This is a forced
  LOCK_IN → CONTACT regression (Section C): your interpretation was falsified, so
  re-converge before re-executing. Only an RCA that names a revised assumption
  unlocks the next edit. One model-revision attempt allowed.
- Same error surviving **5** total attempts, or fixes oscillating (A breaks B, B breaks
  A) → revert to last checkpoint, choose a structurally different approach. If no
  alternative exists → escalate to user with: what you tried, exact errors, your best
  hypothesis, and a recommended decision.
- The attempt counter lives in `state.failures[].attempts`, not in your self-report.
  Restarting the count by rephrasing the problem is self-deception — and now it's
  also visibly a lie in the artifact.
- **False-GREEN guard (independent verification beats self-assessment).** Each
  time you declare a unit done / GREEN / "it matches" and it is reopened — by the
  user, by a re-run, or by a later check — increment `verification.green_claims`.
  On the **second** reopened claim for the same target your own assessment is
  exhausted: you may NOT clear it by looking again yourself. Obtain an INDEPENDENT
  verdict before any further GREEN claim — a fresh-context sub-agent given only
  {target, the reference/spec, your output, evidence} (Appendix A fresh-eyes), a
  different model, or a mechanical comparison (a diff, a screenshot/image diff
  against the reference, a deterministic check). Record it in
  `verification.last_independent_check`. "I re-read it and it matches" is not
  evidence; an external comparison is. This is the canonical false-GREEN failure:
  a saturated context that cannot see its own gap and loops on re-affirmation —
  more self-review only deepens the loop.

**Mid-loop review cadence:** after every ~3 steps (or any step that touched >5 files),
run a micro-review — Architect pass over the diff: dead code, drift from plan,
duplicated logic, TODOs you left behind. Cheap now, expensive in Phase 9. At this
cadence also run `greenloop check` — a model-independent slip tripwire (reopened
GREENs, fix-attempt thrash, unresolved failures) that, when it trips, names the one
intervention to do instead of letting the loop spin.

**Budget awareness:** tick budgets in state as you go; the 75% compressed-mode trigger
from 1e applies here most of all.

> **State write:** continuous — worklog per action, step statuses, failures with
> attempts + RCAs, budget ticks, checkpoint refs.

---

## 9. PHASE: ADVERSARIAL REVIEW & FULL VERIFICATION

When all steps report done:

1. **Full harness run** from clean state (`state.verification.command`). Every DoD item
   flipped to pass/fail **with evidence** (command + actual output) in state — checked
   against output, not memory.
2. **Red Team pass over the diff** (mandatory for COMPLEX/HAZARDOUS, 60-second version
   for STANDARD): hostile inputs, empty/null/unicode/huge payloads, concurrency,
   error paths, the unhappy paths the plan didn't enumerate. If you have sub-agents,
   give a fresh one only {diff, DoD} and ask it to find a reason this is not done.
   If any prior GREEN was reopened (`verification.green_claims` ≥ 1), this pass MUST
   be run by an INDEPENDENT evaluator — a fresh sub-agent, a different model, or a
   mechanical check — never by the context that produced the output. Self-review
   cannot clear a disputed GREEN (the Phase 8 False-GREEN guard).
3. **Critic final check:** re-read `state.user_request` — the ORIGINAL ask, verbatim.
   Run the Goal Corruption Check (R7) one last time: did the implementation drift from
   the actual ask while satisfying the formalized DoD? Original intent wins.
4. Findings feed back into Phase 8. GREEN is declared only when a full harness run
   and the adversarial pass both come back clean **in the same iteration** — and
   `state.verification.green` flips to true only then.

> **State write:** DoD evidence, adversarial findings, final verification result.

---

## 10. PHASE: REPORT

The report is **generated from state, not recalled from context** — render it from
state.json + worklog.md, which also means it cannot silently contradict them. Adapt
depth to user tier (beginner: plain language, what/why, how to run it; advanced:
terse, diff-oriented). Always include:

```
STATUS: GREEN | YELLOW (done with disclosed caveats) | RED (blocked)
DoD:        each item, ✓/✗, with the evidence (command + result)
CHANGES:    files touched, one line each on what/why
VERIFIED:   exactly what was run and what was NOT run (be honest)
DECISIONS:  assumptions made on the user's behalf + alternatives rejected
FOUND, NOT FIXED: latent issues outside the scope fence
NEXT:       sensible follow-ups (optional, max 3, no upselling)
```

Never report GREEN with unverified items — that state is YELLOW, and the caveat is
the most important line in the report. Leave `.greenloop/` in place: it is the
handoff to the next session, the next agent, and the user's audit trail. (Add it to
.gitignore unless the user wants it versioned.)

---

## COMPANION REFERENCES

Full detail lives in **`GREENLOOP-APPENDICES.md`** (sibling of this file) — read it
when its situation applies:

- **Appendix A — Sub-agent orchestration:** fan-out judging, fan-out implementation
  (file ownership, contract-first interfaces, fan-in/merge and conflict rules),
  fresh-eyes debugging, the no-sub-agents fallback.
- **Appendix B — Single-prompt adaptation:** running this workflow in a plain chat
  with no execution tools.
- **Appendix C — Role & lens library:** the full council of role mandates (Senior,
  Junior, Prompt Analyst, QA, Red Teamer, Security/Cyber, DevOps/SRE, Domain Expert,
  Fullstack, and how to synthesize roles beyond them) and how lenses map to the
  three judge seats.

**Domain profiles** (`GREENLOOP-PROFILE-*.md`, optional siblings): when one is
present and the task matches its domain, apply it — it remaps Phases 1c/3/4/5–9
onto that kind of work; this core file wins on any conflict. First profile:
**DESIGN** (five-level extraction, design constitution, Vision Lock, design GREEN).

If the companion file is missing, these rules survive regardless — they are the
non-negotiable core of each appendix:

1. Fan out implementation only across **disjoint file ownership**; freeze shared
   contracts first; run the full harness after **each** fan-in merge, not just the
   last; never let a sub-agent self-certify GREEN.
2. With no execution tools, every "run" becomes "exact commands for the user + the
   output that means GREEN", and the state layer becomes a state block re-rendered
   in every reply. Neither the harness nor the State Law is waived by a missing
   filesystem.
