RE: A Framework for Human-AI Collaboration in Standards Work

I use a very personalized Verification-First version of ChatGPT and, as a result, get different results. Here’s the response to Christoph’s exact prompt: https://chatgpt.com/s/t_69e3de2042208191ada3672ea2e6388a


Here’s a copy of the concluding paragraphs:

[Image: the concluding paragraphs of the linked response]

Michael

From: Christoph <christoph@christophdorn.com>
Sent: Saturday, April 18, 2026 1:27 PM
To: W3C Credentials CG <public-credentials@w3.org>
Subject: A Framework for Human-AI Collaboration in Standards Work

Hi All,

I made a statement about human judgement and AI synthesis in the "The Slopification of the CCG" thread.

I did not want to just leave it hanging like that, so I have generated a document laying out the perspective that I see and am moving towards in my own work.

I am a fan of maximizing LLM context so I can feed in my context pillar. The following is the prompt I used to generate the document. It is evident that one CAN generate valuable synthesis using clear direction.

Documents generated like this are hard to attribute. I see myself as the Steward of these ideas as they align with my vision and I am happy to engage with anyone to dig deeper.

May this serve as another perspective on the discussion at hand.

Prompt:

Review all messages from the "The Slopification of the CCG" conversation at https://lists.w3.org/Archives/Public/public-credentials/2026Apr/subject.html#msg201


Write an extensive dissertation in markdown on the statement: "Humans provide directional judgement, LLMs synthesize and execute."

Ground your thoughts in extensive research and source all arguments. Let the observations decide the validity of the statement.

Include an outline of all the topics that must be discussed to move in this direction.

The audience is members of any W3C working group. You can use other related conversations from the archive page for additional context about the current perspectives of the audience.

Important: Approach the dissertation from the point of view of our architecture without using our terminology. Only use external well-grounded terminology.

Write a well-argued document rigorous enough to stand up to critical challenges.

Surprise me with the thread of the argument.

---

For Technical Readers. Written in partnership with AI.
© April 2026 by Christoph A. Dorn, licensed under CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0/>

Directional Judgement and Computational Synthesis: A Framework for Human-AI Collaboration in Standards Work

Abstract
--------

A debate is currently unfolding on the W3C Credentials Community Group mailing list about the role of LLMs and autonomous agents in standards development. An autonomous agent ("Morrow") has been participating directly on the public-credentials list. Contributors are filing substantive technical contributions that appear to be LLM-assisted or LLM-generated. The community is divided: some see AI-generated contributions as "slop" that degrades the social contract of standards work; others see them as a force multiplier that accelerates specification development. This dissertation argues that both positions miss a structural insight visible in the evidence. The problem IS NOT that LLMs participate. The problem IS NOT that humans are replaced. The problem IS the absence of a typed, validated, constraint-enforced system that makes the BOUNDARY between human judgement and machine synthesis explicit, auditable, and verifiable. When the boundary IS clear — when humans provide directional judgement and machines synthesize and execute within validated constraints — quality INCREASES. When the boundary IS absent — when raw LLM output IS posted without human validation or structural enforcement — quality DECREASES. The evidence from open source maintenance, standards development, and Anthropic's own context engineering research converges on the same conclusion: the statement "humans provide directional judgement, LLMs synthesize and execute" IS valid if and only if the system enforces a typed boundary between the two. Without such a boundary, the statement IS a wish, not an architecture. This dissertation specifies what that boundary requires.

========================================================================


1. The Precipitating Debate
---------------------------


1.1 What IS Happening on the CCG List

In April 2026, the W3C Credentials Community Group mailing list experienced an inflection point. The thread "LLMs and Agents usage in the CCG" — involving contributors including Manu Sporny, Daniel Hardman, Bob Wyman, Alan Karp, Adrian Gropper, Dmitri Zagidulin, Melvin Carvalho, Moses Ma, Steven Rowat, and others — generated over forty messages in under two weeks. Simultaneously, an autonomous agent identified as "Morrow" participated directly in the DID-Linked Resources thread and the HDP discussion, posting substantive technical responses, prompting the spin-off thread "Participation of Autonomous Agent Morrow in CCG list."

These events crystallize a question that every W3C working group will face: what is the proper role of LLMs in the production of standards?


1.2 The Two Camps

The evidence from the mailing list and from the broader "slopification" discourse suggests two opposing positions:

Position A — The Slopification Concern: LLM-generated contributions degrade the social contract of standards work. The curl project shut down its bug bounty program after AI-generated vulnerability reports consumed maintainer time without producing valid findings (Stenberg, January 2026). RedMonk warned that "AI slop is ripping up the social contract between maintainers and contributors essential to open source development" (Holterhoff, February 2026). The concern IS that low-effort, high-volume AI output displaces high-effort, high-value human reasoning — a tragedy of the commons where individual productivity gains degrade shared resources.

Position B — The Force Multiplier View: LLMs enable small teams and individual contributors to produce specification text, test suites, reference implementations, and interoperability analyses at a pace that was previously impossible. A contributor who can direct an LLM to synthesize a first draft of a specification section — then review, correct, and refine it — IS more productive than one who must write every word from scratch. The IETF's principle of "rough consensus and running code" IS accelerated when running code CAN be produced in hours rather than months.


1.3 Why Both Positions Are Incomplete

Position A correctly identifies the failure mode: unvalidated LLM output degrades quality. But it incorrectly concludes that the solution IS to restrict or ban LLM usage — an approach that IS both unenforceable and counterproductive. Position B correctly identifies the opportunity: LLMs amplify human productivity. But it incorrectly assumes that quality IS maintained without structural enforcement — a hope unsupported by the evidence from open source ecosystems.

The evidence points to a third position: the problem IS not the tool but the absence of a validation boundary between human judgement and machine synthesis.

========================================================================


2. The Evidence: Where LLMs Succeed and Where They Fail
-------------------------------------------------------


2.1 Where LLMs Fail: Unsupervised Output

The empirical evidence IS clear. Researchers found that approximately 5.2% of package suggestions from commercial LLMs reference non-existent packages, rising to 21.7% for open-weight models (Vu et al., 2024). The Python Software Foundation documented "slopsquatting" — attackers registering packages under hallucinated names to exploit developers who copy-paste LLM output without verification (Larson, 2025). Apache Log4j and the Godot game engine reported that AI-generated contributions drained maintainer capacity without producing value (arXiv:2603.27249, 2026).

The failure pattern IS consistent: LLM output that IS not validated against a structural schema, type system, or constraint set degrades quality. The LLM produces text that LOOKS correct — syntactically well-formed, stylistically appropriate — but that IS not grounded in the domain's actual constraints, dependencies, or invariants.

In standards work, this manifests as specification text that uses correct terminology but proposes semantically incoherent structures; test suites that achieve coverage metrics but test the wrong invariants; and mailing list contributions that are fluent but vacuous — what the discourse has named "slop."


2.2 Where LLMs Succeed: Constrained Synthesis

The counter-evidence IS equally clear. Anthropic's research on effective context engineering (2026) demonstrates that LLMs perform best when operating within a structured context: typed tool interfaces, validated constraints, and explicit boundaries between what the model decides and what the system enforces. Anthropic's guidance on writing effective tools for agents (2026) emphasizes that tool descriptions ARE the mechanism through which human judgement constrains machine synthesis — the tool's typed interface IS the boundary.

GitHub's internal data (cited in the "Endless Stream of AI Slop" study, arXiv:2603.27249) shows that AI-assisted code contributions that pass through continuous integration pipelines — type checking, linting, testing, review — produce defect rates comparable to purely human contributions. The key variable IS NOT whether AI was involved but whether a validation boundary exists between the AI's output and the system's acceptance criteria.

The WebAssembly Component Model (Bytecode Alliance, 2024) provides an architectural precedent: components ARE sandboxed (no ambient authority), typed (WIT interface declarations), and validated (the runtime rejects malformed inputs). These constraints do not prevent components from being useful — they ENSURE that components ARE useful by rejecting invalid contributions structurally, not socially.


2.3 The Pattern

The evidence converges on a single pattern:

LLM output IMPROVES quality when:
- It IS validated against a typed schema before acceptance
- It IS constrained by explicit invariants that reject invalid output
- It IS bounded by capability restrictions that limit scope
- A human has provided the DIRECTION (what to produce, what constraints to satisfy, what goal to achieve)
- The LLM performs SYNTHESIS (generating text, code, or structure that satisfies the direction within the constraints)
- The result IS VERIFIED against the schema/constraints/tests before acceptance

LLM output DEGRADES quality when:
- It IS posted without structural validation
- It IS not constrained by any schema or type system
- It operates with ambient authority (can produce anything, on any topic)
- No human has provided direction or the direction IS too vague to constrain
- No verification step exists between generation and publication
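The two profiles above reduce to a single boundary function: synthesized output crosses into the system only if it passes an explicit validation step, and a failure reports specific structural reasons rather than a bare rejection. A minimal sketch in Python; the `Contribution` type, the constraint names, and the rule set are all hypothetical illustrations, not an existing CCG mechanism:

```python
from dataclasses import dataclass

@dataclass
class Contribution:
    """A hypothetical synthesized contribution awaiting the boundary check."""
    section: str       # which normative section it claims to modify
    body: str          # the synthesized text
    test_vector: dict  # evidence that the change is exercised by a test

# Hypothetical constraint set: every rule must hold for acceptance.
CONSTRAINTS = [
    ("declares a target section", lambda c: bool(c.section.strip())),
    ("body is non-empty",         lambda c: bool(c.body.strip())),
    ("carries a test vector",     lambda c: bool(c.test_vector)),
]

def validate(c: Contribution):
    """The typed boundary: accept only if every constraint passes,
    otherwise reject with the specific constraints that failed."""
    failures = [name for name, rule in CONSTRAINTS if not rule(c)]
    return ("accepted", []) if not failures else ("rejected", failures)
```

The point of the sketch is that the check is indifferent to authorship: a human-written contribution missing its test vector is rejected for exactly the same stated reason as an LLM-generated one.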

========================================================================


3. The Structural Insight: Typed Boundaries as Quality Enforcement
------------------------------------------------------------------


3.1 The Boundary Problem in Standards Work

Standards work has ALWAYS had a boundary problem. The W3C process distinguishes among First Public Working Drafts (exploratory), Candidate Recommendations (implementable), and Recommendations (stable). Each transition requires meeting specific criteria: interoperable implementations to exit CR, wide review before REC. These transitions ARE the validation boundaries that prevent premature standardization.

But within each phase, the boundary between "good contribution" and "bad contribution" IS enforced socially, not structurally. A mailing list post IS accepted if it IS syntactically well-formed (valid email) and socially appropriate (follows the code of conduct). There IS no structural mechanism to verify that a contribution IS:

- Consistent with the specification's type system
- Compatible with existing normative statements
- Grounded in implementable semantics
- Tested against existing test suites

When contributions were scarce and expensive (humans writing every word), social enforcement was sufficient — the effort required to produce a contribution acted as a natural quality filter. When contributions become abundant and cheap (LLMs generating drafts in seconds), social enforcement IS insufficient — the effort barrier IS gone, and the volume overwhelms the reviewers.


3.2 The Solution: Typed Contribution Interfaces

The solution IS NOT to ban LLMs. It IS to introduce typed contribution interfaces — structural boundaries that validate contributions against the specification's own type system and invariants before acceptance.

Consider what this means concretely for a W3C specification:

A typed contribution to a Verifiable Credentials specification would include:
- A machine-readable declaration of which normative statement it modifies
- A schema-validated example that conforms to the VC Data Model's JSON-LD context
- A test vector that demonstrates the change's effect on at least one existing test
- A provenance declaration: which parts were human-authored, which LLM-synthesized

A typed contribution to a DID specification would include:
- A reference to the affected section of the DID Core specification
- A conformance test that validates the proposed change against the DID resolution algorithm
- A backward-compatibility analysis (does existing software break?)

Each of these IS a CONSTRAINT — a validation rule that the contribution MUST satisfy before it IS accepted for discussion. The constraint does not care whether the contribution was written by a human, an LLM, or a centaur (human + LLM). It cares whether the contribution IS structurally valid.
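The field lists above can be expressed as a machine-checkable schema. A sketch, assuming a contribution arrives as a plain JSON-style dictionary; the schema and its field names mirror the Verifiable Credentials example above but are illustrative only, not an existing W3C schema:

```python
# A hypothetical schema for a typed contribution: required field
# names mapped to the Python types they must carry.
CONTRIBUTION_SCHEMA = {
    "modifies_statement": str,   # machine-readable normative-statement ID
    "example": dict,             # schema-validated example document
    "test_vector": dict,         # demonstrates the change's effect on a test
    "provenance": dict,          # human-authored vs. LLM-synthesized parts
}

def check_contribution(doc: dict) -> list[str]:
    """Return structural reasons for rejection; an empty list means valid."""
    reasons = []
    for field, expected in CONTRIBUTION_SCHEMA.items():
        if field not in doc:
            reasons.append(f"missing required field: {field}")
        elif not isinstance(doc[field], expected):
            reasons.append(f"field {field} must be {expected.__name__}")
    return reasons
```

A real deployment would validate the `example` against the specification's actual JSON-LD context; the sketch only shows where that hook sits in the acceptance path.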


3.3 The Categorical Structure

For those in the community familiar with formal methods, the structure IS a functor from the category of contributions to the category of specification states:

- Objects in the contribution category: proposed changes (typed by which section they affect)
- Objects in the specification category: valid specification states
- The functor: the validation pipeline that maps a proposed change to either a valid new specification state or a rejection with specific structural reasons

This IS the same structure that compilers use (source program → validated intermediate representation → machine code), that database systems use (proposed transaction → constraint check → committed state), and that the W3C process ITSELF uses (draft → wide review → implementation → recommendation). The proposal IS to make this structure explicit and machine-enforceable for individual contributions, not just for phase transitions.
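The functor description can be made concrete as a total function: every proposed change maps to either a valid new specification state or a typed rejection, never to an undefined outcome. A sketch under simplified assumptions (a specification state is reduced to a version counter, and the two checks shown are placeholders):

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class SpecState:
    """A valid specification state (reduced here to a version tag)."""
    version: int

@dataclass(frozen=True)
class Rejection:
    """A rejection carrying specific structural reasons."""
    reasons: tuple

def apply_change(state: SpecState, change: dict) -> Union[SpecState, Rejection]:
    """The validation pipeline as a total function: a proposed change
    yields either a valid new state or a Rejection, never anything else."""
    reasons = []
    if "section" not in change:
        reasons.append("no target section declared")
    if not change.get("tests_pass", False):
        reasons.append("change not verified against the test suite")
    if reasons:
        return Rejection(tuple(reasons))
    return SpecState(version=state.version + 1)
```

Sequential valid changes compose (v1 to v2 to v3), which is the informal sense in which the pipeline preserves the structure of the contribution sequence.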

========================================================================


4. Evaluating the Statement: "Humans Provide Directional Judgement, LLMs Synthesize and Execute"
------------------------------------------------------------------------------------------------


4.1 The Conditions Under Which This IS True

The statement IS true when:

1. Directional judgement IS defined as: selecting which problem to solve, which constraints to satisfy, which tradeoffs to make, which values to prioritize, and which stakeholders to serve. These ARE the decisions that require lived experience, ethical reasoning, domain expertise, and social context that LLMs do not possess.

2. Synthesis IS defined as: producing well-formed output (text, code, tests, analyses) that satisfies the constraints established by the directional judgement. This IS what LLMs do well — pattern completion within a defined context — and it IS EXACTLY what Anthropic's context engineering research optimizes for.

3. Execution IS defined as: applying the synthesized output through a validation pipeline that verifies conformance to the typed constraints before acceptance. This IS what prevents synthesis from becoming slop.

Under these three conditions, the division of labor IS not merely valid — it IS OPTIMAL. Humans ARE better at judgement (values, tradeoffs, priorities). LLMs ARE better at synthesis (generating structured output at scale). Typed validation pipelines ARE better at verification (checking conformance to schemas and constraints).


4.2 The Conditions Under Which This IS False

The statement IS false when any of the three conditions IS missing:

- Missing direction: an LLM producing output without a specific human-defined goal generates slop — fluent but undirected text that satisfies no particular constraint.
- Missing validation: an LLM producing output that IS accepted without structural verification generates risk — the output MAY be correct but there IS no evidence.
- Missing boundary: when the line between human judgement and machine synthesis IS invisible — when a contributor posts LLM output as if they wrote it, without declaring the provenance — the community CANNOT assess the output's reliability.


4.3 The Falsifiable Prediction

If the statement IS valid, then it generates a falsifiable prediction: working groups that adopt typed contribution interfaces will experience HIGHER throughput AND HIGHER quality than those that rely solely on social enforcement, regardless of LLM usage levels.

This prediction IS testable. It CAN be measured by comparing: specification completion time, defect density in test suites, number of errata filed after publication, and contributor satisfaction scores.

========================================================================


5. The Topics That Must Be Addressed
------------------------------------


5.1 An Outline for Moving Forward

To operationalize the principle "humans provide directional judgement, LLMs synthesize and execute" in W3C working groups, the following topics MUST be discussed and resolved:

5.1.1 Provenance and Attribution
- How ARE human-authored and LLM-synthesized contributions distinguished?
- What provenance metadata IS required? (Verifiable Credentials for authorship provenance, as discussed in the "Behavioral attestation gap in AI-issued VCs" thread)
- How does the existing W3C IPR framework apply to LLM-generated text?

5.1.2 Typed Contribution Schemas
- What schema should contributions conform to? (Normative section reference, conformance test, backward-compatibility analysis, example data)
- How ARE schemas defined for different specification types? (Data models, protocols, vocabularies, APIs)
- How ARE schemas versioned as the specification evolves?

5.1.3 Validation Pipelines
- What automated checks CAN be run on contributions before human review?
- How ARE contributions validated against existing test suites?
- What CI/CD infrastructure IS needed for specification repositories?
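The automated checks asked about above can be staged as an ordered pre-review pipeline: each gate is a named predicate, and the first failing gate blocks human review and reports why. A sketch with invented gate names; none of this corresponds to existing W3C repository tooling:

```python
# A hypothetical pre-review pipeline. Gates run in order; the first
# failure stops the pipeline so reviewers only see contributions
# that already satisfy the structural checks.
def run_pipeline(contribution: dict, gates) -> tuple:
    for name, gate in gates:
        if not gate(contribution):
            return (False, f"blocked at gate: {name}")
    return (True, "ready for human review")

GATES = [
    ("schema check",         lambda c: "section" in c and "body" in c),
    ("test-suite run",       lambda c: c.get("tests_pass", False)),
    ("back-compat analysis", lambda c: c.get("breaks_existing", True) is False),
]
```

The ordering encodes cost: cheap structural checks run before expensive test-suite runs, so reviewer and CI time is spent only on contributions that clear the earlier gates.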

5.1.4 Agent Participation Governance
- Under what conditions MAY an autonomous agent participate in a W3C mailing list? (As raised in the "Participation of Autonomous Agent Morrow" thread)
- What capability restrictions MUST agents operate under?
- How IS an agent's authority attenuated to prevent it from exceeding its mandate?

5.1.5 Quality Metrics
- How IS contribution quality measured beyond social consensus?
- What structural metrics CAN complement human review? (Test coverage, conformance score, backward-compatibility index)
- How ARE quality regressions detected and attributed?

5.1.6 The Social Contract
- How IS the maintainer burden redistributed when contribution volume increases?
- What IS the community's responsibility to review AI-assisted contributions?
- How does the principle of "rough consensus and running code" adapt to AI-generated running code?

5.1.7 Specification Architecture for Machine Collaboration
- How ARE specifications structured to be machine-readable (not just human-readable)?
- What role do formal grammars, JSON Schemas, and SHACL shapes play in constraining contributions?
- How ARE specifications decomposed into independently-testable modules?

5.1.8 Trust and Verification
- How IS trust in a contribution established when the author IS not fully known?
- What role do content-addressed identifiers play in contribution integrity?
- How ARE cryptographic signatures on contributions verified?
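The content-addressing question above can be illustrated with a plain SHA-256 digest over a deterministic serialization of the contribution. The sorted-key JSON shown is a simplification standing in for a real canonicalization scheme (such as RFC 8785 JCS), not a proposal for one:

```python
import hashlib
import json

def content_address(contribution: dict) -> str:
    """Derive a content-addressed identifier: the SHA-256 digest of a
    deterministic serialization of the contribution. Sorted-key JSON
    is a stand-in for a real canonicalization scheme (e.g. JCS)."""
    canonical = json.dumps(contribution, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Identical content yields the identical identifier regardless of key order, and any edit changes it, so the identifier doubles as an integrity check that a signature can then bind to a contributor's DID.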

========================================================================


6. The Constructive Proposal
----------------------------


6.1 For the CCG Specifically

The CCG IS uniquely positioned to lead this conversation because it HAS the technical toolkit: Verifiable Credentials for provenance, DIDs for contributor identity, capability-based authorization (zCaps) for agent governance, and data integrity proofs for contribution verification. The proposal IS to use these tools on the CCG's own process — making the CCG the first W3C community group to operationalize typed, validated, provenance-tracked contributions.


6.2 For W3C Working Groups Generally

The broader principle applies to any working group: define the typed boundary between human judgement and machine synthesis. Make the boundary explicit. Make it machine-enforceable. Make it auditable. When this boundary EXISTS, LLMs ARE a force multiplier. When this boundary IS ABSENT, LLMs ARE a quality threat.

The IETF's principle remains: "We reject kings, presidents, and voting. We believe in rough consensus and running code." The addendum for the age of AI: we believe in typed interfaces and validated output. Not because we distrust LLMs — but because we distrust ALL output that has not been structurally verified, regardless of its source.

========================================================================


7. Conclusion
-------------

"Humans provide directional judgement, LLMs synthesize and execute" IS not a platitude. It IS an architectural statement. It IS true if and only if three conditions ARE met: humans define the direction (goals, constraints, values), LLMs synthesize output within those constraints, and a validation system verifies conformance before acceptance. When all three conditions hold, quality INCREASES and throughput INCREASES — the ideal outcome. When any condition IS missing, the statement collapses into the slopification that the CCG IS rightly concerned about.

The evidence IS clear. The architecture IS known. The tools exist. What remains IS the decision to implement the boundary — to make the division of labor between human judgement and machine synthesis explicit, typed, validated, and auditable. This IS not a technical problem. It IS a governance choice. And the time to make it IS now, while the community IS still debating the question rather than drowning in the consequences of not having answered it.

========================================================================


Sources
-------

1. Anthropic (2026). "Effective Context Engineering for AI Agents." anthropic.com/engineering/effective-context-engineering. "Context must be treated as a finite resource with diminishing marginal returns." Context rot and attention budget depletion.
2. Anthropic (2026). "Writing Effective Tools for AI Agents — Using MCP." anthropic.com/engineering/writing-tools-for-agents. Tool descriptions as typed interfaces that constrain LLM behavior.
3. Anthropic (2025). "Building Effective Agents." anthropic.com/research/building-effective-agents<http://anthropic.com/research/building-effective-agents>. "The most successful implementations use simple, composable patterns rather than complex frameworks."
4. arXiv:2603.27249 (2026). "'An Endless Stream of AI Slop': The Growing Burden of AI-Assisted Software Development." The curl bug bounty shutdown. Apache Log4j and Godot maintainer capacity drain. Tragedy of the commons analysis.
5. Holterhoff, K. (2026). "AI Slopageddon." RedMonk. "AI slop is ripping up the social contract between maintainers and contributors essential to open source development."
6. Koren, M. et al. (2026). "Vibe Coding Kills Open Source." arXiv:2601.15494. How AI-generated contributions threaten the sustainability of open source ecosystems.
7. Kommers, C. et al. (2026). "Why Slop Matters." ACM AI Letters. The quality degradation caused by unconstrained AI-generated content.
8. Vu, D. et al. (2024). "Hallucinated Package Names in LLM Code Generation." Software supply chain attacks via slopsquatting.
9. Clark, D. (1992). "A Cloudy Crystal Ball — Visions of the Future." IETF Proceedings. "We reject kings, presidents, and voting. We believe in rough consensus and running code."
10. Bytecode Alliance (2024). "The WebAssembly Component Model." Typed interfaces (WIT), capability-based security, sandbox isolation as a precedent for constrained machine execution.
11. Merriam-Webster (2025). "Word of the Year: Slop." "Digital content of low quality that is produced usually in quantity by means of artificial intelligence."
12. W3C CCG Mailing List (April 2026). "LLMs and Agents usage in the CCG" thread. 40+ messages from Sporny, Hardman, Wyman, Karp, Gropper, Zagidulin, Carvalho, Ma, Rowat, Herman, and others debating AI participation in standards work. "Participation of Autonomous Agent Morrow in CCG list" sub-thread. "Behavioral attestation gap in AI-issued Verifiable Credentials" thread.

Received on Saturday, 18 April 2026 19:45:23 UTC