I Stopped Trying to Prompt My Way Out of AI Slop. I Built a Scoreboard Instead.

A working session where an X post about “fixing AI slop” became a real, running quality gate — one that blocked a bad change, caught a silent production dip, and turned every failure into a permanent test. Here’s the whole build, the verbatim scores, and why a measurement layer is the thing standing between your brand and the slop machine.

The Reframe That Started It: Stop Fixing the Input

Slop is an output-side problem. Better prompts are a better coin flip, not a gate. You improve the odds that the first draft is good. You do not guarantee it. You do not catch the draft that comes back smooth and empty, because a better prompt and a worse prompt both produce text that looks finished.

The frame that cracked it open came from Machina (@EXM7777) and his piece “How To Fix AI Slop.” The argument: the problem is not on the input side. The problem is that nothing stands between the model’s output and your audience. The fix is not a sharper prompt — it’s a gate.

A gate scores every draft on a 0–1 scale against criteria you define, holds the line at a threshold, and lets nothing below that line reach a reader. That is the reframe. Stop trying to produce better inputs. Start measuring your outputs.

Building the Gate as a Folder, Not a Platform

The gate lives in an eval-loop/ directory. No platform, no subscription, no setup overhead. Files:

eval-loop/
  harness.yaml          orchestration config
  AGENTS.md             what each agent is allowed to do
  SOUL.md               the gate's stance (never rewrites, only scores)
  memory/
    rubric.md           your criteria + weights + threshold
    gold-standard/      your 5–10 real best pieces
  skills/
    judge/              the AI judge for judgment-requiring checks
  tools/
    metrics.py          deterministic checks (free, instant)
  inbox/                drafts waiting to be scored
  scored/               approved drafts
  rejected/             killed drafts, untouched
  suite/                growing pass/fail check set
  HEARTBEAT.md          running score line + audit trail

The split between tools/metrics.py and skills/judge/ is the load-bearing design choice. Deterministic checks — does it have a headline, is the link valid, does the output match the expected format — run in metrics.py for free, every time. Judgment checks — is this actually insightful, would a reader save this — go to the AI judge, which costs money and time. You never pay for judgment on something a machine could have caught.
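A minimal sketch of what the deterministic side in tools/metrics.py could look like, assuming Python. The function names, the 120-character headline limit, and the pattern argument are illustrative assumptions, not the contents of the real file. The link check only tests that a URL is well formed; it makes no network request.

```python
import re
from urllib.parse import urlparse

URL = re.compile(r"https?://\S+")


def has_headline(draft: str, max_len: int = 120) -> float:
    """1.0 if the first non-empty line is short enough to read as a headline."""
    lines = [ln.strip() for ln in draft.splitlines() if ln.strip()]
    return 1.0 if lines and len(lines[0]) <= max_len else 0.0


def links_well_formed(draft: str) -> float:
    """1.0 if every URL in the draft has a scheme and a host (syntax only)."""
    for url in URL.findall(draft):
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            return 0.0
    return 1.0


def matches_format(output: str, pattern: str) -> float:
    """1.0 if the whole output matches the expected format."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


def run_metrics(draft: str) -> dict[str, float]:
    """Run the free checks first; only survivors should go to the paid judge."""
    return {
        "headline": has_headline(draft),
        "links": links_well_formed(draft),
    }
```

Every check returns a score on the same 0–1 scale the judge uses, so the two halves can feed one aggregate.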

Running It Live: A 0.85 and a 0.16

Two drafts, same model, same session.

The slop draft — a paragraph about productivity. Grammar clean, rhythm smooth, nothing misspelled. Scored 0.16. Verdict: kill.

{
  "candidate": "candidate-b-slop.md",
  "mode": "content",
  "per_criterion": {
    "actionable": {"score": 0.05, "reason": "No action the reader can take; pure exhortation ('start your journey today')."},
    "accessible": {"score": 0.60, "reason": "Readable, but empty — nothing to actually understand."},
    "replicable": {"score": 0.00, "reason": "No steps. Inspirational, not structured."},
    "novel": {"score": 0.00, "reason": "Every sentence is a filler cliche; reader learns nothing new."}
  },
  "meta": {"bookmark_worthy": false, "reason": "Nobody saves this to implement later — caps aggregate at 0.5."},
  "aggregate": 0.16,
  "threshold": 0.7,
  "verdict": "kill",
  "flags": ["filler phrases: 'at the end of the day', 'it's worth noting', 'in today's fast-paced landscape', 'the possibilities are endless'"]
}

The real how-to — pick your best pieces, write four checks, score each draft against them. No filler, real steps. Scored 0.85. Verdict: ship.

The per-criterion scores: actionable 0.90, accessible 0.85, replicable 0.90, novel 0.75. Bookmark gate: passed.

The same model wrote both. That is the point.
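The arithmetic behind those two verdicts can be sketched in a few lines. This assumes equal weights, which the scores above do not confirm: the unweighted mean of the slop scores is 0.1625, in line with the reported 0.16, and the mean of the how-to scores is 0.85. The 0.5 cap comes from the slop verdict's own "caps aggregate at 0.5" note. The function names are hypothetical.

```python
THRESHOLD = 0.7      # the line both drafts were held to
BOOKMARK_CAP = 0.5   # a draft nobody would save cannot score above this

# Assumption: equal weights. The real weights live in rubric.md and are not shown.
WEIGHTS = {"actionable": 1.0, "accessible": 1.0, "replicable": 1.0, "novel": 1.0}


def aggregate(scores: dict[str, float], bookmark_worthy: bool,
              weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted mean of the criteria, capped when the bookmark gate fails."""
    total = sum(weights[name] for name in scores)
    value = sum(scores[name] * weights[name] for name in scores) / total
    return value if bookmark_worthy else min(value, BOOKMARK_CAP)


def verdict(value: float, threshold: float = THRESHOLD) -> str:
    return "ship" if value >= threshold else "kill"


slop = {"actionable": 0.05, "accessible": 0.60, "replicable": 0.00, "novel": 0.00}
how_to = {"actionable": 0.90, "accessible": 0.85, "replicable": 0.90, "novel": 0.75}

slop_score = aggregate(slop, bookmark_worthy=False)
how_to_score = aggregate(how_to, bookmark_worthy=True)
print(f"slop {slop_score:.4f} -> {verdict(slop_score)}")
print(f"how-to {how_to_score:.4f} -> {verdict(how_to_score)}")
```

The cap did not decide the slop verdict here, since the mean was already far below 0.5. It matters for the smooth draft that scores well on readability and still gives the reader nothing to keep.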

The Gate Blocking a Change — For Real

The regression gate is the check that catches the change that quietly made things worse. Version 2 of a pipeline changed the output format from ORDER-1234 to Order #1234. The aggregate dropped from 1.00 to 0.67. The gate fired:

| 2026-06-01 | baseline accepted (pipeline v1) | 1.00 | —     | accepted |
| 2026-06-01 | regression-gate (pipeline v2)   | 0.67 | -0.33 | BLOCKED — 03-format regressed 1.00→0.00; held for rework |

The check held the line. The broken version did not ship. The audit trail recorded exactly what changed and why it was blocked.

This is what the gate is for. Not a better prompt. A memory of what “correct” looks like, maintained as a file, enforced on every change.
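A rough sketch of that regression gate, under stated assumptions: three equally weighted checks (which is what a drop from 1.00 to 0.67 suggests, though the check list is not shown), a block on any per-check regression, and a regex for the v1 format. Only the name 03-format comes from the audit line; the other two check names are placeholders.

```python
import re
from dataclasses import dataclass

ORDER_ID = re.compile(r"ORDER-\d+")  # assumed shape of the v1 format


def format_score(output: str) -> float:
    """Deterministic format check: 1.0 on an exact match, else 0.0."""
    return 1.0 if ORDER_ID.fullmatch(output) else 0.0


@dataclass
class GateResult:
    aggregate: float
    delta: float
    regressed: list[str]
    blocked: bool


def regression_gate(baseline: dict[str, float],
                    candidate: dict[str, float]) -> GateResult:
    """Block the change if any check scores lower than the accepted baseline."""
    scored = {name: candidate.get(name, 0.0) for name in baseline}
    base_agg = sum(baseline.values()) / len(baseline)
    cand_agg = sum(scored.values()) / len(scored)
    regressed = [name for name in baseline if scored[name] < baseline[name]]
    return GateResult(round(cand_agg, 2), round(cand_agg - base_agg, 2),
                      regressed, bool(regressed))


# "01-placeholder" and "02-placeholder" are hypothetical names.
baseline = {"01-placeholder": 1.0, "02-placeholder": 1.0,
            "03-format": format_score("ORDER-1234")}
candidate = {**baseline, "03-format": format_score("Order #1234")}
result = regression_gate(baseline, candidate)
print(result)
```

Blocking on any single regressed check, instead of on the aggregate alone, is a design choice in this sketch: it keeps a gain elsewhere from masking a break.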

The Loop That Gets Harder to Fool While You Sleep

Eight real outputs from a course assistant scored an aggregate of 0.57. Two failures became permanent test cases: one broken output, one mangled format. Those checks now live in suite/ and run on every subsequent change. The format break now fails in three places: the original gate plus the two write-backs. It cannot return without tripping three alarms.

This is the compounding mechanism. Every failure that reaches the gate gets added to the suite as a permanent check. The floor rises each time. The gate that started as four criteria and a bookmark gate grows into a full test suite over weeks, shaped entirely by real failures from your real work.
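One way the write-back could work, as a sketch only: each failure is saved as a small JSON file in suite/, and the whole folder is replayed against the feature on every change. The file layout, the case name, and the exact-match comparison are assumptions; a judged check would need a scorer in place of the equality test.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def write_back(suite_dir: Path, name: str, input_text: str, expected: str) -> Path:
    """Persist one real failure as a permanent check."""
    suite_dir.mkdir(parents=True, exist_ok=True)
    path = suite_dir / f"{name}.json"
    payload = {"input": input_text, "expected": expected}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def run_suite(suite_dir: Path, feature: Callable[[str], str]) -> dict[str, bool]:
    """Replay every saved case against the feature; True means it passed."""
    results = {}
    for path in sorted(suite_dir.glob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        results[path.stem] = feature(case["input"]) == case["expected"]
    return results


# Demo in a throwaway directory; the case name is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "suite"
    write_back(suite, "format-order-id", "look up order 1234", "ORDER-1234")
    v1_results = run_suite(suite, lambda _: "ORDER-1234")
    v2_results = run_suite(suite, lambda _: "Order #1234")

print(v1_results, v2_results)
```

Nothing is ever deleted from the folder in this sketch, which is the point: the suite only grows.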

The most important surface to watch: the AI assistant bolted onto your course, the chatbot answering student questions, the tool drafting replies in your name. That surface speaks to your audience every day. You wrote it once. Without a gate, you have no idea what it said yesterday.

The Decision That Protects Your Brand: The Gate Scores, It Never Rewrites

The gate’s stance is fixed. It is worth writing down where you can see it:

I am the gate, not the generator. I score drafts against the rubric. I return a verdict and a reason. I never edit the work to rescue a score. I report; the generator reworks.

Three reasons this is non-negotiable:

First: a grader that rewrites stops being a grader. The moment it edits the draft to push the score up, it is grading its own work. A judge that grades its own work always passes.

Second: the rework belongs with whatever made the draft. If the draft failed, the thing that wrote it needs to know why and try again. A grader that quietly patches the output hides the failure from the writer, so the writer never improves and the next draft fails the same way.

Third: separation is what lets you trust the number. When the writer and the grader are different things reading the same standard, the score means something.

If you want automatic rewriting, keep it outside the gate and let the folders enforce the boundary. The grader reads from inbox/ and files the draft, with its verdict, under scored/ or rejected/. It never edits in place. The rewriting, if it happens, is a separate step that picks up the rejected draft and tries again. The gate stays clean.
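A small sketch of the filing step, assuming the draft has already been scored. The function name and the sidecar verdict file are hypothetical. The one property that matters is that the draft is moved byte-for-byte and never opened for writing.

```python
import json
import shutil
import tempfile
from pathlib import Path


def file_verdict(draft: Path, aggregate: float, threshold: float, root: Path) -> Path:
    """Move a scored draft out of inbox/ unchanged and record the verdict beside it."""
    shipped = aggregate >= threshold
    dest_dir = root / ("scored" if shipped else "rejected")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / draft.name
    shutil.move(str(draft), str(dest))  # relocate only; the content is not touched
    record = {
        "candidate": draft.name,
        "aggregate": aggregate,
        "threshold": threshold,
        "verdict": "ship" if shipped else "kill",
    }
    sidecar = dest_dir / (draft.stem + ".verdict.json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return dest


# Demo in a throwaway directory.
original_text = "At the end of the day, the possibilities are endless.\n"
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "inbox").mkdir()
    draft = root / "inbox" / "candidate-b-slop.md"
    draft.write_text(original_text, encoding="utf-8")
    dest = file_verdict(draft, aggregate=0.16, threshold=0.7, root=root)
    landed_in = dest.parent.name
    untouched = dest.read_text(encoding="utf-8") == original_text
    inbox_empty = not any((root / "inbox").iterdir())

print(landed_in, untouched, inbox_empty)
```

A rewriter, if you add one, would read from rejected/ and drop its new attempt back into inbox/, so every rewrite faces the same gate as a first draft.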

One Judge, Many Standards: Genre Plus Platform

The scale trap: with 8 content types and 6 platforms, copying gives you 8 × 6 = 48 separate standards to maintain. Build one file per content type and one file per platform, compose them at scoring time, and you maintain 8 + 6 = 14.

A how-to scored for X (how-to ⊕ platform:x):

| Check                                     | Score | Comes from                        |
| structure (ordered steps, stated outcome) | 1.00  | how-to standard, checked by code  |
| completeness (a novice could finish)      | 0.85  | how-to standard, judged           |
| accessibility (plain language)            | 0.90  | how-to standard, judged           |
| first post stands alone                   | 0.80  | X platform standard, judged       |
| thread coherence (each post advances)     | 0.90  | X platform standard, judged       |

Aggregate 0.89, threshold 0.75. Verdict: ship.

Swap to LinkedIn: the how-to half stays identical. The X thread checks fall away. A new check comes in — does the hook survive the fold. One standard for the kind of writing, one for the place, composed at scoring time. A change to the how-to standard propagates to every platform without touching any platform file.
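Composition can be sketched as two lookup tables joined at scoring time. The data structure and names below are assumptions; the check names and scores are the ones from the how-to on X example, and the equal-weight mean is assumed because it reproduces the reported 0.89.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    name: str
    source: str   # the standard that contributes this check
    judged: bool  # False means checked by code


# One entry per content type, one per platform. Never one per combination.
GENRES = {
    "how-to": (
        Check("structure", "how-to", judged=False),
        Check("completeness", "how-to", judged=True),
        Check("accessibility", "how-to", judged=True),
    ),
}
PLATFORMS = {
    "x": (
        Check("first post stands alone", "x", judged=True),
        Check("thread coherence", "x", judged=True),
    ),
    "linkedin": (
        Check("hook survives the fold", "linkedin", judged=True),
    ),
}


def compose(genre: str, platform: str) -> tuple[Check, ...]:
    """Genre checks first, then platform checks."""
    return GENRES[genre] + PLATFORMS[platform]


def aggregate(scores: dict[str, float]) -> float:
    """Assumption: unweighted mean, rounded to two places."""
    return round(sum(scores.values()) / len(scores), 2)


x_checks = compose("how-to", "x")
x_scores = dict(zip((c.name for c in x_checks), (1.00, 0.85, 0.90, 0.80, 0.90)))
x_aggregate = aggregate(x_scores)
linkedin_checks = compose("how-to", "linkedin")
print(x_aggregate, [c.name for c in linkedin_checks])
```

Editing the how-to entry changes what every platform scores against, with no platform entry touched.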

What This Unlocks

For newsletters: score every draft before it sends. Paste the draft and your rubric into the gate. Anything below 0.7 goes back.

For AI chatbots and course assistants: sample 5–10 real outputs monthly. Score them. When aggregate drops, the suite catches it before a student does. Write every failure back as a permanent check.

For sales pages and offers: the bookmark gate is the first filter. If the page isn’t worth saving to act on later, clean prose cannot compensate. Score before you run traffic.

Key Takeaways

Slop is an output-side problem. Better prompts improve the odds; a gate holds the line.
Split scoring: deterministic checks in code (free, instant), judgment checks in the AI judge (paid, reserved for what code can’t catch).
The gate scores. It never rewrites. The moment it edits the work, you lose the only thing the gate was for.
Every failure becomes a permanent check. The suite grows stronger every week.
Compose genre and platform standards; don’t duplicate them. 14 files beat 48.

How to Start

Pull your five to ten best pieces into one folder. That is your answer key.
Write four criteria and weight them. Add a bookmark gate, the one question that caps a hollow piece: would a reader save this to act on later?
Score your next draft before you publish. Paste the draft and your rubric into a chat window and ask for a score on each criterion, with one reason each. Set the line at 0.7. Below the line, do not publish.
For any AI feature you run, write twenty real inputs with correct outputs. Re-run that set every time you change the feature.
Every time something bad slips out, add it as a permanent check. That is the floor rising while you sleep.

Version one is three sentences, not a repo.