
The Scoreboard Is the Moat

Here is a paragraph about productivity.

In today’s competitive environment, leveraging the right tools can transform the way you work and help you achieve your goals. By embracing a growth mindset and staying consistent, you can build momentum that compounds over time. The key is to start your journey today, take that first step, and trust the process. Success isn’t about perfection; it’s about progress. The possibilities are endless when you commit to showing up every day and giving your best.

It reads fine. The grammar is clean, the rhythm is smooth, nothing is misspelled. If it landed in your inbox you would skim it and forget it within the hour, but you would not call it broken.

A grader scored it 0.16 out of 1.0 and stamped it “kill.”

The same model also wrote a tight, genuine how-to: pick your best pieces, write four checks, score each draft against them. No filler, real steps a reader could follow. Same length, same model. That one scored 0.85 and was stamped “ship.”

The same model wrote both. That is the entire point. The model is not your edge. Everyone has the same model you do, and it produces the smooth, forgettable first paragraph by default. What you build on top of it — the thing that decides which output reaches a reader and which one dies in a folder — is the part nobody can copy. Generation is commoditized. The scoreboard is the moat.

This is a tutorial. By the end you will be able to turn your own taste into a number, gate your own publishing on that number, and grow a set of checks that gets stronger every week instead of weaker. You do not need to be an engineer. You need a few of your best pieces and an afternoon.


What Changes When You Have a Scoreboard

Right now, “good enough” is a feeling you have at 11pm.

You read the draft one more time, you are tired, it seems fine, you hit publish. Maybe it was fine. Maybe it was the smooth, forgettable paragraph and you were too close to it to tell. You have no way to know, because the standard for “good” lives in your head, and your head at 11pm is not a reliable instrument.

You fight slop on the input side. You sharpen the prompt, add examples, tell the model to “write like a human,” and reach for the same tools everyone else reaches for. That helps a little. It does not catch the draft that comes back smooth and empty, because a better prompt and a worse prompt both produce text that looks finished.

And there is a surface you never check at all. If you have an AI assistant bolted onto your course, a chatbot that answers student questions, a tool that drafts replies — that thing speaks to your audience every day in your name. You wrote it once. You have no idea what it said yesterday. When a change quietly makes it worse, nothing tells you. The bad answer ships in silence and you find out, if you find out, from a confused student three weeks later.

Your standard lives in your head, so you are the bottleneck. Nobody else can hit a bar they cannot see. You cannot hand the work off without watering it down, because the thing that makes it good is locked in your judgment, and your judgment does not come with the file. Your edge is real, and it is fragile, because it is invisible.

Here is the after.

Quality is a number. “Good” is a score with a line under it, and below the line nothing publishes. You own a named standard, written down, that you can point at and argue with. Every failure becomes a permanent check, so the same mistake cannot reach a reader twice. The thing that writes and the thing that grades read from the same standard. Hand the writing to someone else, or to a model, and the grade still holds the line for you. You delegate without dilution. The edge turns defensible.

The thesis is flat and it does not bend: generation depreciates, measurement appreciates. The draft you produced today is worth less tomorrow, because tomorrow the model produces it for free. The standard you encoded today is worth more tomorrow, because every failure you feed it makes it sharper.

This whole frame was sparked by Machina (@EXM7777) and his piece “How To Fix AI Slop.”


You Do Not Need a Repo on Day One

Read this before you talk yourself out of it: you do not need to code, and you do not need a repo on day one.

Version one of your scoreboard is three sentences of your taste, written down. That is a legitimate version one. “A good post gives the reader something to do, not just something to feel. It says one non-obvious thing. A non-expert can follow it without hitting a wall of jargon.” Those three sentences are a scoreboard. You can score a draft against them by hand in two minutes. Everything else in this tutorial is making that scoreboard sharper, faster, and harder to ignore.

You might be one coach, sending one newsletter a week, thinking this is for teams with a content pipeline. It is not. You need a scoreboard more than they do, not less, because you are the bottleneck. There is no second editor to catch your slip. Your taste is the only asset you have that compounds, and right now it compounds only inside your own head, where it dies when you are tired or busy or gone for a week. Writing it down is the first time it can work without you in the room.

Here are the six steps. Each one builds on the last.

  1. Turn “good” into a number.
  2. Split the scoring between code and judgment.
  3. Gate the publishing, and never rewrite.
  4. Catch the change that quietly made it worse.
  5. Make it compound.
  6. Scale it across the different kinds of writing you do.

Step 1: Turn “Good” into a Number

Start by writing down what “good” means, with weights, so that two drafts can be compared instead of just felt.

A feeling cannot be compared. You cannot say this draft is 12% better than that one when both are just “fine.” A number can be compared, ranked, tracked, and defended. The whole game is converting the thing in your head into a thing on a page that produces a score.

Here is what a starter scoreboard looks like:

# rubric.md — what "good" means here

threshold: 0.7            # nothing below this publishes

criteria:
  actionable   (weight 0.30)  # can the reader DO something after reading?
  accessible   (weight 0.20)  # can a non-expert follow it without a jargon wall?
  replicable   (weight 0.25)  # are there real, repeatable steps?
  novel        (weight 0.25)  # does it say something non-obvious?

meta_gate:
  bookmark_worthy            # would a reader save this to act on later?
  # if false, aggregate caps at 0.5 no matter the criteria scores

Four criteria, each weighted, adding to one. A line at 0.7. And a gate on top: if the piece is not worth saving, the score is capped at 0.5 no matter how clean the prose is. That gate is what would have killed the smooth productivity paragraph from the opening. It scored fine on readability and failed the only question that matters: would anyone ever save it to act on later.
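
The arithmetic behind that scoreboard fits in a few lines of Python. This is a minimal sketch, not a reference grader: the names `aggregate_score` and `verdict` and the sample scores are hypothetical, and it assumes the aggregate is the weighted sum of the four criteria, capped at 0.5 when the bookmark gate fails.

```python
# Weights, threshold, and cap copied from the starter rubric above.
WEIGHTS = {"actionable": 0.30, "accessible": 0.20, "replicable": 0.25, "novel": 0.25}
THRESHOLD = 0.7
BOOKMARK_CAP = 0.5


def aggregate_score(scores, bookmark_worthy):
    """Weighted sum of per-criterion scores, capped if the meta gate fails."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if not bookmark_worthy:
        total = min(total, BOOKMARK_CAP)
    return total


def verdict(scores, bookmark_worthy):
    """Anything at or above the line ships; everything else is killed."""
    return "ship" if aggregate_score(scores, bookmark_worthy) >= THRESHOLD else "kill"


# Hypothetical scores: clean prose that nobody would save.
polished_but_hollow = {"actionable": 0.9, "accessible": 0.9, "replicable": 0.9, "novel": 0.9}
print(aggregate_score(polished_but_hollow, bookmark_worthy=False))  # 0.5
print(verdict(polished_but_hollow, bookmark_worthy=False))          # kill
```

The cap does the work here: the same scores with `bookmark_worthy=True` would clear the line, so the gate is the only thing standing between a polished, hollow piece and a reader.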

Your criteria will differ. A newsletter writer weights “one clear idea” heavily. A tutorial writer weights “can the reader actually do this.” Pick four that match what you are actually trying to make.

One rule on the gold standard, and it is not optional: it must be your real work. Pull five to ten pieces you are genuinely proud of — the ones that got the reply, the share, the “this is exactly what I needed.” Do not write fresh examples to seed the answer key, and do not let a model invent them. The whole point is that the standard is yours.

By hand, on a Sunday. Open a document. Write the three sentences that describe your best work. List your five favorite things you have published underneath them. That document is your scoreboard and your answer key. You are done with Step 1.


Step 2: Split the Scoring Between Code and Judgment

Score every draft in two passes: cheap mechanical checks first, expensive judgment second, and never pay for judgment on something a machine could have caught.

Some checks are free and instant. Does the post have a headline? Is the link valid? Does the output match the exact format you asked for? A few lines of code answer those in a blink and cost nothing to run a thousand times. The judgment call — “is this actually insightful” — needs an AI judge, and every time that judge runs it costs money and a few seconds. Let the free checks throw out the obvious failures before the paid judge ever looks at the draft.

Two worked examples of what the mechanical side does:

Exact match. You expected ORDER-1234. The draft says ORDER-1234. Score 1.0. The draft says Order #1234 and the score is 0.0. No judgment, no debate, no cost.

Format validation on a batch. Six outputs through a JSON validity check. Four valid, two malformed. 0.667 — a fast rough signal that something in the batch is broken before you spend money grading content quality.
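
Both mechanical checks can be sketched with nothing but the Python standard library. The function names and the sample batch are hypothetical; only the ORDER-1234 strings and the four-of-six result come from the examples above.

```python
import json


def exact_match(expected, actual):
    """1.0 only when the strings are identical, otherwise 0.0."""
    return 1.0 if actual == expected else 0.0


def json_validity_rate(outputs):
    """Fraction of outputs that parse as JSON (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass  # malformed output counts as a failure
    return valid / len(outputs)


print(exact_match("ORDER-1234", "ORDER-1234"))   # 1.0
print(exact_match("ORDER-1234", "Order #1234"))  # 0.0

# Hypothetical batch: four valid JSON strings, two malformed.
batch = ['{"id": 1}', '{"id": 2}', '[1, 2]', '"ok"', '{"id": 3', "{id: 4}"]
print(round(json_validity_rate(batch), 3))       # 0.667
```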

The split matters more than it looks. Every check you can write as code is a check you never pay to run again.

Now the obvious objection: if the judge is itself a model, and models wobble, why trust its number at all? Because you calibrate it before you rely on it. Run your scoreboard against your gold standard and against a handful of pieces you know are slop. If it scores your best work high and the junk low, reliably, across a few runs, the number means something. If it does not, you fix the scoreboard until it does, before you let it gate anything.
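
One way to make "reliably, across a few runs" concrete is a small calibration check. This is a sketch under assumptions: the function name `is_calibrated` and every score below are hypothetical, and it treats the scoreboard as calibrated only when every gold piece clears the line and every slop piece misses it on every run.

```python
def is_calibrated(gold_runs, slop_runs, threshold=0.7):
    """True when gold always clears the line and slop always misses it."""
    gold_ok = all(score >= threshold for run in gold_runs for score in run)
    slop_ok = all(score < threshold for run in slop_runs for score in run)
    return gold_ok and slop_ok


# Hypothetical judge scores: each inner list is one run over the same pieces.
gold_runs = [[0.85, 0.90, 0.78], [0.83, 0.88, 0.80], [0.86, 0.91, 0.76]]
slop_runs = [[0.16, 0.30], [0.20, 0.25], [0.14, 0.33]]
print(is_calibrated(gold_runs, slop_runs))    # True

# One gold piece dips under the line on one run: fix the scoreboard first.
wobbly_gold = [[0.85, 0.90, 0.78], [0.83, 0.65, 0.80]]
print(is_calibrated(wobbly_gold, slop_runs))  # False
```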

By hand, on a Sunday. Settle deterministic checks with your eyes in ten seconds: does it have a headline, does it have a real example, does it have a clear next step. Save your actual attention for the one question a glance cannot answer: is this worth a reader’s time.


Step 3: Gate the Publishing, and Never Rewrite

The grader has one job: decide whether a draft publishes. It does not fix the draft. Ever.

The grader’s stance is fixed:

I am the gate, not the generator. I score drafts against the rubric. I return a verdict and a reason. I never edit the work to rescue a score. I report; the generator reworks.

That refusal to rewrite is deliberate, for three reasons.

First, a grader that rewrites stops being a grader. The moment it edits the draft to push the score up, it is grading its own work, and a judge that grades its own work always passes.

Second, the rework belongs with whatever made the draft. If the draft failed, the thing that wrote it needs to know why and try again. A grader that quietly patches the output hides the failure from the writer.

Third, separation is what lets you trust the number. When the writer and the grader are different things reading the same standard, the score means something.

Full verdict for the smooth productivity paragraph from the opening:

{
  "candidate": "candidate-b-slop.md",
  "per_criterion": {
    "actionable": {"score": 0.05, "reason": "No action the reader can take; pure exhortation."},
    "accessible": {"score": 0.60, "reason": "Readable, but empty."},
    "replicable": {"score": 0.00, "reason": "No steps. Inspirational, not structured."},
    "novel": {"score": 0.00, "reason": "Every sentence is a filler cliche."}
  },
  "meta": {"bookmark_worthy": false},
  "aggregate": 0.16,
  "threshold": 0.7,
  "verdict": "kill"
}
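
A gate that reports without rewriting can be sketched as a function that never receives the draft text at all, only the judge's scores and reasons. The names `Verdict` and `gate` are hypothetical. The sketch assumes the aggregate is the plain mean of the four scores, which reproduces the 0.16 shown; a weighted sum using the Step 1 weights would give 0.135, still a kill.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    aggregate: float
    verdict: str
    reasons: dict


def gate(per_criterion, bookmark_worthy, threshold=0.7, cap=0.5):
    """Score and report. The draft is not an argument, so it cannot be edited here."""
    aggregate = mean(item["score"] for item in per_criterion.values())
    if not bookmark_worthy:
        aggregate = min(aggregate, cap)
    label = "ship" if aggregate >= threshold else "kill"
    reasons = {name: item["reason"] for name, item in per_criterion.items()}
    return Verdict(round(aggregate, 2), label, reasons)


# Scores and reasons copied from the sample verdict above.
per_criterion = {
    "actionable": {"score": 0.05, "reason": "No action the reader can take; pure exhortation."},
    "accessible": {"score": 0.60, "reason": "Readable, but empty."},
    "replicable": {"score": 0.00, "reason": "No steps. Inspirational, not structured."},
    "novel": {"score": 0.00, "reason": "Every sentence is a filler cliche."},
}
result = gate(per_criterion, bookmark_worthy=False)
print(result.aggregate, result.verdict)  # 0.16 kill
```

The reasons travel with the verdict so that whatever wrote the draft knows which criterion to rework.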

By hand, on a Sunday. Paste your draft and your three-sentence scoreboard into a chat window. Ask the model to score the draft against each criterion from 0 to 1, with one reason each. If it lands below your line, do not publish. Fix the lowest-scoring criterion and run it again.


Step 4: Catch the Change That Quietly Made It Worse

Keep a growing set of pass/fail checks and run all of them on every change, so the change that quietly made something worse gets caught the moment it happens.

The trap: when a check goes red, the tempting move is to loosen the check until it goes green. Do not. A check you loosen to pass is a check that no longer protects you.

Example catch — a format change in pipeline v2:

| 2026-06-01 | baseline accepted (pipeline v1) | 1.00 | —     | accepted |
| 2026-06-01 | regression-gate (pipeline v2)   | 0.67 | -0.33 | BLOCKED — format regressed 1.00→0.00 |

Version 2 emitted Order #1234 where the standard expected ORDER-1234. The check held the line.
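
A regression gate of this kind can be sketched as a per-check comparison against the accepted baseline. The function name and the check names `has_headline` and `link_valid` are hypothetical; only the format check's drop from 1.00 to 0.00 comes from the log above. Three equally weighted checks are assumed because that reproduces the 0.67 and -0.33 shown.

```python
def regression_gate(baseline, candidate):
    """Block the change if any check scores below its accepted baseline."""
    regressed = {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if candidate.get(name, 0.0) < baseline[name]
    }
    base_avg = sum(baseline.values()) / len(baseline)
    cand_avg = sum(candidate.get(name, 0.0) for name in baseline) / len(baseline)
    return {
        "score": round(cand_avg, 2),
        "delta": round(cand_avg - base_avg, 2),
        "status": "BLOCKED" if regressed else "accepted",
        "regressed": regressed,
    }


v1 = {"format": 1.0, "has_headline": 1.0, "link_valid": 1.0}  # accepted baseline
v2 = {"format": 0.0, "has_headline": 1.0, "link_valid": 1.0}  # emits "Order #1234"
report = regression_gate(v1, v2)
print(report["score"], report["delta"], report["status"])  # 0.67 -0.33 BLOCKED
print(report["regressed"])                                 # {'format': (1.0, 0.0)}
```

Note that the gate has no parameter for loosening a check. The only way to turn it green is to fix the pipeline.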

By hand, on a Sunday. Keep a folder of five drafts you have already scored, with scores written down. Next time you change your prompt or the model gets an update, run those same five through your scoreboard again. If a number that used to be high comes back low, you caught a regression before a reader did.


Step 5: Make It Compound

Watch what you actually publish — including the surfaces you never look at — and feed every failure back into the check set so the same failure can never reach a reader again.

“Write-back” is the mechanism: take a real failure and turn it into a permanent check. The bad output that slipped through yesterday becomes a check that fails on purpose today, forever.

Example: eight real samples from a course assistant scored at 0.57. Two failures turned into permanent checks — one broken output, one mangled format. Now the format break fails in three places: the original gate plus the two write-backs. It cannot return without tripping three alarms.
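
Write-back can be sketched as appending the real failure to the check set as a permanent exact-match case. Everything here is hypothetical scaffolding: the function names, the sample question, and the two stand-in assistants. A real check set would live in a file rather than a list in memory.

```python
def write_back(check_set, failure_input, expected_output, note):
    """Turn a real failure into a permanent case. Cases are never removed."""
    case = {"input": failure_input, "expected": expected_output, "note": note}
    if case not in check_set:
        check_set.append(case)
    return check_set


def run_checks(check_set, generate):
    """Run every stored case through the generator; return the ones that fail."""
    return [case for case in check_set if generate(case["input"]) != case["expected"]]


check_set = []
write_back(check_set, "What is my order id?", "ORDER-1234",
           "assistant emitted 'Order #1234'")


def regressed_assistant(prompt):
    return "Order #1234"   # the mangled format that slipped through


def fixed_assistant(prompt):
    return "ORDER-1234"


print(len(run_checks(check_set, regressed_assistant)))  # 1
print(len(run_checks(check_set, fixed_assistant)))      # 0
```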

The most important surface to watch: the AI assistant on your course, the chatbot answering students, the tool drafting replies in your name. That surface speaks to your audience every day. You wrote it once and never checked it again.

By hand, on a Sunday. Once a month, paste five recently published pieces, plus five real answers your course assistant actually gave, back through your scoreboard. When one scores low, write in one sentence why it failed and add that sentence to your scoreboard as a new check. Even done by hand, the floor still rises.


Step 6: Scale It Across the Kinds of Writing You Do

When you write more than one kind of thing for more than one place, do not build a separate scoreboard for each combination. Build one scoreboard per kind of writing, one per place, and compose them.

The math: 8 content types × 6 platforms = 48 separate standards if you copy. 8 + 6 = 14 composed standards. A change to one propagates to all consumers without touching the others.

Composed verdict example — a how-to scored for X:

| Check                                     | Score | Comes from                       |
| structure (ordered steps, stated outcome) | 1.00  | how-to standard, checked by code |
| completeness (a novice could finish)      | 0.85  | how-to standard, judged          |
| accessibility (plain language)            | 0.90  | how-to standard, judged          |
| first post stands alone                   | 0.80  | X platform standard, judged      |
| thread coherence (each post advances)     | 0.90  | X platform standard, judged      |

Aggregate 0.89, threshold 0.75. Verdict: ship.

Swap platform to LinkedIn: the how-to half stays identical; the X thread checks fall away; a new check comes in — does the hook survive the fold. One standard for the kind of writing, one standard for the place, composed at scoring time.
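
Composition at scoring time can be sketched as two small lookup tables joined when a draft is scored. The dictionary and function names are hypothetical, and the sketch assumes the aggregate is the plain mean of the composed checks, which matches the 0.89 in the example above.

```python
# One standard per kind of writing, one per place: 8 + 6 entries, not 8 x 6.
KIND_STANDARDS = {
    "how-to": ["structure", "completeness", "accessibility"],
}
PLATFORM_STANDARDS = {
    "x": ["first post stands alone", "thread coherence"],
    "linkedin": ["hook survives the fold"],
}


def compose(kind, platform):
    """Build the check list for one kind of writing on one platform."""
    return KIND_STANDARDS[kind] + PLATFORM_STANDARDS[platform]


def composed_verdict(scores, kind, platform, threshold=0.75):
    """Average the composed checks and compare against the line."""
    checks = compose(kind, platform)
    aggregate = round(sum(scores[name] for name in checks) / len(checks), 2)
    return aggregate, "ship" if aggregate >= threshold else "kill"


# Scores copied from the composed verdict example above.
scores = {
    "structure": 1.00,
    "completeness": 0.85,
    "accessibility": 0.90,
    "first post stands alone": 0.80,
    "thread coherence": 0.90,
}
print(composed_verdict(scores, "how-to", "x"))  # (0.89, 'ship')
print(compose("how-to", "linkedin"))
# ['structure', 'completeness', 'accessibility', 'hook survives the fold']
```

Editing the `"how-to"` entry changes the how-to half of every platform's verdict at once, which is the propagation the 8 + 6 arithmetic buys you.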


The Moat Is One Shape

The moat has three parts working together: a named standard, a check set that compounds, and a closed loop that feeds real failures back into the standard.

Take any one of the three away and you do not have a moat. You have a checklist that goes stale.

Measurement is not the only moat, and it is not the biggest. Distribution is a moat. Audience trust is a moat. A competitor with a worse scoreboard and a bigger list can still beat you. But measurement is the moat that compounds on assets you already own — your best work and your real failures — and it is the one almost everyone ignores.

Generation depreciates: the draft is worth less the moment the model can produce it for free, which is now. Measurement appreciates: the scoreboard is worth more every time it catches something, and it catches more every week. Your competitors have the same model you do. What they cannot generate is your scoreboard.


How to Start Today

Five steps to have version one running before the end of the week:

  1. Pull your five to ten best pieces into one folder — the ones that got the reply, the share, the “this is exactly what I needed.” That folder is your answer key.
  2. Write four criteria and a bookmark gate. Name what makes your work good, weight the four, and add the one question that caps a hollow piece: would a reader save this to act on later?
  3. Score your next draft before you publish it. Paste the draft and your criteria into a chat window, ask for a score on each criterion with one reason each, put the line at 0.7, and if it comes in under, do not publish.
  4. For any AI feature you run, write twenty real inputs and the correct output for each. Re-run that set every time you change the feature.
  5. Every time something bad slips out, add it as a permanent check. That is the line rising while you sleep, one real failure at a time.

Version one is three sentences, not a repo. Write the three sentences today.