A CI failure playbook for a one-person Rails project

I’ve been treating CI on blog-manager like a personal automated nag: green check or fix it. That works fine right up until something breaks on main and I have to remember what my own rules are. So back in early May I sat down to write the rules out as docs/ci.md, a CI failure playbook for a project where I’m the entire team. This is a snapshot of that day; the playbook has grown since.

Three things had to be in there:

1. What CI actually runs.
2. What I do when it goes red.
3. What “merging is blocked” means without branch protection.

That third one was the surprise.

The plan, before I started

A Plane issue listed five decisions to capture: branch protection, flake policy, notifications, local repro, rollback rule. I assumed branch protection would be the easy one. Flip the toggle, require green CI on main, done. GitHub disagreed.

Upgrade to GitHub Pro or make this repository public to enable this feature.

GitHub doesn’t enforce required status checks on private repos under personal accounts unless you pay for Pro. The repo is private (it has a Gitea registry password and a few API tokens in seeded fixtures), and I’m not flipping it public for a $4/month feature. So branch protection went into the playbook as deferred, with rationale:

Branch protection status: deferred. GitHub Pro is required to enforce required status checks on private repos under a personal account. The cost isn’t justified for a solo-dev project. Revisit if the repo moves to a GitHub org or goes public.

That means the merge gate is self-enforced. The CLAUDE.md rule (never merge a red PR) and a five-line note in docs/ci.md are the gate. It’s a load-bearing comment, but it’s accurate. I’m the only one who can break it, and the only one who has to live with the breakage.
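One way to make a self-enforced gate harder to skip is to put the check and the merge behind a single command. The following is only a sketch, not something the playbook contains: `merge_if_green` is a hypothetical helper, the squash strategy is an assumption, and it relies on `gh pr checks` exiting non-zero when a check has failed or is still pending.

```ruby
# Hypothetical helper: merge a PR only when `gh pr checks` exits 0.
# The runner is injectable so the logic can be exercised without calling gh.
def merge_if_green(pr_number, runner: ->(*args) { system(*args) })
  number = pr_number.to_s

  # A non-zero exit (failed or pending checks) means the gate stays shut.
  return :blocked unless runner.call("gh", "pr", "checks", number)

  # --squash is an assumed merge strategy; swap in whatever the repo uses.
  runner.call("gh", "pr", "merge", number, "--squash") ? :merged : :merge_failed
end
```

Calling `merge_if_green(42)` from a small script would then return `:blocked` on a red PR instead of merging it. It does not replace branch protection, since nothing stops a direct merge in the web UI.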

Flakes: skip plus an issue, no retry loops

Old habit: re-run a flaky job until it goes green, then move on. The new rule is one manual retry. If it still fails, quarantine it.

def test_something_flaky
  skip "flaky - see BLG-123"
  # ...
end

The trick is making the skip visible. A skip with an issue tag on it shows up in test output and on the issue board. A re-run loop hides the problem in run history that nobody checks. I’d rather have an obvious yellow than an invisible green.
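The “skip plus an issue” rule can be checked mechanically. Below is a minimal sketch that scans test source for skip calls with no issue key on the same line. `untracked_skips` is a hypothetical name, and the key pattern (capital letters, a dash, digits, as in BLG-123) is an assumption about how the tracker formats its keys.

```ruby
# Assumed issue-key shape, e.g. BLG-123.
ISSUE_KEY = /\b[A-Z]+-\d+\b/
# A line whose first token is a call to skip.
SKIP_CALL = /^\s*skip\b/

# Returns [line_number, stripped_line] pairs for every skip call
# that does not mention an issue key on the same line.
def untracked_skips(source)
  source.each_line.with_index(1).filter_map do |line, number|
    [number, line.strip] if line.match?(SKIP_CALL) && !line.match?(ISSUE_KEY)
  end
end
```

Run over each file under test/, an empty result means every quarantined test points at an issue.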

I almost shipped a version where skip sat at the class level, outside any test method. That raises NoMethodError as soon as the file loads instead of skipping anything, because skip is an instance method that only works inside a test body. Code review caught it. Worth flagging because it’s an easy foot-gun.
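To make the difference concrete, here is a small sketch using plain Minitest (not the project’s actual test classes, which presumably inherit from ActiveSupport::TestCase). The class and helper names are made up for illustration.

```ruby
require "minitest"

# Correct placement: skip runs inside the test body, so Minitest records a skip.
class QuarantineExampleTest < Minitest::Test
  def test_something_flaky
    skip "flaky - see BLG-123"
  end
end

# Runs the quarantined test directly and returns its result object.
def run_quarantined_example
  QuarantineExampleTest.new("test_something_flaky").run
end

# Wrong placement: in a class body, self is the class itself, which has no
# skip method, so defining the class raises. Returns the error, or nil.
def class_level_skip_error
  Class.new(Minitest::Test) { skip "flaky - see BLG-123" }
  nil
rescue NoMethodError => e
  e
end
```

The first form reports a skipped test. The second never gets as far as running anything, because the error is raised while the class is being defined.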

The rollback rule I rewrote the same day

My first version of the playbook had a flat rule: if main is red and a clean fix isn’t ready within an hour, revert the offending commit. git revert <sha>, push, merge, fix properly on a branch. Simple.

Then I read it back and it was wrong for this project. blog-manager isn’t in production yet. There are no users to protect from a red main. Most of my breakages are half-finished infrastructure: a runner that isn’t registered, a missing dependency, a workflow I’m still wiring up. Reverting those hides the problem instead of solving it. For a pre-production solo app, the honest default is fix forward.

So I rewrote the section a few hours later. The 1-hour revert rule still exists, but it only kicks in when all three of these are true: the app is in production with real users affected, a clean fix will take more than an hour, and the breakage is user-facing. Until then, fix forward.
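Written as a predicate, the rewritten rule looks roughly like this. It is a sketch for illustration only: the method and keyword names are hypothetical, and the playbook itself states the rule in prose.

```ruby
# One hour, per the rule above.
REVERT_THRESHOLD_MINUTES = 60

# True only when all three conditions hold; otherwise the default is fix forward.
def revert_instead_of_fixing_forward?(in_production:, user_facing:, fix_eta_minutes:)
  in_production && user_facing && fix_eta_minutes > REVERT_THRESHOLD_MINUTES
end
```

For blog-manager today, `in_production` is false, so the answer is always fix forward, however long the fix takes.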

The reason the rule is written down at all is the temptation that shows up the moment you’ve broken main: “I’ll just push the fix in a minute.” Sometimes that minute turns into two hours, and by then bisects are harder and every new commit lands on a broken base. Writing down the condition is how I stop myself from negotiating with it at 2am.

Local repro that mirrors CI

The pre-push check I actually run:

bin/brakeman --no-pager
bin/bundler-audit
bin/importmap audit
bin/rubocop
bin/rails db:test:prepare test
bin/rails db:test:prepare test:system

Six commands, about 90 seconds on this laptop, mapping to the five jobs CI runs in parallel (Brakeman and bundler-audit share one job). The point isn’t to replace CI. It’s to catch the dumb stuff, a stray RuboCop violation or an unused import, before the runner has to.
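The six commands could be wrapped in one script that stops at the first failure. This is a sketch under assumptions: `CHECKS` and `first_failing_check` are hypothetical names, and the commands run one after another here rather than in parallel as they do in CI.

```ruby
# The same six commands as the pre-push list above, in the same order.
CHECKS = [
  "bin/brakeman --no-pager",
  "bin/bundler-audit",
  "bin/importmap audit",
  "bin/rubocop",
  "bin/rails db:test:prepare test",
  "bin/rails db:test:prepare test:system"
].freeze

# Runs each command in order and returns the first one that fails,
# or nil when everything passes. system returns false on a non-zero exit
# and nil when the command cannot be started; both count as failures.
def first_failing_check(commands, runner: ->(cmd) { system(cmd) })
  commands.find { |cmd| !runner.call(cmd) }
end
```

A script at, say, bin/prepush could call `first_failing_check(CHECKS)` and `abort` with the returned command when it is not nil, so the slow system tests never start if RuboCop has already failed.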

What I’d do differently

The playbook started short, about 80 lines of Markdown, and that was intentional: the longer it gets, the less likely I am to read it during a red-main moment. The next thing I want to add is a “first 5 minutes” cheat-sheet at the very top: the one or two gh commands that let me decide flake-vs-real without scrolling.

The other follow-up was auto-deploy to staging on every merge to main, which I built the next day. It got its own section in the playbook, which is most of why the doc has roughly doubled in length since.
