If the guardrail is a line in a prompt, the model can ignore it ... and eventually will

I learned this building an agent pipeline. An engineer agent would implement a feature, then a reviewer agent was supposed to check the work. More often than not, the review got skipped. So I added a line to the agent instructions: "All changes must be reviewed. Do not skip this step." It worked for a while. Then other instructions started getting skipped instead. I was playing whack-a-mole with an LLM.

The problem is structural. LLMs are probabilistic, so they do not follow instructions to the letter. Research on long-context prompts shows accuracy drops by more than 30% when critical information lands in the middle of the context. The longer the agent runs, the less weight your prompt carries. Telling a model "never delete files" in a system prompt is a suggestion it is free to forget by step twelve.

Real guardrails live outside the thing they constrain. A pre-push hook that blocks force pushes. A linter that fails the build. A file permission that denies access. An interrupt layer that pauses an action until a human approves it. The agent cannot ignore these because the agent does not get to choose. The check runs in code, not in a prompt.
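What that looks like in code, as an added example that is not from the original post or any particular agent framework: a tool gateway that sits between the model and its tools, so the review requirement, the workspace boundary, the absence of a delete capability, and a human-approval interrupt for force pushes are all enforced by the dispatcher rather than by prompt text. The role names, tool names and the `approver` callback are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Role -> tools that role may call. There is no delete tool at all, so "never
# delete files" needs no prompt line; deny-by-default covers everything else.
ALLOWED = {"engineer": {"write_file", "merge", "force_push"}, "reviewer": {"approve_change"}}


class PolicyViolation(Exception):
    """A tool call failed a code-level check; the model cannot talk its way past it."""


class ToolGateway:
    """All agent tool calls are dispatched through here, so checks run even if the prompt was forgotten."""

    def __init__(self, approver):
        self.approver = approver   # human-in-the-loop callback: (action, detail) -> bool (assumed)
        self.reviewed = set()      # change ids a reviewer agent has signed off on
        self.audit = []            # decision log for later inspection

    def call(self, actor, tool, **args):
        self.actor = actor
        if tool not in ALLOWED.get(actor, set()):       # least privilege per role
            return self._deny(tool, args, "tool not allowed for this role")
        return getattr(self, f"tool_{tool}")(**args)

    def _deny(self, tool, args, reason):
        self.audit.append(("denied", self.actor, tool, args, reason))
        raise PolicyViolation(f"{self.actor} -> {tool}: {reason}")

    def tool_write_file(self, path, content):
        rel = PurePosixPath(path)
        if rel.is_absolute() or ".." in rel.parts:     # confine writes to the workspace
            return self._deny("write_file", {"path": path}, "path escapes workspace")
        self.audit.append(("wrote", str(rel), len(content)))
        return str(rel)

    def tool_approve_change(self, change_id):
        self.reviewed.add(change_id)
        return True

    def tool_merge(self, change_id):
        # "All changes must be reviewed" as a precondition checked here, not a line in a prompt.
        if change_id not in self.reviewed:
            return self._deny("merge", {"change_id": change_id}, "no reviewer sign-off")
        self.audit.append(("merged", change_id))
        return True

    def tool_force_push(self, branch):
        # Interrupt layer: the action waits until the human callback returns True.
        if not self.approver("force_push", branch):
            return self._deny("force_push", {"branch": branch}, "human approval withheld")
        self.audit.append(("force_pushed", branch))
        return True
```

The engineer role cannot approve its own change, and an unknown tool such as `delete_file` is refused before any handler is looked up.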

This is the same lesson every engineer learns about input validation. You don't trust the client, you validate on the server. With LLMs, the prompt is the client. If you want a constraint to hold, put it somewhere the model can't reach.