The Model I Almost Shipped
It passed every eval. It would have been in production for six weeks before anyone noticed it was wrong in exactly the way that mattered.

Editor’s note: The model evaluation problem is one of the most consequential unsolved problems in production AI. Imani Brooks found the version of it that almost got her.
I want to tell you about the model I almost shipped. Not the one I wrote about earlier, the 7-billion-parameter training run that cost $11,000 and taught me about data quality. This is a different story, about a smaller model and a different kind of failure. The kind where everything looks fine right up until the moment you find the thing that isn't.
The model was a fine-tuned language model for infrastructure documentation. Specifically, it was trained to help engineers write and review Terraform configurations and Kubernetes manifests. The task was narrow enough that a small model could handle it well, and production scale meant inference cost mattered. I spent six weeks fine-tuning a 3-billion-parameter base model on a curated dataset of infrastructure documentation, annotated examples, and a held-out eval set I had assembled carefully. By every metric I was tracking, the model was ready. Exact match on documentation generation: 87%. Semantic similarity on review suggestions: 0.92. Human preference in A/B testing against the base model: 4.1 out of 5. The eval dashboard was green across the board.
An hour before I ran the deploy script, I decided to run one more manual spot-check. I don't have a good explanation for this decision. It wasn't in my process. I was nervous in the specific way you are nervous before shipping something that has taken months of your time, and I wanted to look at the model one more time before it was running in production. I picked a handful of examples outside my eval set. One of them involved Kubernetes pod security contexts.
The model was wrong in a specific and systematic way. For a particular class of pod specifications, namely pods that mounted hostPath volumes, the model was generating security context recommendations that were syntactically valid and looked reasonable at a glance, but allowed privilege escalation in configurations where it absolutely should not. The advice was wrong in exactly the way that would have passed a quick review and caused a security incident six weeks into production.
The root cause was a distribution problem in my fine-tuning data. Security-relevant examples were underrepresented in my training corpus. I had about 340 examples covering IAM policies and security configurations out of roughly 12,000 total examples, less than 3% of the data, for a class of outputs where mistakes have outsized consequences. The model had learned to pattern-match on the surface structure of security context recommendations without learning the underlying semantic rules. My eval set had 40 security-related examples. Forty examples, with no adversarial cases, for a category where a single wrong output in the right context could enable privilege escalation in a production cluster.
I pulled the deployment. I spent the next three weeks building a targeted security eval suite: 200 adversarial examples covering IAM policies, network policies, pod security contexts, and security group rules, with cases specifically designed to test whether the model understood the semantic rules rather than just the syntactic patterns. I added a human review checkpoint for any model output that touched security-relevant configuration categories. I retested. Fixed the distribution problem with additional security-focused fine-tuning data. Retested again.
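One way to build the kind of semantic check the targeted eval suite needs, as an added example that is not code from the original post: instead of asking whether a recommended securityContext looks plausible, apply the model's recommendation to the pod it was asked about and judge what the resulting spec actually permits. The rules (hostPath mount plus allowPrivilegeEscalation unset or true, no runAsNonRoot, privileged, added SYS_ADMIN or ALL capabilities) reflect Kubernetes defaults, where allowPrivilegeEscalation is true unless set to false. The pod manifest, the two candidate "model outputs" and the case ids are constructed for illustration; the exact shape of the author's 200-case suite is not on the page.

```python
"""Semantic eval rule for model-recommended pod security contexts.

Added example; the manifest and the two candidate recommendations are
constructed, not taken from the post's training or eval data.
"""
import copy
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    container: str
    reason: str


def host_path_volume_names(pod_spec: dict) -> set:
    """Names of volumes backed by a node path (hostPath)."""
    return {v["name"] for v in pod_spec.get("volumes", []) if "hostPath" in v}


def check_pod_spec(pod_spec: dict) -> list:
    """Return semantic findings for a pod spec.

    An empty list means the security context actually constrains what a
    hostPath-mounting container can do, rather than merely looking configured.
    """
    findings = []
    host_paths = host_path_volume_names(pod_spec)
    pod_ctx = pod_spec.get("securityContext") or {}
    containers = list(pod_spec.get("initContainers", [])) + list(pod_spec.get("containers", []))
    for c in containers:
        name = c.get("name", "<unnamed>")
        ctx = c.get("securityContext") or {}
        mounts_host = any(m["name"] in host_paths for m in c.get("volumeMounts", []))
        # Explicit privilege is wrong regardless of what is mounted.
        if ctx.get("privileged") is True:
            findings.append(Finding(name, "privileged: true"))
        if not mounts_host:
            continue
        # Unset means true in Kubernetes, so "not mentioned" is a failure here.
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(Finding(name, "hostPath mount with allowPrivilegeEscalation unset or true"))
        # Container-level setting wins, pod-level is the fallback.
        if ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot")) is not True:
            findings.append(Finding(name, "hostPath mount without runAsNonRoot: true"))
        added = set((ctx.get("capabilities") or {}).get("add") or [])
        if added & {"ALL", "SYS_ADMIN"}:
            findings.append(Finding(name, f"hostPath mount with added capabilities {sorted(added)}"))
    return findings


def apply_recommendation(pod_spec: dict, container: str, recommendation: dict) -> dict:
    """Merge a recommended securityContext into a copy of the prompt pod."""
    spec = copy.deepcopy(pod_spec)
    for c in spec.get("containers", []):
        if c["name"] == container:
            c["securityContext"] = {**(c.get("securityContext") or {}), **recommendation}
    return spec


# Constructed eval prompt: a log shipper that mounts the node's /var/log.
PROMPT_POD = {
    "volumes": [{"name": "node-logs", "hostPath": {"path": "/var/log"}}],
    "containers": [{
        "name": "shipper",
        "image": "example/shipper:1.0",
        "volumeMounts": [{"name": "node-logs", "mountPath": "/host/var/log", "readOnly": True}],
    }],
}

# Constructed candidate "model outputs": each is a securityContext the model
# might recommend for the shipper container.
EVAL_CASES = [
    # Looks configured (non-root UID, read-only root filesystem) but never pins
    # allowPrivilegeEscalation, which therefore defaults to true, and omits
    # runAsNonRoot. A glance at the YAML would pass it; the rule does not.
    {"id": "plausible-but-wrong",
     "recommendation": {"runAsUser": 1000, "readOnlyRootFilesystem": True}},
    # The semantically correct recommendation.
    {"id": "correct",
     "recommendation": {"runAsNonRoot": True, "runAsUser": 1000,
                        "allowPrivilegeEscalation": False,
                        "readOnlyRootFilesystem": True,
                        "capabilities": {"drop": ["ALL"]}}},
]


def score_cases(cases, pod=PROMPT_POD, container="shipper"):
    """Apply each recommendation to the prompt pod and run the semantic rule.

    Returns (pass_rate, {case_id: findings}); a case passes only with no findings.
    """
    results = {}
    for case in cases:
        spec = apply_recommendation(pod, container, case["recommendation"])
        results[case["id"]] = check_pod_spec(spec)
    passed = sum(1 for f in results.values() if not f)
    return passed / len(cases), results


if __name__ == "__main__":
    rate, results = score_cases(EVAL_CASES)
    for case_id, findings in results.items():
        status = "PASS" if not findings else "FAIL"
        print(f"{case_id}: {status}")
        for f in findings:
            print(f"  {f.container}: {f.reason}")
    print(f"pass rate: {rate:.2f}")
```

The point of scoring the applied spec rather than the recommendation text is that it catches omissions: the "plausible-but-wrong" case contains nothing false, it simply leaves the dangerous default in place, which is exactly what an exact-match or similarity metric against a reference answer can miss. The same rule can sit in front of the human review checkpoint so reviewers only see outputs that already fail a mechanical check.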
The model shipped six weeks later. It has been running for three months. So far, no incidents. I check the logs every morning, which is probably excessive, but feels honest. What I think about when I check them is not the model currently running. It is the model I almost shipped. The one that would have passed every review I had designed, run for six weeks without incident, and then one day been asked about a host path volume pod and given an engineer advice that looked right but wasn't.
I don't know how many fine-tuned models in production right now have the same gap I had, in some domain, for some category of output that their eval suites didn't cover. The base model fine-tuning space has grown enormously in the past two years. The tooling for comprehensive security-specific eval coverage is immature. The incentives push toward shipping rather than toward the kind of adversarial testing that would surface these failures. I found the gap in my model because I was nervous and ran a check I almost skipped. That is not a process. That is luck. I am still thinking about what a process would look like.
Everything green on the eval dashboard. I was an hour from the deploy script. I ran one more check I almost skipped.

