She Made On-Call Boring
For eight months, Yael Ben-David and her team built the machinery to make a 3 a.m. page feel like nothing. Boring was the goal. Here is what it took.

Editor’s note: The on-call problem gets discussed constantly and solved rarely. We wanted the story of what actually fixing it looks like, from the engineers who did it.
The page came in at 3:07 a.m. on a Wednesday. For most of the engineers on the on-call rotation at this mid-size fintech, that used to mean a familiar sequence: wake, unlock phone, open PagerDuty, feel the specific dread of a P1 at an hour when the brain has no defenses. Then begin the lonely work of figuring out what broke while the rest of the company slept.
That night, nothing went the way it used to. The page fired, but an automated runbook triaged it immediately: it identified the affected service, pulled the last three deploy diffs, and posted a summary to a dedicated incident channel. By the time Yael Ben-David, the SRE on primary, opened her laptop, she had a diagnosis, two likely remediation paths, and a secondary on standby who had already been paged as backup. Not as an afterthought. As a designed part of the response.
"We spent about eight months building the machinery that made that night boring," Ben-David says. "Boring is the goal. Boring means we did our jobs." The shift she is describing is not just technical. It is cultural, a wholesale rethinking of what reliable systems require from the people who maintain them.
The heroism model of on-call, where reliability depended on a small number of people willing to absorb unlimited interruption, is being replaced by something more deliberate and, critically, more survivable. The change is being led disproportionately by women who have been clearest-eyed about what the old model was actually costing.
The data, when companies have bothered to collect it, is damning. Studies on on-call engineers have found significant correlations between unstructured after-hours alerting and burnout, attrition, and degraded cognitive performance the following day. The engineers who last longest in reliability roles are increasingly the ones at companies that have invested in reducing the cognitive load of being paged. Not just the frequency of pages. The quality of information that arrives with each one.
"There is a version of on-call that is just hazing," says Ndidi Okonkwo, an SRE lead who has helped redesign on-call practices at two companies. "You normalize it by calling it the job. The people who stay are the ones who are either very talented or very tolerant of being treated badly. You lose a lot of talent that way."
The new generation of reliability engineers is building systems that assume human limits rather than ignoring them. Escalation policies that actually escalate. Runbooks that get updated after every incident. Alerting thresholds that are revisited quarterly instead of set once and forgotten. The goal is not fewer pages. It is pages that mean something when they arrive.
Ben-David's 3 a.m. is still a 3 a.m. The work is still hard. But the machinery around it has changed what it asks of her, and what she can give back when the sun comes up.
The old model was: if it pages, someone drops everything. The new model is: if it pages at 3 a.m., we didn't do our jobs earlier in the week.


