Tags: SRE, Observability, Platform Engineering, Government, AWS, Critical Infrastructure

Healthcare.gov: From Email Hell to SRE Excellence

Transformed a mission-critical government application from weekly SEV1 outages to nearly zero SEV1 incidents per year by pioneering SRE practices and observability infrastructure.

June 15, 2017

Overview

I was hand-picked to lead the engineering team keeping Healthcare.gov operational during its most critical period. What started as a rescue mission evolved into a complete transformation of how the team approached reliability, monitoring, and incident response.

The Challenge

I have plenty of experience firefighting on projects that went sideways, but this one really opened my eyes. People were sleeping on cots in the office because the system crashed so often; it was all-hands-on-deck 24x7 just to keep things running.

The operational reality was chaotic:

  • Weekly SEV1 outages
  • Everything was managed via email: I received 10,000+ emails in my first week alone. I was issued a ‘developer laptop’ on day 2 because, at that volume, Outlook would fill the disk on a standard laptop within a few weeks!
  • Zero visibility into what was actually breaking

We were spending all our time putting out fires and had no time left to understand why the fires were starting.

The Solution

Recognizing that SRE practices were the answer, I convinced leadership that we needed to rethink how we worked, and I took a product-led approach to observability.

Building the Platform Team

  1. Infrastructure Enhancement

    • Deployed New Relic for application performance monitoring (real and synthetic)
    • Implemented Splunk for log aggregation and analysis, making aggressive use of dashboards
    • Created custom dashboards for key components
  2. Developer Enablement

    • Taught engineers how to instrument their code properly
    • Created onboarding guides for observability tools
    • Built libraries and templates for consistent monitoring patterns
  3. Shift Left on Observability

    • Made monitoring a first-class feature in every story
    • Established user story requirements for proper instrumentation
    • Created preventive action triggers from system signals
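A "preventive action trigger" of the kind listed above could take many forms, and the actual rules used on the project are not described here. As a minimal sketch only, assuming a hypothetical rule that fires when the error rate over the most recent requests crosses a threshold, it might look like this in Python (the class name, window size, and threshold are all illustrative assumptions):

```python
from collections import deque


class ErrorRateTrigger:
    """Hypothetical trigger: fires when the recent error rate exceeds a threshold."""

    def __init__(self, window=100, threshold=0.05, min_samples=20):
        # Keep only the most recent `window` outcomes; older ones fall off.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        # Avoid firing on a handful of early requests.
        self.min_samples = min_samples

    def record(self, is_error):
        """Record one request outcome; return True if action should be taken."""
        self.outcomes.append(bool(is_error))
        if len(self.outcomes) < self.min_samples:
            return False
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold
```

In use, each request handler would call `record(...)` with its outcome, and a `True` result would kick off whatever preventive step the team has agreed on (paging, scaling, or opening a ticket) before the problem grows into an outage.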

The journey was not for the faint of heart. In the midst of constant firefighting, we invested time in setting up infrastructure, changing mindsets, and teaching teams that observability isn’t an afterthought. We treated it like a product, with clear requirements, user stories, and measurable outcomes.
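To illustrate what "libraries and templates for consistent monitoring patterns" can mean in practice, here is a minimal Python sketch of a shared instrumentation decorator. It is not the project's actual library: the `instrumented` name, the `observability` logger name, and the event fields are assumptions chosen for illustration. It uses only the standard library and emits one JSON log line per call, a format that log aggregators can generally parse.

```python
import functools
import json
import logging
import time

# Hypothetical shared logger name; a real team would pick its own.
logger = logging.getLogger("observability")


def instrumented(operation):
    """Wrap a function so every call emits one structured log event."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise  # never swallow the caller's exception
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                event = {
                    "operation": operation,
                    "status": status,
                    "duration_ms": round(elapsed_ms, 3),
                }
                logger.info(json.dumps(event))

        return wrapper

    return decorator
```

A developer would then write `@instrumented("eligibility_lookup")` (a made-up operation name) above a function, and every call would be timed and logged the same way across the codebase, which is what makes dashboards built on those events consistent.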

Results

The transformation was dramatic. In about a year:

  • Reliability: Reduced SEV1s from multiple per week to one in the following year
  • Quality of life: No more sleeping on cots, and engineers are no longer burning out
  • Team capacity: Shifted from reactive firefighting to proactive engineering

The observability foundation became a core asset when I later led the team through a cloud migration of 6,000+ VMs. Because every system was instrumented, the migration was seamless, and monitoring continued with very little interruption.


Key Takeaway

Observability is a core feature. Never, ever ship anything you can’t monitor. Instrument everything. Your future self—and the engineers who come after you—will thank you.