The TLS Certificate Outage Is Preventable: A Checklist

Automated renewal has been mainstream for a decade, and certificates still expire in production at companies with whole teams paid to stop it. Microsoft Teams, Spotify, Google Voice, and the Bank of England have each gone dark behind a lapsed notAfter date. The fix is not more automation. It is watching the thing that actually breaks, from the outside, and knowing the handful of places automation never reaches.

Why ACME does not save you

Every X.509 certificate carries a notBefore and a notAfter timestamp. Past notAfter, clients reject the handshake and the service is down — no grace period, no warning to the visitor beyond a full-page browser error. ACME clients like certbot are meant to renew well before that, and for a single well-behaved host they usually do.

The outages come from everything ACME does not cover:

Manually issued certs. An EV cert, a vendor-pinned cert, a one-off bought through a portal. Nobody wired up a renewal job, so it simply runs out.
Internal CAs. Certs on internal load balancers, mTLS between services, and admin panels rarely sit behind public ACME, so they age silently.
Forgotten subdomains. Renewal covers the apex and www; the marketing microsite or a regional host on a separate stack is in nobody's config.
Broken renewal nobody noticed. The cron job ran, hit a DNS or rate-limit error, exited non-zero, and no one watched exit codes. The old cert still has weeks of validity left, so there is no immediate symptom.
Expiring intermediates and roots. The leaf is fine and the chain above it lapses — more on this next.
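
One way to make a failed renewal impossible to miss is to wrap the renewal command so a non-zero exit always raises an alert. This is a minimal sketch, assuming a POSIX shell. The names run_renewal and notify are made up for illustration, notify is a placeholder you would point at your own pager or chat hook, and certbot in the usage comment stands in for whatever client you actually run.

```shell
# Hypothetical alert hook: replace the body with your pager, chat, or mail command.
notify() {
  printf 'ALERT: %s\n' "$*" >&2
}

# Run the given renewal command and raise an alert if it exits non-zero.
# The wrapper returns the command's own exit status so cron still sees it.
run_renewal() {
  status=0
  "$@" || status=$?
  if [ "$status" -ne 0 ]; then
    notify "renewal command '$*' exited with status $status"
  fi
  return "$status"
}

# Example cron entry point (assumes certbot is the ACME client in use):
# run_renewal certbot renew --quiet
```

A wrapper like this only catches failures of the job itself. It says nothing about whether the renewed cert ever reached the server that terminates TLS.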

Check the served chain, not just the leaf

A browser does not trust your certificate because it is valid. It trusts it because it can build a path from your leaf up to a root it already trusts, with every certificate on that path in date. Your leaf can have 60 days left while an intermediate above it expires tomorrow, and the connection still fails.

This is not hypothetical. On 30 September 2021, DST Root CA X3, the older root that had cross-signed Let's Encrypt's chain, expired, and a wave of older clients began rejecting certs whose leaves were perfectly current. The leaf was never the problem; the path was. Let's Encrypt's own integration guide warns against hardcoding intermediates for exactly this reason: they change.

Two practical consequences. First, a misconfigured server can serve an incomplete chain — leaf only, no intermediate. Desktop browsers paper over this by fetching the missing intermediate from the cert's AIA extension, or by reusing one they cached from another site, so it works on your laptop and fails for a first-time visitor, a mobile client, curl, or your own monitoring. Second, you have to verify what the server actually sends on the wire, not what is in your renewal directory.
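
A rough way to test this from a shell is to read the verify return code that OpenSSL's s_client prints for the chain it was sent. To our knowledge s_client does not fetch missing intermediates the way desktop browsers do, so a leaf-only server typically shows up as code 21 (unable to verify the first certificate) rather than 0 (ok). This is a sketch, assuming a POSIX shell and an OpenSSL install with a working trust store. The function names are hypothetical and example.com is a placeholder.

```shell
# Extract the first "Verify return code" line from s_client output on stdin.
first_verify_line() {
  sed -n 's/^ *\(Verify return code: .*\)$/\1/p' | head -n 1
}

# Print how OpenSSL judged the chain a server sent.
# $1 = hostname, $2 = port (defaults to 443).
served_chain_verdict() {
  host=$1
  port=${2:-443}
  openssl s_client -connect "$host:$port" -servername "$host" \
    </dev/null 2>/dev/null | first_verify_line
}

# Usage:
# served_chain_verdict example.com
```

Anything other than "Verify return code: 0 (ok)" is worth a look, though the verdict depends on the trust store of the machine running the check.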

Watch it from outside

Internal monitoring reads the cert file on disk. That tells you what should be served, not what is. If a load balancer terminates TLS with a stale config, or a CDN edge carries its own cert, the file on your origin is irrelevant. The only honest test is to connect from the public internet and inspect the certificate the server presents during the handshake: its expiry, its SANs, and whether the chain it sends builds to a trusted root.

Our uptime + TLS checker does this from the outside. It fetches a URL the way a visitor would and reports status, redirects, latency, the certificate's validTo, days remaining, SANs, issuer, and whether the served chain is trusted. It is a spot check, not a 24/7 monitor — ephemeral and rate-limited — so wire a recurring job into your own infrastructure for continuous coverage and use this when you want a fast, neutral second opinion from off your network.

From a shell you can pull the served expiry directly:

echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | openssl x509 -noout -dates

Add -showcerts to the s_client half of that pipeline to dump the full chain the server sends, then check each certificate's dates, not only the leaf.
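
Checking every certificate takes one more step, because openssl x509 reads only the first certificate in its input. The sketch below splits the -showcerts output into one file per certificate and prints each subject and validity window. It assumes a POSIX shell with awk and mktemp available. The function names are hypothetical and example.com is a placeholder.

```shell
# Split a PEM bundle on stdin into one file per certificate under directory $1.
split_pem_chain() {
  awk -v dir="$1" '
    /-----BEGIN CERTIFICATE-----/ { n++; out = sprintf("%s/cert%02d.pem", dir, n) }
    out != "" { print > out }
    /-----END CERTIFICATE-----/ { close(out); out = "" }
  '
}

# Print subject, notBefore, and notAfter for every certificate the server sends.
# $1 = hostname, $2 = port (defaults to 443).
served_chain_dates() {
  host=$1
  port=${2:-443}
  dir=$(mktemp -d) || return 1
  openssl s_client -connect "$host:$port" -servername "$host" -showcerts \
    </dev/null 2>/dev/null | split_pem_chain "$dir"
  for f in "$dir"/cert*.pem; do
    [ -e "$f" ] || continue
    openssl x509 -in "$f" -noout -subject -dates
  done
  rm -rf "$dir"
}

# Usage:
# served_chain_dates example.com
```

If only one certificate comes back, the server is most likely sending the leaf alone.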

Find the certs you forgot

You cannot monitor a host you do not know about. Public Certificate Transparency logs are an append-only record of nearly every publicly trusted certificate issued for your domains: major browsers reject certificates that are not logged, so public CAs log what they issue. Querying CT for your apex domain surfaces subdomains and one-off certs your inventory missed, including the regional host nobody added to a renewal config. CT does not see certs from your internal CA, so those still need their own inventory. Treat it as discovery: feed what you find back into your expiry monitoring.
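
One common way to run that query is crt.sh, a public CT search front end. This is a sketch, assuming curl and jq are installed and that crt.sh's JSON output keeps its current shape, an array of objects with a name_value field. The function names are hypothetical and example.com is a placeholder.

```shell
# Build the crt.sh query URL for every name under the given apex domain.
# %25 is the URL-encoded form of the % wildcard.
ct_query_url() {
  printf 'https://crt.sh/?q=%%25.%s&output=json\n' "$1"
}

# List the unique DNS names found in logged certificates for a domain.
# Assumption: each result object carries its names in "name_value".
ct_names() {
  curl -fsS "$(ct_query_url "$1")" \
    | jq -r '.[].name_value' \
    | tr 'A-Z' 'a-z' \
    | sort -u
}

# Usage:
# ct_names example.com
```

Diff the output against your inventory. Every name that appears only in CT is a host nobody is watching.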

The 47-day horizon

The window is shrinking. The CA/Browser Forum voted in 2025 to step maximum certificate validity down from today's 398 days: 200 days from 15 March 2026, 100 days from 15 March 2027, and 47 days from 15 March 2029. Domain-validation reuse drops alongside it, reaching 10 days by 2029.

The point of short-lived certs is that a compromised or misissued cert ages out fast without leaning on revocation, which has always been weakly enforced. The tradeoff is blunt: manual renewal stops being viable. A job you ran once a year by hand will run roughly eight times a year, and any cert outside automation expires before you remember it exists. Short lifetimes punish exactly the manual and forgotten certs that already cause most outages.
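
The "roughly eight times a year" figure is the year divided by the lifetime, rounded up. A quick sketch in shell arithmetic follows. The name renewals_per_year is made up for illustration, and the result is a floor, since renewing with any safety margin means more runs.

```shell
# Fewest renewals needed per 365-day year for a given maximum lifetime in days
# (ceiling division; renewing early for safety only increases the count).
renewals_per_year() {
  echo $(( (365 + $1 - 1) / $1 ))
}

for days in 398 200 100 47; do
  printf '%3d-day certs: at least %d renewal(s) a year\n' \
    "$days" "$(renewals_per_year "$days")"
done
```

For the four limits in the schedule this prints 1, 2, 4, and 8.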

Automation issues the cert. External monitoring proves it is actually served, in date, with a complete chain. You need both, and they are not the same job.

The checklist

Inventory every cert — internal CAs, mTLS, appliances, vendor portals — not just what ACME manages.
Sweep CT logs for your domains to find hosts and certs you forgot.
Monitor expiry from outside your network, not from the file on disk.
Validate the full served chain, not just the leaf: every certificate's dates, that the chain is complete, and that it builds to a trusted root.
Alert on days remaining (say, 21 and 7), and separately alert when a renewal job exits non-zero. Issuance succeeding is not the same as the new cert being served.
Assume short lifetimes now. Anything renewed by hand will not survive the move to 47 days.
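
The days-remaining alert can be sketched with openssl x509 -checkend, which exits non-zero when a certificate expires within the given number of seconds. This is a minimal example, assuming a POSIX shell and the 21- and 7-day thresholds suggested above. The function names are hypothetical and example.com is a placeholder.

```shell
# Succeeds if the PEM certificate on stdin expires within $1 days.
# A certificate that cannot be parsed also counts as expiring, so errors fail loud.
expires_within_days() {
  ! openssl x509 -noout -checkend "$(( $1 * 86400 ))" >/dev/null
}

# Print critical, warning, or ok for the certificate on stdin.
expiry_level() {
  pem=$(cat)
  if printf '%s\n' "$pem" | expires_within_days 7; then
    echo critical
  elif printf '%s\n' "$pem" | expires_within_days 21; then
    echo warning
  else
    echo ok
  fi
}

# Usage, fed from the certificate the server actually presents:
# echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | expiry_level
```

Run it from a host outside your network on a schedule, and route anything other than ok to whoever owns the certificate.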