Today, custom domain tracking links and analytics ingestion were down from around 2:10 UTC to 14:25 UTC. During that window, links in emails from newsletters with tracking and custom sending domains enabled were unavailable, and email analytics for most newsletters were delayed.
This was caused by the main domain for the service that handles these tasks being deregistered from Heroku, which invalidated the DNS targets previously set. As a result, custom tracking domains and our main endpoint for receiving email events were left pointing to a CNAME target that no longer existed.
The domain was deregistered by a new internal system that removes inactive custom domains from Heroku. We created this system because we've begun to hit the 1,000-domain limit on Heroku apps, which was affecting our ability to add new custom domains. Unfortunately, this system had a bug: it didn't filter out the main domain for the service, so it deregistered that domain once it detected we had crossed a threshold.
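To illustrate the kind of guard that was missing, here is a minimal sketch of a cleanup selection step that excludes protected hostnames before anything is removed. It is not our actual implementation: the `Domain` type, the `select_domains_to_remove` function, the `protected` set, and the 90-day idle window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Domain:
    """Hypothetical record for a hostname registered on the app."""
    hostname: str
    last_seen: datetime  # last time traffic was observed for this hostname


def select_domains_to_remove(domains, protected, now, max_idle=timedelta(days=90)):
    """Return hostnames that are idle AND not on the protected list.

    The protected check runs regardless of how idle a hostname looks,
    so the service's own main domain can never be selected.
    """
    protected_lower = {h.lower() for h in protected}
    return [
        d.hostname
        for d in domains
        if d.hostname.lower() not in protected_lower
        and now - d.last_seen > max_idle
    ]


# Example with made-up hostnames: the main domain looks idle but is protected.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
domains = [
    Domain("links.example.com", now - timedelta(days=400)),
    Domain("old.customer.example", now - timedelta(days=200)),
    Domain("active.customer.example", now - timedelta(days=1)),
]
to_remove = select_domains_to_remove(domains, {"links.example.com"}, now)
```

Keeping the selection as a pure function, separate from the call that actually deregisters domains, makes it easy to unit test the exclusion rule on its own.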
How did we detect the issue?
Unfortunately, our automated systems weren't monitoring custom domain tracking links or event ingestion from our email sending providers. We detected this issue from user reports, which is unacceptable.
This is particularly bad because we had a similar incident 3 weeks ago, but we failed to implement the planned monitoring fixes before this one occurred.
How did we mitigate the issue?
At 14:20 UTC we re-added the domain to Heroku, updated DNS, and regenerated the ACM certificates. This quickly restored availability for the main domain and custom tracking domains.
How will we prevent this from happening again?
- We fixed the issue in our internal system that caused the problem in the first place.
- We're adding monitoring to alert when either of these systems (custom domain tracking and email event ingestion) fails or has lower traffic than usual. This is a high-priority issue that we've assigned an engineer to work on ASAP.
- We're prioritizing migrating this service to another platform to replace Heroku as our main hosting provider. We had already started the migration process, and will be completing it soon.
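
As a rough illustration of the "lower traffic than usual" alert described above, the sketch below compares the current event count against the median of recent comparable windows. The function name `traffic_alert`, the `min_ratio` threshold of 0.5, and the sample counts are assumptions made for the example, not a description of the monitoring we are building.

```python
from statistics import median


def traffic_alert(baseline_counts, current_count, min_ratio=0.5):
    """Classify the current window against recent history.

    baseline_counts: event counts from recent comparable windows
    current_count:   event count in the window being checked
    min_ratio:       fraction of the baseline below which we alert (assumed)

    Returns "outage", "low_traffic", or None when nothing looks wrong.
    """
    if not baseline_counts:
        return None  # no history yet, so there is nothing to compare against
    baseline = median(baseline_counts)
    if current_count == 0 and baseline > 0:
        return "outage"
    if current_count < baseline * min_ratio:
        return "low_traffic"
    return None


# Made-up counts: the baseline median is 110 events per window.
history = [100, 120, 110]
status_when_silent = traffic_alert(history, 0)
status_when_low = traffic_alert(history, 40)
status_when_normal = traffic_alert(history, 105)
```

The median is used rather than the mean so that a single unusually busy or quiet window in the history does not skew the baseline.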
