Inbound emails bounced from CloudFilter
Incident Report for Mailprotector
Postmortem

Additional resources were added to the filtering infrastructure at approximately 4:30 pm ET on Thursday, December 1, 2022. Roughly 11% of emails received through CloudFilter either bounced or were deferred during the following windows:

  • 12/1/22 4:30 pm to 6:00 pm
  • 12/2/22 5:30 am to 2:00 pm

The additional resources were misconfigured, which prevented them from delivering email to the next hop in the CloudFilter mail flow. The resources had been added to resolve a resource constraint pattern observed at the top of the hour during early business hours.
Email messages that entered the queue on the new resources were retried by Postfix over SMTP until they were either delivered or reached the bounce threshold. Because the bounce threshold was 12 hours, bounces did not appear until long after the messages were first queued, which delayed detection of the problem.
We estimate that 54,000 messages were bounced across the two windows above.
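
To illustrate why a 12-hour bounce threshold delays visibility, the following minimal Python sketch models the retry timeline for a message that can never be delivered. It is a simplified model, not Postfix's actual scheduler: the doubling backoff and the 300-second and 4000-second bounds are assumptions based on Postfix's documented defaults for minimal_backoff_time and maximal_backoff_time, and only the 12-hour lifetime comes from this report. The function name retry_schedule is hypothetical.

```python
from datetime import datetime, timedelta


def retry_schedule(
    queued_at,
    lifetime=timedelta(hours=12),            # bounce threshold stated in this report
    min_backoff=timedelta(seconds=300),      # assumed: Postfix default minimal_backoff_time
    max_backoff=timedelta(seconds=4000),     # assumed: Postfix default maximal_backoff_time
):
    """Return (attempt_times, bounce_at) for a message that never delivers.

    Simplified model: the wait between attempts doubles after each failure
    and is capped at max_backoff. The message bounces once its age reaches
    the configured lifetime.
    """
    bounce_at = queued_at + lifetime
    attempts = []
    backoff = min_backoff
    next_attempt = queued_at
    while next_attempt < bounce_at:
        attempts.append(next_attempt)
        next_attempt += backoff
        backoff = min(backoff * 2, max_backoff)
    return attempts, bounce_at


if __name__ == "__main__":
    # A message queued when the new resources went live on December 1.
    attempts, bounce_at = retry_schedule(datetime(2022, 12, 1, 16, 30))
    print("delivery attempts before bounce:", len(attempts))
    print("sender would see the bounce at:", bounce_at)
```

Under these assumptions, a message queued at 4:30 pm on December 1 would not generate a bounce until 4:30 am on December 2, so bounce reports alone could not have surfaced the problem the same evening.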

Detection
The incident was detected when Partner Success investigated tickets that appeared to correlate with mail delivery issues on the newly deployed resources. Partner Success escalated the issue to the operations engineer, who gathered information and escalated it to the on-call engineer.

Recovery
The immediate impact was mitigated by removing the new resources from production service, and several attempts were made to resolve the issue at different points throughout the outage window.
Ultimately, the issue was resolved by adding the new external IP addresses to the Postfix configurations and to the transport servers' security group.
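
A check of this kind can be automated before new resources take traffic. The following minimal Python sketch compares the external IP addresses of new hosts against an allowlist and reports any that are not covered. The function name find_unlisted and all addresses shown are hypothetical (the addresses come from reserved documentation ranges); how the allowlist is exported from the Postfix configuration or the security group is not covered here.

```python
import ipaddress


def find_unlisted(new_ips, allowed_cidrs):
    """Return the addresses in new_ips that no entry in allowed_cidrs covers."""
    networks = [ipaddress.ip_network(cidr) for cidr in allowed_cidrs]
    missing = []
    for ip in new_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in networks):
            missing.append(ip)
    return missing


if __name__ == "__main__":
    # Hypothetical allowlist, e.g. exported from the next hop's access rules.
    allowed = ["203.0.113.0/28", "198.51.100.7/32"]
    # Hypothetical external IPs of the newly added filtering hosts.
    new_hosts = ["203.0.113.5", "198.51.100.8"]

    unlisted = find_unlisted(new_hosts, allowed)
    if unlisted:
        print("Not allowed at the next hop:", ", ".join(unlisted))
    else:
        print("All new hosts are covered by the allowlist.")
```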

Next Steps
The incident exposed gaps in the documentation of the legacy mail infrastructure and in the processes for rolling out changes to it.
Several process changes will be implemented, including but not limited to:

  • Comprehensive run book for adding new resources to the CloudFilter cluster
  • Account for identified "blind spots"
  • New staging environments for validating changes
  • Additional key metrics for post-deployment observation
  • A review of the data collected during this incident is ongoing
  • Determine metrics that were missing
  • Different alerting or notifications to get ahead of partner reports from tickets
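
As one illustration of a post-deployment metric, the following minimal Python sketch flags hosts whose share of deferred or bounced messages exceeds a threshold. It is a sketch under stated assumptions, not the monitoring the team will deploy: the function name hosts_over_threshold, the host names, the counts, and the 2% threshold are all hypothetical, and collecting per-host counts from the mail logs is outside its scope.

```python
def hosts_over_threshold(counts, max_failure_ratio=0.02, min_messages=100):
    """Return {host: failure_ratio} for hosts above max_failure_ratio.

    counts maps a host name to a dict with "sent", "deferred", and "bounced"
    totals for one observation window. Hosts with fewer than min_messages
    messages are skipped so that a handful of failures does not trigger an alert.
    """
    flagged = {}
    for host, c in counts.items():
        failed = c["deferred"] + c["bounced"]
        total = c["sent"] + failed
        if total < min_messages:
            continue
        ratio = failed / total
        if ratio > max_failure_ratio:
            flagged[host] = ratio
    return flagged


if __name__ == "__main__":
    # Hypothetical counts for a short window after a deployment.
    window = {
        "filter-existing-1": {"sent": 990, "deferred": 10, "bounced": 0},
        "filter-new-1": {"sent": 0, "deferred": 500, "bounced": 0},
    }
    for host, ratio in hosts_over_threshold(window).items():
        print(f"ALERT {host}: {ratio:.0%} of messages deferred or bounced")
```

Watching deferrals rather than bounces matters here because, with a 12-hour bounce threshold, deferrals are the earliest signal that a host cannot reach the next hop.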

The incident is resolved and the affected resources are restored. However, the team will continue implementing processes to prevent a repeat of mail flow performance problems. This effort does not end with resolving this incident; it refocuses the team on the iterative improvement of stable infrastructure management.

Anticipated FAQs

  • Can the bounced emails be resent?
      • No. Once a message is bounced, the mail server removes it from the queue. SMTP (Simple Mail Transfer Protocol) does not keep a copy of a bounced email.

  • Can I receive a list of emails that were bounced?
      • Unfortunately, no. The logs are not organized in a way that allows that information to be pulled together. The SMTP response for an individual message appears in its log timeline in the Console, but looking messages up this way is a manual process.

Posted Dec 05, 2022 - 15:54 EST

Resolved
Posted Dec 01, 2022 - 17:30 EST