At this point all queues are clear, and mail should be flowing normally without delays.
I wanted to update everyone on the situation. The issue was completely resolved over the weekend and we had no email delays. This morning at 4am we had a flood of emails destined for an external mail server that was not accepting connections, which in turn backed up our mail queues again. Email continued to flow through but was delayed. We have isolated the queues that are having issues, and new mail is currently flowing with no delays. The existing queues should be processed in the next 1-2 hours.
Starting on 8/17, we detected an issue with one of our outbound mail clusters, which caused a delay in delivery for some hosted customers. Here are the technical details:
There are multiple mail servers in this mail delivery cluster, front-ended by a load balancer. One of those servers fell into a state where it was accepting mail but didn't seem to be attempting to deliver it, effectively just queueing the outbound email. This caused a domino effect where the mail queues for all servers became severely backed up. We disabled the node in the load balancer to give it a chance to clean up its queue and added more capacity to handle the backlog. We still have a backlog of mail, but it is quickly being worked through and should be cleared by this weekend.
Following the detection of the issue, we made the following changes across all the nodes:
* Tuned the mail server configuration
* Increased hardware resources (CPU/memory)
* Performed a full review of the current monitoring; we will be adding monitoring to detect this type of failure
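
As a rough illustration of the kind of check such monitoring could include, the sketch below flags a node that keeps accepting mail while delivering none of it, which is the failure described above. It is not our actual monitoring code: the `QueueSample` fields, the `is_stalled` function, and the `min_accepted` threshold are hypothetical names chosen for this example, and a real check would read these counters from the mail server's own statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueueSample:
    """One reading of a node's counters (hypothetical fields for illustration)."""
    accepted_total: int   # cumulative messages accepted by the node
    delivered_total: int  # cumulative messages the node has delivered
    queue_depth: int      # messages waiting in the outbound queue


def is_stalled(samples: List[QueueSample], min_accepted: int = 100) -> bool:
    """Return True if the node accepted mail over the window but delivered none.

    Compares the first and last samples in the window. The node is flagged
    only when it accepted at least `min_accepted` messages (an assumed
    threshold to ignore idle periods), delivered nothing, and its queue grew.
    """
    if len(samples) < 2:
        return False  # not enough data to compare
    first, last = samples[0], samples[-1]
    accepted = last.accepted_total - first.accepted_total
    delivered = last.delivered_total - first.delivered_total
    queue_grew = last.queue_depth > first.queue_depth
    return accepted >= min_accepted and delivered == 0 and queue_grew


if __name__ == "__main__":
    healthy = [QueueSample(0, 0, 5), QueueSample(500, 495, 10)]
    stalled = [QueueSample(0, 0, 5), QueueSample(500, 0, 505)]
    print("healthy node stalled?", is_stalled(healthy))
    print("stalled node stalled?", is_stalled(stalled))
```

A check like this, run per node behind the load balancer, would let a stalled node be pulled from rotation before the other nodes' queues back up.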
Further, we plan to move our mail cluster to physical hardware and continue to increase capacity. We sincerely apologize for any inconvenience this may have caused. Rest assured that Jive takes this issue very seriously. After the above changes, we will have a much more robust and scalable outbound mail system.