Dear Jive Community,
I’m writing this to bring more visibility and clarity to the outage that occurred on Friday, January 14, 2011. This event shed light on the numerous ways we can improve how we communicate with our customers in these situations. We are taking immediate steps to ensure timely and consistent communication in the future and will keep you informed as we implement these steps.
Jive has received the preliminary root cause analysis from SunGard, which identifies the primary and secondary issues that caused the outage on Friday, January 14, 2011. The primary cause of the outage was a memory module failure in the NetApp storage array.
The failure of the memory module triggered the storage array to perform a failover to the redundant hardware. Due to a configuration error on the redundant hardware, the failover was not successful and the outage occurred. We had tested this failover process successfully multiple times with the previous configuration and are working with SunGard to investigate who made the improper failover configuration change, when, and why. We will implement the necessary process and control improvements to ensure this does not happen again.
After the outage occurred, SunGard successfully replaced the memory module to fix the problem. SunGard has also successfully replicated the error in a lab environment and provided Jive with a set of recommended configuration changes and an implementation plan for review. The implementation and testing of the recommended changes will be scheduled during the standard maintenance window on Saturday, January 22nd between 10 PM and 3 AM Pacific.
Jive and SunGard executive team members met in person on Monday, January 17th to review the outage and discuss go-forward actions.
Jive has also started a process with SunGard and its hardware provider to perform a complete assessment of the current state of procedures and technology associated with the hosted environment.
Jive Hosting management will perform an onsite assessment of SunGard operational processes and communication protocols during the week of January 17. This includes:
- Review of monitoring, alerting and logging systems, configurations and protocols;
- Assessment of Service Desk and Incident Response procedures;
- Review of Service Level agreements and contracts with hardware and software vendors;
- Review of Change Management processes;
- Complete audit of all hosted technology at the SunGard data center;
- Visual inspection of all hardware and facilities;
- Validation that all documentation is current and comprehensive; and
- Identification of single points of failure.
Jive is committed to providing a world-class hosting environment. We will continue to work with SunGard and other partners to ensure the appropriate processes and technology are in place to support our customers. We intend to implement changes in process and technology based on the lessons learned during this outage and on the assessment currently underway. We will make every effort to prevent this type of interruption from occurring in the future.
We plan to keep you informed and share additional information as it becomes available.
If you have any further questions, please contact firstname.lastname@example.org.