
    RCA published 2013-12-27:

    Incident Start Time

    2013-12-23 2:08 AM PST

    Duration

    The first set of impacted sites started to recover at 9:40 AM PST, with all impacted sites recovered by 1:40 PM PST.

    User Impact

    Impacted sites were either unavailable or experienced limited or poor performance during this event.

    Executive Summary

    Essential infrastructure switches failed due to a bug that caused them to be overloaded.

    Root Cause

    The infrastructure switches experienced hash index collisions, which caused problems with learning new MAC addresses in the forwarding database.  The number of hosts in use exceeded the rate of MAC learning / forwarding table insertion that the software version was able to handle, despite being well within documented hardware limits and running a recent version of the software.  This was not caused by any hard limit of the hardware for MAC or ARP learning.
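
    As a concrete illustration of this failure mode, here is a minimal sketch (in Python, not vendor code) of a forwarding table with fixed-depth hash buckets: once several MAC addresses hash to the same index and fill that bucket, new addresses mapping to it can no longer be learned even though the table as a whole is far from full.  The bucket count, bucket depth, and hash function are assumptions for illustration only.

        # Toy model of a switch forwarding database (FDB) with fixed-depth hash
        # buckets. Bucket count, depth, and hash function are illustrative
        # assumptions, not the implementation in the affected switches.
        NUM_BUCKETS = 8       # hypothetical number of hash buckets
        BUCKET_DEPTH = 4      # hypothetical entries per bucket

        fdb = [[] for _ in range(NUM_BUCKETS)]   # each bucket holds (mac, port) pairs

        def bucket_index(mac: str) -> int:
            """Map a MAC address to a hash bucket (simplified hash)."""
            return sum(int(octet, 16) for octet in mac.split(":")) % NUM_BUCKETS

        def learn(mac: str, port: int) -> bool:
            """Try to learn a MAC address; fail if its bucket is already full."""
            bucket = fdb[bucket_index(mac)]
            if any(entry[0] == mac for entry in bucket):
                return True                       # already learned
            if len(bucket) >= BUCKET_DEPTH:
                return False                      # collision: bucket full, learning stalls
            bucket.append((mac, port))
            return True

        # MAC addresses whose last octet differs by a multiple of NUM_BUCKETS all
        # land in the same bucket, so learning fails long before the table's total
        # capacity (NUM_BUCKETS * BUCKET_DEPTH = 32 entries) is reached.
        for i in range(6):
            mac = "00:00:00:00:00:%02x" % (i * NUM_BUCKETS)
            print(mac, "learned" if learn(mac, port=1) else "NOT learned (collision)")

    In this toy model, as in the incident described above, the limiting factor is not the table's total capacity but how entries are distributed across buckets.
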
    Remediation plan

    The immediate remediation was completed by upgrading the software on the switches during the outage. There are several additional action items to remove or minimize the impact of any similar events in the future. These include:

    • Maintenance on 2013-12-28 at 10pm PST to reinforce network redundancies
    • Working closely with vendors to enhance procedures to identify and prioritize critical patches
    • Implementing a new design to further segment failure domains
    • Further automating the manual recovery effort required during an incident of this scale (see the sketch after this list)
    • Improving communication procedures around incidents of this nature
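
    As a rough sketch of what automating that recovery effort could look like, the following Python snippet polls a list of site health-check URLs and flags any that are not responding.  The URLs, timeout, and restart hook are placeholders and do not represent actual Jive tooling.

        # Hypothetical sketch of automating the per-site recovery checks that were
        # done manually during this incident. Site URLs and the restart hook are
        # placeholders, not actual Jive infrastructure.
        import urllib.error
        import urllib.request

        SITES = [
            "https://customer-a.example.com/status",   # hypothetical endpoints
            "https://customer-b.example.com/status",
        ]

        def site_is_healthy(url: str, timeout: float = 5.0) -> bool:
            """Return True if the site answers its status URL with HTTP 200."""
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False

        def schedule_restart(url: str) -> None:
            """Placeholder for whatever restart/remediation workflow applies."""
            print("would schedule restart for", url)

        for url in SITES:
            if not site_is_healthy(url):
                schedule_restart(url)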


    Historical updates below

    Update: The network issue has been patched and we are in a fully operational state.  We have now fully restored all affected sites and UAT environments.  If you continue to see any issues with your Jive instance, please contact support and they will be happy to assist you.  Thank you again for your patience.


    Detail: At around 2:30 AM PST our NOC identified a sporadic network issue, seeing packet loss and loss of specific routes within our datacenter affecting a substantial portion of our customers.  To troubleshoot, we moved traffic from our primary core switch to our secondary switch, which temporarily resolved the issue.  In parallel we worked with Juniper support to verify the bug we were encountering and to update our primary switch module.  As we were doing this, the secondary switch began to show similar signs of failure.  The networking team worked quickly to update the code on the primary switch, then forced all traffic through it.  Finally, the second switch was updated and traffic resumed normal operations.  The Hosted Operations and NOC teams then worked to restore individual customer sites.  The specific bug we ran into involved hash index collisions that caused issues with learning new MAC addresses into the Forwarding DB and triggered specific race conditions.
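
    The recovery sequence described above can be summarized in the following outline.  The device names and helper functions are illustrative placeholders only; they are not real tooling and not Juniper management APIs.

        # Illustrative outline of the recovery sequence described above. Device
        # names and helpers are placeholders, not real tooling or Juniper APIs.
        def shift_traffic(to_switch: str) -> None:
            print("forcing all traffic through", to_switch)

        def upgrade_switch(switch: str) -> None:
            print("upgrading software on", switch)

        def switch_is_degraded(switch: str) -> bool:
            # In the incident this was observed as packet loss and route loss.
            return switch == "secondary-core"     # stand-in for the observed failure

        shift_traffic("secondary-core")           # 1. fail over to buy time
        upgrade_switch("primary-core")            # 2. patch the idle primary
        if switch_is_degraded("secondary-core"):  # 3. secondary hits the same bug,
            shift_traffic("primary-core")         #    so force traffic back
        upgrade_switch("secondary-core")          # 4. patch the secondary; normal ops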


    We sincerely apologize for any inconvenience caused by this outage and are working diligently to make sure this doesn't happen again.  Further, we'll be monitoring this group to answer any questions about the event.


    The Jive Hosting team is currently working on an issue that is impacting a subset of communities hosted in our US data center.  We are aggressively working to determine root cause. Updates will be provided at least every 30 minutes until this issue is resolved.

    (4:15P PDT)

    This incident is now closed. All affected UAT and Production sites have been restored. A full RCA will be performed, with a link added to this document when available. If you feel your site is still affected by this outage, please submit a support case.


    (1:14P PDT)

    We have now fully restored all affected sites.  If you continue to see any issues with your Jive instance, please contact support and they will be happy to assist you.  Thank you again for your patience.


    (11:25A PDT)

    All hosted customers and 95% of the affected cloud customers are back online.  We are working through the last 5% of affected cloud customers now and expect that to be resolved in the next hour.


    (9:53A PDT)

    Many of you have reported your sites coming back up and then going down again.  Some sites continued to be functional through the network issue, often in a partially stable state.  In order to ensure full operation, we are having to restart some sites proactively.  We'll continue to post here as we finish these restarts.  Once completed, we'll give an all-clear; if any issues remain after that time, our support team will work to resolve them.


    (9:13A PDT)

    Update:  The network issue has been patched and we are in a fully operational state.  The Hosted Operations and NOC teams are working through customers to bring them up as soon as possible - we estimate this will be complete in the next few hours.  We'll continue to post updates here.  Below is a more technical description of what happened:


    Detail: At around 2:30 AM PST our NOC identified a sporadic network issue, seeing packet loss and loss of specific routes within our datacenter affecting a substantial portion of our customers.  To troubleshoot, we moved traffic from our primary core switch to our secondary switch, which temporarily resolved the issue.  In parallel we worked with Juniper support to verify the bug we were encountering and to update our primary switch module.  As we were doing this, the secondary switch began to show similar signs of failure.  The networking team worked quickly to update the code on the primary switch, then forced all traffic through it.  Finally, the second switch was updated and traffic resumed normal operations.  The Hosted Operations and NOC teams then worked to restore individual customer sites.  The specific bug we ran into involved hash index collisions that caused issues with learning new MAC addresses into the Forwarding DB and triggered specific race conditions.


    We sincerely apologize for any inconvenience caused by this outage and are working diligently to make sure this doesn't happen again.  Further, we'll be monitoring this group to answer any questions about the event.


    The Jive Hosting team is currently working on an issue that is impacting a subset of communities hosted in our US data center.  We are aggressively working to determine root cause. Updates will be provided at least every 30 minutes until this issue is resolved.


    (9:08A PDT)

    UPDATE: We are still performing recovery on all affected sites. The initial remediation provided limited restoration, and we are now working to ensure full functionality of every site. This may require restarting a site that appears to be working (for example, where only a single node is up or other back-end complications are preventing full functionality).


    (8:39A PDT)

    UPDATE: The current event has been identified as a core networking bug. We have been actively engaged with the vendor and performed the remediation steps outlined. The primary event was initially mitigated by failing over from the primary device to the secondary device, which hit the same bug faster than anticipated. While the failover was in place, our network team worked with the vendor to upgrade the primary device to a release identified as resolving the bug. The upgrade of the primary device has been completed and we have failed back to the primary device. Now that the core network issue has been remediated, we are vigorously working to restart any affected sites.

    As a precaution we have stopped all of the UAT sites in the affected environments. We will bring the UAT infrastructure back online later in the day.


    (7:09A PDT)

    UPDATE: We are currently experiencing a secondary event related to the primary event. We are actively engaged in remediation of the secondary event. We will provide an update in 15 minutes on its status.


    (6:09A PDT)

    UPDATE: We are continuing to work on an isolated portion of the environment and continuing to restart affected sites. An update will be provided once full service has been restored.


    (5:27A PDT)

    UPDATE: Some site restorations have been delayed as we recover all dependencies. An update will be provided once full service has been restored.


    (4:14A PDT)

    UPDATE: The source of the issue has been addressed and we are currently working to restore full connectivity for all affected sites. We anticipate full service restoration within the hour. An update will be provided once full service has been restored.


    (3:40A PDT)

    UPDATE: We have isolated the source of the issue and are currently establishing the remediation plan. An RCA will be performed and linked from this document once service has been restored to the affected sites.