Given the recent major outages, periodic node outages, and succession of weekend maintenance events, I wanted a central vehicle to communicate what we’ve done, are doing, and will continue to do to remediate the situation. What better place than in a social business software product? I apologize in advance for this being a War and Peace-length blog post, but I felt it was important to be transparent and complete. We will use this group going forward to share our plans with you.
Summary at a Macro Level
I am sure every one of you cares most about 3 things:
- How did we get here?
- What is the current situation?
- How is Jive going to make it better?
In the back half of 2010, Jive saw unprecedented growth in users and engagement within its existing customer instances, a shift to truly mission-critical usage of its software, and the addition of some of the largest enterprises and customer communities in the market. This dramatic increase in usage, storage, bandwidth, and scale exposed weaknesses within Jive’s organizational processes and hosting architecture. Follow the link to this document to understand the major areas and details we are focusing on with respect to the above: https://community.jivesoftware.com/docs/DOC-3784
As we resolve this situation, our commitment to you is to adhere to the following principles for everything we do:
- Be Transparent
- Be Responsive
- Produce Results
For anyone who may not have seen them yet, recaps of the 2 major outages are here:
Outage #1 - January 14, 2011: https://community.jivesoftware.com/community/jivetalks/blog/2011/01/18/update-on-friday-s-outage-january-18-2010
Outage #2 - February 10, 2011: https://community.jivesoftware.com/community/support/blog/2011/02/11/update-on-service-interruption-on-thursday-feb-10
Both outages were due to 2 fundamental issues:
- Human error
- Technology failure
Both of these will always happen in technology environments. That is the reason you build change control and redundancy into any architecture. We have locked down our environments and are using the maintenance windows you have seen to triple-check that redundancy exists as it should.
As you have seen, we have had multiple maintenance windows over the course of the last several weekends. Each of these has been successful and well executed. To implement the needed changes, we are going to have a series of maintenance events over the next several weekends as well, likely through the end of March. We will do everything possible, as fast as possible, as carefully as possible, and as non-intrusively as possible to implement the changes. Most will involve no system impact; some will.
Rather than just send weekly emails, I wanted to be transparent here about where we are. I received some valid complaints from customers about lack of notice. To take the mystery out of that, I want to get us all on the following cadence:
- Let you see our current master maintenance schedule which is located here: https://community.jivesoftware.com/docs/DOC-37847
- Every Wednesday we will publish
- Final scope of what we are planning for the subsequent weekend
- This should match what is in the master schedule, apart from any tweaking we have decided to do, and should not change materially
- What the potential impact will be and the hours during which impact could occur (for example, a failover test should not affect you unless it fails for some reason)
- A detailed schedule published Thursday evening detailing all of the timing and detailed steps
- The ability to follow progress in real time, should you so choose, as we work through the maintenance activities
- Every Monday we will publish a full debrief on all of the previous weekend’s maintenance activities and then follow the communication cadence again for the following week
For those of you who have internal processes and communication lead time, you should set expectations now that there will be maintenance through mid-April. It will most likely not involve downtime on most weekends, but there is always a potential for service interruption if something unexpected occurs.
It is our expectation that by the middle of April the required instrumentation, architecture, capacity, redundancy, and personnel will largely be in place. We will also have tested failure scenarios to ensure that we have a resilient hosting environment, one resistant to issues that would impact our customers or be visible to their end users.
We will continue to update on our progress and will be proactive in communicating our successes or any issues we may encounter.