At Jive we strive to satisfy our customers with the best social business software application that we can build. We have a constant drive to improve what we already have while pushing the envelope to provide a unique social-based solution to common business problems. Our software, Jive SBS, has been a study in constant evolution ever since its roots in Jive Forums, with new functionality being introduced continuously and major architectural changes being made as required. One aspect of the application that hasn't seen a lot of changes over the years, however, is the caching layer used by Jive SBS. Like most high-performance server applications, Jive SBS relies heavily on in-memory caching to improve performance and reduce load on the underlying database. Without this caching layer our application could not do nearly as much as it does, nor scale anywhere near the loads seen every day in production by our customers. It's a testament to the initial design choices that the cache implementation has stood the test of time as well as it has. That being said, over time it became apparent that there were limitations and problems that needed to be addressed.
The problem with map based caches
Caching (and clustering) in Jive SBS 4 is implemented using Oracle Coherence, with each application server in the cluster acting not just as an application server but also as a cache server. The caches defined in Jive SBS (of which there are well over a hundred) are based on Coherence's distributed map architecture and both benefit and suffer from that choice. Having caches be map-based leads developers to treat the caches no differently than they would any other map. This allows new developers to get up to speed faster, but it can lead to performance issues, heavier than expected network traffic and other problems that tend to show up only when systems are under heavy load. More importantly, design choices which worked well historically have proven to be problematic, which has led us to reevaluate those choices and indeed the full requirements for what Jive SBS's caching layer should look like. What we came up with in that reevaluation was a design that on the surface looks similar to what was in Jive SBS 4 but on closer inspection is significantly improved.
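To make the network-traffic point concrete, here is a minimal sketch of what "treating the cache like any other map" can cost. It does not use Coherence at all: `CountingRemoteCache` and `MapStyleAccessDemo` are hypothetical names invented for this illustration, and the stub simply counts one round trip per call to stand in for a network hop to another cache node.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a distributed cache: every read call is assumed
// to cost one network round trip, which we simply count.
class CountingRemoteCache {
    private final Map<String, String> store = new HashMap<>();
    int roundTrips = 0;

    // Writes are not counted in this sketch; only reads are compared.
    void put(String key, String value) {
        store.put(key, value);
    }

    // Single-key read: one round trip per call.
    String get(String key) {
        roundTrips++;
        return store.get(key);
    }

    // Bulk read: one round trip regardless of how many keys are requested.
    Map<String, String> getAll(List<String> keys) {
        roundTrips++;
        Map<String, String> result = new HashMap<>();
        for (String key : keys) {
            String value = store.get(key);
            if (value != null) {
                result.put(key, value);
            }
        }
        return result;
    }
}

class MapStyleAccessDemo {
    // Map-style habit: loop over the keys and call get() for each one.
    static int naiveLookups(CountingRemoteCache cache, List<String> keys) {
        cache.roundTrips = 0;
        for (String key : keys) {
            cache.get(key);
        }
        return cache.roundTrips;
    }

    // Cache-aware alternative: ask for all keys in a single request.
    static int bulkLookup(CountingRemoteCache cache, List<String> keys) {
        cache.roundTrips = 0;
        cache.getAll(keys);
        return cache.roundTrips;
    }

    public static void main(String[] args) {
        CountingRemoteCache cache = new CountingRemoteCache();
        List<String> keys = Arrays.asList("user:1", "user:2", "user:3");
        for (String key : keys) {
            cache.put(key, "value-for-" + key);
        }
        System.out.println("per-key gets: " + naiveLookups(cache, keys)); // prints 3
        System.out.println("bulk get:     " + bulkLookup(cache, keys));   // prints 1
    }
}
```

On a local `HashMap` the two approaches are indistinguishable, which is why the habit is easy to fall into. On a distributed cache the per-key loop scales its network cost with the number of keys, and that tends to become visible only under load.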
Why we redesigned the cache
As part of the cache system reevaluation we started by going over the concerns we had with the 4.0 and earlier design, so that we would have a baseline against which to compare any solution we came up with. Outlined below is a sampling of the issues we discussed:
- Having caches exist only in the same process as the application server has proven to be an issue. Caches by their very nature use a lot of memory and, if not configured correctly, can quickly consume all available memory in the JVM, causing the dreaded OutOfMemoryError. In-process caches also put a strain on Java's garbage collector, which has to balance normal app server traffic (a constant stream of many small, largely short-lived objects) against caches, which tend to hold onto objects for a much longer period of time. While having caches be in-process was a great design decision historically, it was not without significant risks and issues.
- Every cache is sized independently. While this worked fine when there were just a few caches in the application, as the number of caches increased various measures had to be put into place to make administration easier. These measures, including system-wide defaults for small, medium and large installations, helped but did not solve the underlying problem. Having individually sized caches meant that administrators had to experiment a fair bit to come up with a set of cache sizes that gave the best cache usage patterns they could get without pushing overall cache usage above the recommended percentage of the maximum memory given to the JVM. This balancing of cache sizes was a common exercise for administrators, and it led to some caches being undersized while others were oversized, which amounted to 'wasted' memory.
- The behaviour of caches was different for a single node install vs one with multiple nodes. Cached objects in a single node install were not serialized, whereas objects in a multiple node installation most often were. This had two repercussions. The first is that developers working on a single node would not see serialization problems until very late in the development process when they started testing in a cluster, or in some cases not at all if an object was rarely cached. The second is that the number of objects that could be stored in caches was dramatically different depending on whether the system was clustered or not, a rather surprising situation for many customers.
- While caching is important for the best performance of the system, cache operations should not block web requests unnecessarily. Unfortunately, all too often this was something that we saw in production when one node in a cluster had memory issues (a memory leak or improper configuration causing a full GC to occur). This tended to have a domino effect where problems on one node would start to cause other nodes to back up, resulting in a cluster-wide failure. There are possible mitigations for this in Coherence, including timeouts (which throw Exceptions) and increasing the replication factor (so cached data is stored on multiple nodes), however neither has been used. Having a replication factor greater than one is a great idea, however the increased load on the network always seemed to be an issue, especially on networks which were already strained and/or running on less than stable hardware.
- Manual serialization. The serialization mechanism used in Jive SBS 4 and earlier is Coherence's version of the Externalizable interface in Java. It's extremely fast, however it's prone to developer errors and omissions. This is especially true when classes underwent changes due to refactoring, feature improvements or other changes that resulted in new fields being introduced into the class but not included in the readExternal/writeExternal methods used to deserialize and serialize the object. This had the unfortunate effect that the updated code seemed to work fine and the object serialized fine, however when deserialized it was missing the data corresponding to the newly introduced fields.
- Manual determination of serialized object size. Originally, with just a few classes that would be serialized, it was a fairly simple exercise to determine just how large a serialized version of the object would be without actually having to serialize the object. This information was used in part to determine the sizes of caches and when cache eviction would start to occur. The problem with determining this manually is twofold: firstly, it suffers from the same omission problem as described in the previous issue, and secondly, and possibly more importantly, it often returned a size that wasn't close to the real size of the object. This resulted in caches thinking they were smaller than they really were, leading on occasion to the dreaded OutOfMemoryError.
- Caches were not transaction aware. While the vast majority of transactions commit successfully, there are always transactions that for one reason or another fail and have to be rolled back. With caches not being transaction aware, any cache puts made prior to the method call that caused the rollback were already available to every node in the cluster. This invalid data then had to be manually removed from the cache, a very tedious and error-prone process which was often overlooked.
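The manual serialization pitfall from the list above is easy to reproduce. The sketch below uses the standard `java.io.Externalizable` interface as a stand-in for Coherence's variant, which is an assumption made so the example runs without Coherence on the classpath. `CachedUser` and `SerializationOmissionDemo` are hypothetical names, not Jive SBS classes. The `email` field plays the role of a field added during a later refactoring that nobody wired into the serialization methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical cached bean, for illustration only.
class CachedUser implements Externalizable {
    String username;
    String email; // added later, but never added to the methods below

    // Externalizable requires a public no-arg constructor.
    public CachedUser() {
    }

    CachedUser(String username, String email) {
        this.username = username;
        this.email = email;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(username);
        // Omission: email is not written, and nothing fails here.
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        username = in.readUTF();
        // Omission: email is not read, so it keeps its default value (null).
    }
}

class SerializationOmissionDemo {
    // Serializes the object to bytes and reads it back, roughly what
    // happens when a value travels to and from another cache node.
    static CachedUser roundTrip(CachedUser user) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(user);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (CachedUser) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        CachedUser original = new CachedUser("alice", "alice@example.com");
        CachedUser copy = roundTrip(original);
        System.out.println(copy.username + " / " + copy.email); // prints "alice / null"
    }
}
```

No exception is thrown at any point, which is what makes this class of bug hard to spot: the object that comes back out of the cache simply has less data than the one that went in. On a single node, where objects are not serialized, the problem would not appear at all.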
These and other issues led us to the conclusion that a redesign was in order. Tomorrow, in part 2 of this series, we will go over the features we wanted in the next generation of Jive SBS's caching technology and outline some of the solutions we prototyped and evaluated.