In part 1 of this series we outlined the issues we had with the cache system design and implementation in Jive SBS 4.0 and earlier. Today we'll outline the desired features for the next generation of Jive SBS's caching technology.


The desired features


After some debate, the major desired features were narrowed down to the following list:


  • Caches should be run in their own process. No longer should clustered caches run in the main application server process.
  • Caches should be able to be distributed across multiple cache servers to eliminate any single point of failure.
  • High availability of cache data. Often this means replicating cache data across multiple nodes such that losing a single node doesn't mean losing all the cached data on that node.
  • Fault tolerant. If the whole caching system goes down, the application should still run, though performance will understandably suffer under these circumstances.
  • Reliable. The caching solution should hold up in the face of server and network issues and should handle the data consistency problems that can arise from server or network flapping, write races, etc.
  • The cache API should not be Map based. Instead, cache operations should largely be limited to get(key), put(key, value), remove(key) and clear() (see the interface sketch after this list).
  • Serialization should be transparent in the majority of cases and shouldn't require developers to implement methods for every single class that is to be cached.
  • Individual caches shouldn't require a maximum size to be specified. Instead, the cache server should figure out which objects to evict across all caches based on standard eviction policies such as LRU (Least Recently Used), second-chance FIFO, etc. (see the eviction sketch after this list).
  • Determining the size of cached objects, if required at all, should be handled by the caching system itself and not by the classes being cached.
  • All cache operations must operate on serialized data.
  • Eventually consistent. As Jive SBS grows and scales up, keeping every cache coherent across the cluster is not something we see as compatible with our performance goals.
  • Source available. Having source code available when troubleshooting an issue can be of tremendous help in solving the issue in a timely manner.
  • TCP based. While UDP-based caches work well for many other systems and applications, it is our belief that a TCP-based solution will work best for Jive SBS, especially when Jive SBS is run within a virtualized environment. This also means no multicast for node discovery.
  • Dynamic cluster membership. While adding a node is typically a rare operation, having to restart the whole cluster to do it isn't desirable.
  • Performance should be as good as, if not better than, the current implementation in Jive SBS 4.
  • If possible, the cache server should be Java based. Our expertise is primarily in Java, and it's highly desirable that any of our developers could jump in and quickly understand the code without having to learn a new programming language.

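To make the API requirement concrete, here is a minimal sketch of the kind of non-Map cache contract described above. The interface name and generics are purely illustrative, not the actual Jive SBS types:

    import java.io.Serializable;

    /**
     * Illustrative sketch only: a deliberately small cache contract with no
     * Map semantics. Values must be Serializable so that every operation can
     * work against serialized data, and cached classes never report sizes.
     */
    public interface Cache<K extends Serializable, V extends Serializable> {

        /** Returns the cached value, or null if the key is absent or was evicted. */
        V get(K key);

        /** Stores a value; the caching layer handles serialization transparently. */
        void put(K key, V value);

        /** Removes a single entry. */
        void remove(K key);

        /** Empties this cache. */
        void clear();
    }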

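The eviction requirement pushes sizing decisions into the cache server rather than into cached classes. As a rough illustration of the simplest such policy (this is not Jive SBS code), an access-ordered LinkedHashMap in Java gives LRU eviction once an entry budget is exceeded:

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Minimal LRU sketch: an access-ordered LinkedHashMap evicts its
     * least-recently-used entry once an entry budget is exceeded. A real
     * cache server would budget by memory used rather than entry count.
     */
    public class LruStore<K, V> extends LinkedHashMap<K, V> {

        private final int maxEntries;

        public LruStore(int maxEntries) {
            super(16, 0.75f, true); // accessOrder = true yields LRU ordering
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }
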
With this list of requirements defined, the process of implementing the new cache system began. One of the first decisions was that, since we didn't yet know which distributed caching solution we would use (the list of potential solutions was initially quite long), we would take a provider-based approach for the main caching solution and use the decorator pattern to layer functionality on top of that cache. This allowed us to quickly prototype a number of potential solutions and find the one that best fit our requirements.
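
As a rough sketch of that structure, building on the Cache interface sketched above (the CacheProvider and StatisticsCache names here are hypothetical, not the actual Jive SBS classes), a provider supplies the underlying cache implementation and a decorator wraps whatever the provider returns to add behavior without touching it:

    import java.io.Serializable;
    import java.util.concurrent.atomic.AtomicLong;

    /** Supplies the underlying Cache implementation (Memcached, Ehcache, etc.). */
    interface CacheProvider {
        <K extends Serializable, V extends Serializable> Cache<K, V> getCache(String name);
    }

    /**
     * Decorator example: wraps any provider-supplied Cache and counts hits
     * and misses without the underlying implementation knowing about it.
     */
    public class StatisticsCache<K extends Serializable, V extends Serializable> implements Cache<K, V> {

        private final Cache<K, V> delegate;
        private final AtomicLong hits = new AtomicLong();
        private final AtomicLong misses = new AtomicLong();

        public StatisticsCache(Cache<K, V> delegate) {
            this.delegate = delegate;
        }

        @Override
        public V get(K key) {
            V value = delegate.get(key);
            if (value == null) {
                misses.incrementAndGet();
            } else {
                hits.incrementAndGet();
            }
            return value;
        }

        @Override
        public void put(K key, V value) { delegate.put(key, value); }

        @Override
        public void remove(K key) { delegate.remove(key); }

        @Override
        public void clear() { delegate.clear(); }

        public long getHits() { return hits.get(); }

        public long getMisses() { return misses.get(); }
    }

Swapping the backing implementation then only means configuring a different provider; the decorators stay the same, which is what made it practical to prototype several candidates quickly.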


The solutions we prototyped


Memcached, being the most popular distributed caching solution in common use, was an easy choice as one of the solutions to prototype. After much research and discussion we decided to also prototype two other solutions: Ehcache, and a rather unorthodox use of Voldemort, a Java-based persistent key-value store developed by LinkedIn. All of the candidates were open source and in production use at various companies.


In part 3 of this series we will outline the reasoning for our solution choice as well as what we did to overcome a couple of issues that threatened to derail the project.