Some of our current work here at Jive involves building systems composed of many logical services, each hosted in multiple processes per service. Knowing the state of the system means combining several methods to build a gestalt sense of healthy or unhealthy. One simple method we use is a 'status' REST endpoint hosted by every service instance. In implementing the code behind this endpoint, we faced the questions: what exactly does it mean to be unhealthy, and what flavors of unhealthy do we need to differentiate? It seemed simple and obvious until we actually had to think about it. Here's what we came up with for states that would fail this simple status-check-via-REST-call:
Before we even access the status endpoint of a service, there are a couple of failures we can trap:
1. Process down. That's the most obvious one, right? If it's dead, that's bad. But what if it's being restarted as part of a deploy? Our status check will report failure if we can't connect to the endpoint, so to avoid false failure alerts we have to coordinate deploys with status-check timing.
2. Process is there, but the request for status itself times out. If the service can't serve an HTTP request, that's also bad. It may not be any fault of the code or the host it's running on; the service might just be overloaded, and the real problem is that we need to stand up more instances.
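A minimal sketch of such a probe, distinguishing the two pre-endpoint failures above, is shown below. The class name, endpoint path, and timeout values are illustrative, not our actual code; it uses only the JDK's `HttpURLConnection`:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.SocketTimeoutException;
import java.net.URL;

// Probes a status endpoint and distinguishes the two pre-endpoint failures:
// DOWN (can't connect at all) vs TIMED_OUT (connected, but no answer in time).
public class StatusProbe {
    public enum Result { UP, DOWN, TIMED_OUT }

    public static Result probe(String statusUrl, int connectMillis, int readMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(statusUrl).openConnection();
            conn.setConnectTimeout(connectMillis); // traps "process down"
            conn.setReadTimeout(readMillis);       // traps "request itself times out"
            int code = conn.getResponseCode();
            return code == 200 ? Result.UP : Result.DOWN;
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;
        } catch (IOException e) {
            return Result.DOWN;
        }
    }
}
```

Separating the connect timeout from the read timeout is what lets the caller tell "nothing is listening" apart from "it's listening but too overloaded to answer."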
Now on to the states which the status call implementation reports:
1. Request processing latency mean is over some configured threshold. Some of our services process REST calls, others read a stream of events off of a bus. Each of these consumer types tracks a round-trip latency metric that we can use for this check.
2. Error-level messages have been logged in the past X seconds. The idea here is that if any code in the service knows there is a problem, it will log it, so the fact that an error-level log message went out means something somewhere thinks it is unhealthy. Using this check also has a couple of side effects. First, it eases the first step an ops or dev (or devops) person goes through when looking for trouble: figuring out whose logs to look at. When you have 50 logs to pick from, it's nice to know up front which have recent errors. Second, if a developer logs at error level when they shouldn't, those messages are going to trip alerts and someone will come ask them about it. It's a good incentive to be conscientious about what log levels you use.
3. Error-level messages have been logged while trying to interact with some other part of the system in the past X seconds. This is exactly the same as the previous check, but gives a service a way to indicate the difference between "I have problems" and "Something I rely on has problems." For example, if there is a problem connecting to HBase or Kafka, or making a REST call to some other service, the resulting error-level log messages are flagged separately. This helps us follow a chain of failures back to the source of the problem.
4. Inability to log, send metrics, or send service announcements to indicate presence. We rely on a service being able to do these basic things to participate in the system, so inability to do them results in a status check failure.
5. Percentage of time spent in GC is over some configured threshold. Often we go to GC logs to see what's going on with a slow process, and this check gives us an indication of whether there might be a reason to look there.
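Check #1 above can be sketched as a rolling mean compared against a configured threshold. This is an illustrative sketch, not our actual metrics code; the window size and threshold are made-up parameters:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Tracks round-trip latencies and reports unhealthy when the mean over a
// rolling window exceeds a configured threshold.
public class LatencyCheck {
    private final Deque<Long> window = new ArrayDeque<>();
    private final int windowSize;
    private final double thresholdMillis;
    private long sum = 0;

    public LatencyCheck(int windowSize, double thresholdMillis) {
        this.windowSize = windowSize;
        this.thresholdMillis = thresholdMillis;
    }

    public synchronized void record(long latencyMillis) {
        window.addLast(latencyMillis);
        sum += latencyMillis;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // evict oldest sample
        }
    }

    public synchronized boolean healthy() {
        if (window.isEmpty()) return true; // no data yet: don't alarm
        return (double) sum / window.size() <= thresholdMillis;
    }
}
```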
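For check #5, the JVM already exposes accumulated collection time through the standard management beans. A sketch of the idea (a production version would measure over a recent window rather than since JVM start):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Percentage of JVM lifetime spent in GC, computed from the standard
// GarbageCollectorMXBeans.
public class GcTimeCheck {
    public static double percentTimeInGc() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) gcMillis += t;
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        return uptime == 0 ? 0.0 : 100.0 * gcMillis / uptime;
    }

    public static boolean healthy(double thresholdPercent) {
        return percentTimeInGc() <= thresholdPercent;
    }
}
```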
In short, we've decided to track:
1. Down or unreachable
2. Too slow
3. Upset about itself
4. Upset about something else
5. Cut off from the rest of the system
6. Memory problems
Any service can add its own health checks targeting application-specific state, but these are standard across all of them. They give us an indication not only that something is wrong but, just as important, they point us in an initial direction when doing root cause analysis.