This is something we hear regularly in Aurea Support, and in most cases flow control is the cause.

Many customers have heard about flow control, but most are not fully aware of the details.

Some even consider it a product bug or limitation.

I understand that it can cause pain, but there is a reason for it, which is why I thought it is worth explaining in more detail:

 

What is flow control?

In a messaging system you always have a producer and a consumer. Ideally the consumer is at least as fast at processing messages as the producer. In reality this is not always possible.

Typical reasons are spikes in load, outages on the consumer side, or simply a poorly designed architecture.

CX Messenger (Sonic) of course provides some buffers, but once these are full, message processing is impacted.

By default the producer is simply blocked until space is available on the broker side to take the next message.

This is what we call flow control.
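
To make this concrete, here is a minimal sketch of a JMS producer (the connection factory, queue name and payload are made up for illustration). The send() call is where a flow-controlled producer blocks:

    import javax.jms.*;

    public class OrderProducer {
        // connectionFactory is assumed to come from JNDI or from the
        // provider-specific API of your CX Messenger installation
        public static void sendOrders(ConnectionFactory connectionFactory) throws JMSException {
            Connection connection = connectionFactory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer = session.createProducer(session.createQueue("orders.in"));
                for (int i = 0; i < 10000; i++) {
                    // With flow control active (the default), send() simply blocks once the
                    // broker-side buffers are full and resumes when consumers have caught up.
                    producer.send(session.createTextMessage("order-" + i));
                }
            } finally {
                connection.close();
            }
        }
    }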

 

Let’s get into a bit more detail on this per JMS messaging domain:

 

Point-to-Point (Queues)

Recap of PTP basics: n producers and n consumers are allowed per queue, but each message is delivered to only one of the queue's consumers.

 

If the consumers are not fast enough (or disconnected) the broker will queue the messages per queue. Each queue has two configuration options, Save Threshold and Maximum Size.

The Maximum Size defines how many kilobytes of message data the queue can hold.

The Save Threshold defines how much of this data is kept in memory; the rest goes to disk.

Once the Maximum Size is reached, flow control kicks in. For example, with a Maximum Size of, say, 10,000 KB and a Save Threshold of 1,000 KB, the first 1,000 KB of backlog is kept in memory, anything beyond that is saved to disk, and producers are flow controlled once the backlog reaches 10,000 KB.

 

Publish/Subscribe (Topics)

Recap of PubSub basics: n producers and n consumers are allowed per topic, and every consumer of the topic will get each message.

 

If the consumers are not fast enough the broker will queue the messages per subscriber. Each subscriber has buffers which are configured globally in the broker properties.

Once the per-subscriber buffer of one subscriber (of a given topic/pattern) is full, flow control kicks in on that particular topic. This means the slowest subscriber defines/limits the message delivery rate for all subscribers.

To be clear: at that point all the other subscribers on that topic no longer get messages and the publisher is blocked.

(In case you were wondering, yes it is key to detect this guy to prevent flow control. We will get there soon.)
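
To picture "this guy": a subscriber like the following sketch (topic name and the artificial delay are made up) is all it takes. While its listener is busy, messages pile up in its buffer, and once that buffer is full the publisher and every other subscriber on the topic are held up:

    import javax.jms.*;

    public class SlowSubscriber {
        // connection is assumed to be an open JMS connection to the CX Messenger broker
        public static void subscribe(Connection connection) throws JMSException {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageConsumer consumer = session.createConsumer(session.createTopic("prices.updates"));
            consumer.setMessageListener(message -> {
                try {
                    // Simulates expensive processing, e.g. a slow database call.
                    Thread.sleep(500);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            connection.start();
        }
    }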

 

Can I avoid flow control?

Now that you know that there are limiting factors, your questions might be:

 

     "How to avoid such situations?

     Or how can flow control be avoided at all?

     But is it really a bad thing?

     Does it even help in your architecture?"

 

The CX Messenger JMS API allows you to disable it; the message producer then gets an exception whenever flow control would otherwise kick in.

In most architectures, though, you would not want to do that, but rather get to the bottom of the cause and act accordingly.
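
To give an idea of what disabling flow control looks like in code: the sketch below assumes the setFlowControlDisabled setter on the provider-specific progress.message.jclient.ConnectionFactory (please check the API reference of your CX Messenger release for the exact method and the exception you will see); the queue name is made up.

    import javax.jms.*;

    public class NoFlowControlProducer {
        public static void send(progress.message.jclient.ConnectionFactory factory) throws JMSException {
            // Provider-specific switch: instead of blocking the producer,
            // the broker rejects the send once its buffers are full.
            factory.setFlowControlDisabled(true);

            Connection connection = factory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageProducer producer = session.createProducer(session.createQueue("orders.in"));
                try {
                    producer.send(session.createTextMessage("order-42"));
                } catch (JMSException notDelivered) {
                    // The problem now surfaces here instead of a silently blocked send();
                    // it is up to the application to retry, shed load or alert.
                }
            } finally {
                connection.close();
            }
        }
    }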

 

So how can you avoid/reduce flow control? As you might guess there is no simple answer to it. It all depends on the cause and is very specific to each implementation.

There are buffers and there is the pace at which messages are produced and consumed. These are the key factors that you have to look at.

 

For example:

  • For PTP you can increase the number of consumers so that messages are consumed faster (see the sketch after this list). A larger maximum queue size will help with spikes in messaging load, but it will increase latency (messages might stay longer in the queue).
  • Similarly, for PubSub you can increase the buffers, but again there is a latency impact as well as a memory impact. In addition there is this magic switch called “Flow To Disk” which allows you to use the whole hard disk as a buffer.
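
Here is a minimal sketch of the first bullet (queue name and consumer count are made up): one session per consumer on the same queue, so the listeners can run in parallel and the broker can spread the backlog across them.

    import javax.jms.*;

    public class ParallelQueueConsumers {
        // connection is assumed to be an open JMS connection to the CX Messenger broker
        public static void start(Connection connection, int consumerCount) throws JMSException {
            for (int i = 0; i < consumerCount; i++) {
                // Sessions are the unit of serial delivery, so each consumer gets its own.
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer consumer = session.createConsumer(session.createQueue("orders.in"));
                consumer.setMessageListener(message -> {
                    // PTP: each message goes to exactly one of these consumers,
                    // so more consumers means a higher overall drain rate.
                });
            }
            connection.start();
        }
    }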

 

     “So I just enable that magic switch and all good, great!”

 

Wrong, let me stop your enthusiasm here for a moment.

I personally think Flow To Disk is the worst feature we have.

You wonder why?

The feature itself is great, but the way it is often used causes issues. It simply hides bad architecture and bad configuration. People tend to enable it by default and do not want to invest in proper load tests and architectural/configuration changes. Then, once everything is stuck (e.g. the disk is full or the memory reference buffer is full), Aurea Support is pulled in and is supposed to fix it.

At this stage, though, most projects are already live and cannot easily make major changes.

Hopefully this blog post helps you to not make the same mistake.

 

(Screenshot: FlowToDisk notification)

 

Back to PubSub: Another option to avoid/reduce flow control is to use shared/grouped subscribers.

It ensures that each message is consumed only once per shared group.

This allows parallel processing of messages within the group, while each message is still processed only once.
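
The exact way to create shared/grouped subscribers in CX Messenger is release specific (check the product documentation for the grouped-subscription syntax of your version); purely to illustrate the concept, this is what the equivalent standard JMS 2.0 shared consumer looks like (topic and group names are made up):

    import javax.jms.*;

    public class SharedGroupMember {
        // connection is assumed to be an open connection to a JMS 2.0 capable broker
        public static void join(Connection connection) throws JMSException {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("prices.updates");

            // All consumers created with the same subscription name share one subscription:
            // every message on the topic is delivered to only one member of the group,
            // so several members can drain the backlog in parallel.
            MessageConsumer member = session.createSharedConsumer(topic, "pricing-workers");
            member.setMessageListener(message -> {
                // process the message here
            });
            connection.start();
        }
    }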

 

How do I know what the cause of flow control is in my architecture?

I hope by now you are convinced that flow control is great and Flow to Disk has to be used with caution.

So the question is: how do you even know that you run into flow control?

 

To detect whether your current deployment is stuck due to flow control, the quickest way is to get a Java thread dump using "jstack -l <pid>".

Look for threads blocked within a 'Job.join' call inside a send or publish. This indicates that the client is waiting to send a message to the broker, which is most commonly due to flow control.

 

For example:

 

"JMS Session Delivery Thread" (TID:0x101E7D30, sys_thread_t:0x3DDDBE8, state:CW, native ID:0x1F9C) prio=5

    at java.lang.Object.wait(Native Method)

    at java.lang.Object.wait(Object.java(Compiled Code))

    at progress.message.zclient.Job.join((Compiled Code))

    at progress.message.zclient.Publication.join((Compiled Code))

    at progress.message.zclient.Session.publishInternal((Compiled Code))

    at progress.message.zclient.Session.publishInternal((Compiled Code))

    at progress.message.zclient.Session.publish((Compiled Code))

    at progress.message.zclient.Session.publish((Compiled Code))

    at progress.message.jimpl.MessageProducer.internalSend((Compiled Code))

    …

 

 

From a proactive monitoring perspective there are several options that the product offers.

Which of the options is best for you depends on product usage.

 

You can set up flow-control-related broker notifications. The PubPause/SendPause notifications are the starting point.

There are additional notifications as well (e.g. inter-broker flow control) which you should familiarize yourself with.

These notifications may cause a lot of noise, and operations teams rarely really investigate them.

Some advanced teams offload these to Elasticsearch for analytics. Of course, the better you have configured the system, the less noise there is.

These notifications allow you to identify which consumer is causing flow control. The details are available in the PubPause notification:

 

 

 

Note: PubPause/PubResume does not apply/work if you use a shared/grouped subscription!

     (SlowSubscriber and BackloggedSessionSkip are key here, see below.)
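
How you subscribe to these notifications is also release specific: the Sonic Management Console can route them, or you attach a listener through the management framework's JMX connector (the connector setup is omitted here). Assuming the events reach your code as standard JMX notifications, a sketch of a listener that picks out the pause events could look like this:

    import javax.management.Notification;
    import javax.management.NotificationListener;

    public class FlowControlNotificationHandler implements NotificationListener {
        @Override
        public void handleNotification(Notification notification, Object handback) {
            String type = notification.getType();
            // Only react to the flow control related events discussed above.
            if (type.contains("PubPause") || type.contains("SendPause")) {
                // The notification carries the details (topic/queue and the client
                // causing flow control), so log or forward it for analysis.
                System.out.println("Flow control event: " + type + " - " + notification.getMessage());
            }
        }
    }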

 

Especially for PubSub, flow control monitoring has more options. In case you have enabled Flow To Disk, the disk usage of the PubSub store and the memory usage of Flow To Disk can be monitored.

There is another notification which helps to identify slow subscribers; especially (but not only) for shared subscribers it is super helpful: application.session.SlowSubscriber

 

 

If a message is stuck for a defined number of milliseconds at the front of the subscriber's buffer, a notification is generated.

This does not replace PubPause, but it allows you to detect stuck messages even if flow control has not kicked in (yet).

(For PTP the queue.messages.TimeInQueue notification is the best equivalent. It allows you to get notified if a message has been pending in a queue for too long.)

 

Related to slow subscriber monitoring, there is another corner case where a shared subscription might back up on one member of the group. Normally this would slow down the whole group, but it might not even cause flow control. In more recent releases this has been improved to favor the faster clients when distributing messages within a group.

 

A new notification application.session.BackloggedSessionSkip is raised to identify clients that are backing up.

 

 

Once you have identified the consumer(s) causing this, the next question is: why is the consumer application not fast enough?

 

The answer to that will be given in my next blog post.

 

 

 

References:

How can a thread dump be generated from a Sonic Container or Client?

Assessing Flow Control condition.

How to monitor subscribers to identify slow message consumption?

Slow shared subscriber impacts other subscribers in the group

Monitoring for flow control using the Sonic Management Console

What is Flow to Disk?

Under what condition a publisher might get flow controlled even though flow to disk is enabled?

Publisher flow controlled even though FlowToDisk is enabled.