
In an always-connected world, businesses need to integrate a complex network of systems and applications to keep pace with customer expectations.  These expectations continue to drive the surge of “big data,” estimated at 2.5 exabytes (2.5 billion gigabytes) of information generated per day across the whole of the internet.  As more networks come online, organizations need a Middleware provider that can continue to supply all of the connectors needed to make these systems communicate without getting bogged down in time-consuming custom implementations.

 

It was a crisp morning in the fall of 1968 when the NATO Science Committee met in the town of Garmisch, Germany.  The topic of this particular day: Middleware. From that point on it was recognized that the only way to make complex computer systems speak to each other was through a Middleware layer.  While different systems have come and gone over the years, Middleware has largely remained the same, continuing to offer the same types of integration capabilities regardless of OS, database, or application.

 

As we move into Middleware's fifth decade of existence, Aurea is taking a fresh perspective on our CX platform to determine how it can continue to serve the interests of our global partners.  Enterprises are becoming more complex, with application, hosting, and database usage expanding exponentially every year. With this in mind, Aurea is positioning itself to stay at the head of the pack when it comes to ESB technology and the connectors offered to our clients. Building upon our history of providing a rock-solid and fast ESB, Aurea has started to architect the future, looking at what truly matters to the enterprises using our Middleware services.

 

As we finalize our vision of what an ESB looks like for the 21st century and beyond (microservices, connectors, services, etc.), the Aurea CX platform will continue to grow and adapt to the needs of our clients, and we will lay out a long-term roadmap to get us there.  Tentatively, we will bring our vision to market in early 2020. We will continue to update you as we make progress and are excited to share this journey with you.

We are excited to announce that the Aurea Customer Experience Platform 2019.1 release is now available. This release provides improved quality, stability, and speed.

 

CX Messenger

  • Improved performance: the CX Messenger installer is now 32x faster
  • Resolved 15 bugs


CX Monitor

  • Resolved 17 bugs


CX Process

  • Resolved 50+ bugs

 

To access the latest versions, along with all product documentation and release notes, visit the product library in the Aurea Support Portal.

 

If you have any questions, please contact Aurea Support or your account manager for more information. We appreciate your continued partnership.

This is something we hear regularly in Aurea Support, and in most cases flow control is the cause.

Many customers have heard about flow control, but most are not fully aware of the details.

Some even consider it a product bug or limitation.

I understand that it can cause pain, but there is a reason for it, which is why I thought it worth explaining in more detail:

 

What is flow control?

In a messaging system you always have a producer and a consumer. Ideally, the consumer is at least as fast at processing messages as the producer. In reality, this is not always possible.

Reasons include spikes in load, outages on the consumer side, or simply a poorly designed architecture.

CX Messenger (Sonic) does of course provide some buffers, but once these are full, message processing is impacted.

By default, the producer is simply blocked until space is available on the broker side to take the next message.

This is what we call flow control.
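To make this concrete, here is a minimal sketch of a producer running into flow control. The broker URL, credentials, and queue name are made-up examples, and the Sonic client ConnectionFactory constructor shown may differ in your client version (if in doubt, look the factory up via JNDI instead):

    import javax.jms.*;

    // Minimal sketch: a producer that keeps sending until the queue's Maximum Size
    // is reached. At that point send() simply stops returning -- that is flow control.
    // Assumed/hypothetical values: broker at tcp://localhost:2506, queue "SampleQ1",
    // default Administrator credentials.
    public class FlowControlDemo {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory factory =
                new progress.message.jclient.ConnectionFactory(
                    "tcp://localhost:2506", "Administrator", "Administrator");

            Connection connection = factory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer =
                session.createProducer(session.createQueue("SampleQ1"));

            BytesMessage msg = session.createBytesMessage();
            msg.writeBytes(new byte[10 * 1024]); // 10 KB payload per message

            for (int i = 0; i < 1_000_000; i++) {
                // With no (or slow) consumers on SampleQ1, this call will eventually
                // block: the client is waiting for the broker to free buffer space.
                producer.send(msg);
                System.out.println("sent " + i);
            }
            connection.close();
        }
    }

When the output stops, the producer thread is sitting inside the send call, waiting for the broker, exactly as described above.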

 

Let’s get into a bit more detail on this for each JMS messaging domain:

 

Point-to-Point (Queues)

Recap of PTP basics: n producers and n consumers are allowed per queue; only one of the queue's consumers will get a given message.

 

If the consumers are not fast enough (or disconnected) the broker will queue the messages per queue. Each queue has two configuration options, Save Threshold and Maximum Size.

The Maximum Size defines how many kilobytes of message data the queue can hold.

The Save Threshold defines how much of this data is kept in memory; the rest goes to disk.

Once the Maximum Size is reached, flow control kicks in.
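For illustration only (the numbers are made up): with a Maximum Size of 10,000 KB and an average message size of 2 KB, a queue can absorb roughly 5,000 messages of backlog before producers are flow controlled; with a Save Threshold of 1,500 KB, roughly the first 1,500 KB of that backlog stays in memory and the remainder is written to disk.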

 

Publish/Subscribe (Topics)

Recap of PubSub basics: n producers and n consumers are allowed per topic; each consumer of the topic will get the message.

 

If the consumers are not fast enough, the broker will queue the messages per subscriber. Each subscriber has buffers, which are configured globally in the broker properties.

Once the buffer of one subscriber (of a given topic/pattern) is full, flow control kicks in on that particular topic. This means the slowest subscriber defines/limits the message delivery rate for all subscribers.

To be clear: at that point, all the other subscribers on that topic no longer get messages, and the publisher is blocked.

(In case you were wondering, yes it is key to detect this guy to prevent flow control. We will get there soon.)
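Here is a rough sketch of that effect, again with made-up connection details: two non-durable subscribers on the same topic, one of which is deliberately slow. Run a publisher against demo.prices (for example, the producer sketch above pointed at a topic instead of a queue) and watch both subscribers stall once the slow one's buffer fills up.

    import javax.jms.*;

    // Minimal sketch of the "slowest subscriber limits everyone" behavior.
    // Broker URL, credentials, and topic name are made-up examples.
    public class SlowSubscriberDemo {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory =
                new progress.message.jclient.ConnectionFactory(
                    "tcp://localhost:2506", "Administrator", "Administrator");
            Connection connection = factory.createConnection();

            // Fast subscriber: handles every message immediately.
            Session fastSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            fastSession.createConsumer(fastSession.createTopic("demo.prices"))
                       .setMessageListener(m -> System.out.println("fast subscriber got a message"));

            // Slow subscriber: needs one second per message. Once its per-subscriber
            // buffer is full, publishers on demo.prices are flow controlled and the
            // fast subscriber above stops receiving new messages as well.
            Session slowSession = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            slowSession.createConsumer(slowSession.createTopic("demo.prices"))
                       .setMessageListener(m -> {
                           try {
                               Thread.sleep(1000);
                           } catch (InterruptedException e) {
                               Thread.currentThread().interrupt();
                           }
                       });

            connection.start();
            Thread.sleep(Long.MAX_VALUE); // keep both subscribers alive
        }
    }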

 

Can I avoid flow control?

Now that you know there are limiting factors, your questions might be:

 

     "How to avoid such situations?

     Or how can flow control be avoided at all?

     But is it really a bad thing?

     Does it even help in your architecture?"

 

The CX Messenger JMS API allows you to disable flow control, which will then cause an exception on the message producer side at the point where flow control would otherwise kick in.

In most architectures, though, you would not want to do that, but rather get to the bottom of the cause and act accordingly.
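If you do disable it, the producer has to handle that failure itself. Below is a rough sketch of what that looks like with the plain JMS API; the Sonic-specific setting that disables flow control is described in the product documentation and not reproduced here, and the retry/backoff numbers are arbitrary illustrations, not recommendations.

    import javax.jms.*;

    // Sketch of a producer coping with flow control disabled: instead of the
    // send() call blocking, it fails, and the application decides what to do.
    public class NonBlockingProducer {
        // Retries a send with exponential backoff; numbers are illustrative only.
        static void sendWithRetry(MessageProducer producer, Message message)
                throws InterruptedException {
            long backoffMs = 100;
            while (true) {
                try {
                    producer.send(message);
                    return; // broker accepted the message
                } catch (JMSException brokerBusyOrFailed) {
                    // With flow control disabled, a "broker buffers full" condition
                    // surfaces here. Waiting and retrying is one option; shedding
                    // load or alerting might be more appropriate in a real system.
                    Thread.sleep(backoffMs);
                    backoffMs = Math.min(backoffMs * 2, 5_000);
                }
            }
        }
    }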

 

So how can you avoid or reduce flow control? As you might guess, there is no simple answer. It all depends on the cause and is very specific to each implementation.

There are buffers, and there is the pace at which messages are produced and consumed. These are the key factors you have to look at.

 

For example:

  • For PTP you can increase the number of consumers to ensure messages are consumed faster (see the sketch after this list). A larger maximum queue size will help with spikes in messaging load, but will increase latency (messages might stay longer in the queue).
  • Similar to PTP, you can increase the buffers for PubSub, but again there is a latency impact and also a memory impact. In addition, there is a magic switch called “Flow To Disk” which allows you to use the whole hard disk as a buffer.
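A minimal sketch of the first bullet, competing consumers draining one queue in parallel (connection details, queue name, and consumer count are made-up examples):

    import javax.jms.*;

    // Sketch of competing consumers on one queue. Each session/consumer pair
    // processes messages independently, so the queue drains faster and flow
    // control is less likely to kick in. PTP semantics still apply: each
    // message is delivered to only one of the consumers.
    public class CompetingConsumers {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory factory =
                new progress.message.jclient.ConnectionFactory(
                    "tcp://localhost:2506", "Administrator", "Administrator");
            Connection connection = factory.createConnection();

            int consumerCount = 4; // tune this based on load tests, not guesswork
            for (int i = 0; i < consumerCount; i++) {
                // One session per consumer so messages are dispatched and
                // acknowledged concurrently.
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer consumer =
                    session.createConsumer(session.createQueue("SampleQ1"));
                final int id = i;
                consumer.setMessageListener(message -> {
                    // ... do the actual processing work here ...
                    System.out.println("consumer " + id + " processed a message");
                });
            }
            connection.start();
            // The connection's client threads keep the JVM alive; block here
            // or close the connection on shutdown in a real application.
        }
    }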

 

     “So I just enable that magic switch and all good, great!”

 

Wrong, let me stop your enthusiasm here for a moment.

I personally think Flow To Disk is the worst feature we have.

You wonder why?

The feature itself is great, but the way it is often used causes issues. It simply hides bad architecture and bad configuration. People tend to enable it by default and do not want to invest in proper load tests and architectural/configuration changes. Then, once everything is stuck (e.g. the disk is full or the memory reference buffer is full), Aurea Support is pulled in and is supposed to fix it.

At this stage though most projects are already live and cannot easily make major changes.

Hopefully this blog post helps you to not make the same mistake.

 

FlowToDisk notification:

 

Back to PubSub: another option to avoid or reduce flow control is to use shared/grouped subscribers.

This ensures that each message is consumed only once per shared group.

This allows messages for a group to be processed in parallel while each message is still processed only once.
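CX Messenger's grouped subscriptions are declared through the product's own API and tooling, which I won't reproduce from memory here; for readers who know the standard JMS 2.0 API, the same idea is expressed there as shared consumers, so a purely conceptual sketch looks like this:

    import javax.jms.*;

    // Conceptual sketch only (standard JMS 2.0 shared-consumer API), used here
    // to illustrate the delivery semantics of grouped subscribers. Refer to the
    // CX Messenger documentation for the product's own way of declaring them.
    public class SharedGroupSketch {
        // Call once per group member, passing any JMS 2.0 ConnectionFactory.
        public static void joinGroup(ConnectionFactory factory) {
            JMSContext context = factory.createContext(); // keep open while consuming
            Topic topic = context.createTopic("demo.prices");

            // All members using the subscription name "pricing-workers" share one
            // subscription: each message is delivered to exactly one member, so
            // the group processes the topic in parallel without duplicating work.
            JMSConsumer member = context.createSharedConsumer(topic, "pricing-workers");
            member.setMessageListener(message ->
                System.out.println("handled by this group member"));
        }
    }

The point to take away is the delivery rule: one copy of each message per group, not per group member.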

 

How do I know what the cause of flow control is in my architecture?

I hope by now you are convinced that flow control is great and that Flow To Disk has to be used with caution.

So the question is: how do you even know that you run into flow control?

 

To detect whether your current deployment is stuck due to flow control, the quickest way is to get a Java thread dump using "jstack -l <pid>".

Look for threads blocked within a 'Job.join' call inside a send or publish. This indicates that the client is waiting to send a message to the broker, which is most commonly due to flow control.

 

For example:

 

"JMS Session Delivery Thread" (TID:0x101E7D30, sys_thread_t:0x3DDDBE8, state:CW, native ID:0x1F9C) prio=5

    at java.lang.Object.wait(Native Method)

    at java.lang.Object.wait(Object.java(Compiled Code))

    at progress.message.zclient.Job.join((Compiled Code))

    at progress.message.zclient.Publication.join((Compiled Code))

    at progress.message.zclient.Session.publishInternal((Compiled Code))

    at progress.message.zclient.Session.publishInternal((Compiled Code))

    at progress.message.zclient.Session.publish((Compiled Code))

    at progress.message.zclient.Session.publish((Compiled Code))

    at progress.message.jimpl.MessageProducer.internalSend((Compiled Code))

    …

 

 

From a proactive monitoring perspective there are several options that the product offers.

Which of the options is best for you depends on product usage.

 

You can set up flow-control-related broker notifications. The PubPause/SendPause notifications are the starting point.

There are additional notifications as well (e.g. interbroker flow control) which you should familiarize yourself with.

These notifications can generate a lot of noise, and operations teams rarely investigate them thoroughly.

Some advanced teams offload these to ElasticSearch for analytics. Of course, the better you have configured the system, the less noise there is.

These notifications allow you to identify which consumer is causing flow control. The details are available in the PubPause notification:

 

 

 

Note: PubPause/PubResume does not apply/work if you use a shared/grouped subscription!

     (SlowSubscriber and BackloggedSessionSkip are key here, see below)

 

For PubSub especially, flow control monitoring has more options. If you have enabled Flow To Disk, the disk usage of the PubSub store and the memory usage of Flow To Disk can be monitored.

There is another notification which helps to identify slow subscribers; especially (but not limited to) for shared subscribers it is super helpful: application.session.SlowSubscriber

 

 

If a message is stuck for a defined number of milliseconds at the front of the subscriber's buffer, a notification is generated.

This does not replace PubPause, but it allows you to detect stuck messages even if no flow control has kicked in (yet).

(For PTP, the queue.messages.TimeInQueue notification is the best equivalent. It allows you to get notified if a message has been pending in a queue for too long.)

 

Related to slow subscriber monitoring, there is another corner case where a shared subscriber might back up on one member of the group. Normally this would slow down the whole group, but it might not even cause flow control. In more recent releases this has been improved to favor the faster clients when distributing messages within a group.

 

A new notification application.session.BackloggedSessionSkip is raised to identify clients that are backing up.

 

 

Once you have identified the consumer(s) causing this, the next question is: why is the consumer application not fast enough?

 

The answer to that will be given in my next blog post.

 

 

 

References:

How can a thread dump be generated from a Sonic Container or Client?

Assessing Flow Control condition.

How to monitor subscribers to identify slow message consumption?

Slow shared subscriber impacts other subscribers in the group

Monitoring for flow control using the Sonic Management Console

What is Flow to Disk?

Under what condition a publisher might get flow controlled even though flow to disk is enabled?

Publisher flow controlled even though FlowToDisk is enabled.