
I ended my previous blog post with the question: “Why is the consumer application not fast enough?”

This is a very common question. The root cause of slowness or increased response time is often not very obvious.

For the end user, the system they interact with is slow. In reality, though, that system is not always the actual culprit.

 

Let me take you through an example and show what we can do to find the real root cause.

 

Thread dumps and logs are a good starting point if the issue is not intermittent (and the application is Java).

Simply take a series of thread dumps using “jstack -l <pid>” and pass them to the support team of the product in question.

For intermittent issues and/or heterogeneous environments (e.g. .NET + Java) the root cause analysis is more complex.

 

 

The issue can be load, data (size/content) or environment related. The more backend systems are involved in one request, the more complex it becomes.

 

For this blog post I created an example to illustrate this.

A customer tries to access a web app and is experiencing long wait times after triggering a request.

 

 

The app is an ASP.NET page that interacts with a REST API. The REST API itself sends requests to a JMS queue.

From that queue an integration engine (CX Messenger, Sonic ESB) picks up the message.

A business process flow is executed and then the response is sent back.

 

Sample Scenario

 

 

As you can see there are several systems involved. You might argue now that this is a constructed example.

I agree, but reality is often even more complex than this.

That complexity makes it quite hard to understand all the reasons why the final response time is so high.

 

Back to the example. The customer experience happens at the web page, no matter what is done behind the scenes.

The user reported a wait time of around ten seconds until the page is loaded. Network issues have already been ruled out as a potential cause by the operations teams.

Normally the investigation would now start with the logs of all involved components and teams.

Different technology stacks, different ops teams, potential collaboration issues: this quickly becomes time consuming.

 

 

This is where CX Monitor (Actional) can show one of its strengths.

It allows you to dig into past traffic/interactions using a date/time picker, or you can work proactively (preferred) using policies.

A policy defines a certain rule/condition/target. If the condition is met (e.g. response time > 3 s), an action can be triggered.

Typically this action is an alert inside CX Monitor (which can be forwarded to other monitoring systems/dashboards), but it can be anything you want.

 

Sample Policy:

 

Policy condition

 

Part of the alert is information about the interactions and (if requested) the data involved in the complete interaction flow.

The flow itself can be reviewed and drilled into. It shows the interactions between the systems and APIs.

 

 

Example Flow Map:

 

Flow map of interactions between systems at a given point in time

 

 

 

For “slowness” root cause analysis I personally prefer a different view of the same data: the sequence table, which is also part of the alert details in CX Monitor.

In our example it clearly shows us where the time is spent.

 

Example Sequence Table:

 

Sequence table showing time spent in each app

 

The ten seconds reported by the customer are confirmed by this. The sequence table shows that the time is spent in these components:

 

  1. 4 seconds in the aspx page before the REST call is made
  2. 2 seconds on the ESB business process on log file write
  3. 3 seconds on a JDBC/SQL call done by the app that is exposing the REST API
  4. 1 second again on this REST API app after the database call

 

Having this information promptly at hand can save a lot of time and give valuable insights into the monitored systems.

Monitoring can be done for a single application (e.g. Aurea CRM connector/interface/web services) or a complete heterogeneous environment, with a seamless view of interactions across technology stacks (.NET, Java, etc.).

You can even add monitoring capabilities to your custom solution app.

 

You have access to all this via Aurea Unlimited, which means no extra cost for your company.

 

Further reading:

Is it possible to monitor ACRM using CX Monitor (Actional)?

Blocked producers and slow message processing are something we hear about regularly in Aurea Support, and in most cases flow control is the cause.

Many customers have heard about flow control, but most are not fully aware of the details.

Some even consider it a product bug or limitation.

I understand that it can cause pain, but there is a reason for it, which is why I thought it worth explaining in more detail:

 

What is flow control?

In a messaging system you always have a producer and a consumer. Ideally the consumer is at least as fast at processing messages as the producer. In reality this is not always possible.

Reasons include spikes in load, outages on the consumer side, or simply a poorly designed architecture.

CX Messenger (Sonic) of course provides some buffers, but once these are full, message processing is impacted.

By default the producer is simply blocked until space is available on the broker side to take the next message.

This is what we call flow control.
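
To make this concrete, here is a minimal producer sketch in plain JMS (not Sonic-specific; the JNDI names "ConnectionFactory" and "SampleQ1" are placeholders I made up). With flow control active, the send() call in the loop simply blocks once the broker buffers are full and resumes when space becomes available again:

    import javax.jms.*;
    import javax.naming.InitialContext;

    public class BlockingProducerSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder JNDI lookups; a real CX Messenger client would use the
            // vendor connection factory and broker URL here.
            InitialContext ctx = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
            Destination dest = (Destination) ctx.lookup("SampleQ1");

            Connection con = cf.createConnection();
            Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(dest);

            for (int i = 0; i < 1_000_000; i++) {
                // With flow control in effect this call blocks right here until the
                // broker has room for the next message: the producer is "flow controlled".
                producer.send(session.createTextMessage("payload " + i));
            }
            con.close();
        }
    }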

 

Let’s get into a bit more detail on this per JMS messaging domain:

 

Point-to-Point (Queues)

Recap of PTP basics: n producers per queue, n consumers per queue allowed, only one of the consumers of the queue will get the message.

 

If the consumers are not fast enough (or disconnected) the broker will queue the messages per queue. Each queue has two configuration options, Save Threshold and Maximum Size.

The Maximum Size defines how many kilobytes of message data the queue can hold.

The Save Threshold defines how much of this data is kept in memory; the rest goes to disk.

Once the Maximum Size is reached, flow control kicks in. For example, with a Maximum Size of 1,000 KB and messages of roughly 10 KB each, about 100 messages can back up before producers are blocked.

 

Publish/Subscribe (Topics)

Recap of PubSub basics: n producers per topic, n consumers per topic allowed, each consumer of the topic will get the message.

 

If the consumers are not fast enough the broker will queue the messages per subscriber. Each subscriber has buffers which are configured globally in the broker properties.

Once the buffer (per subscriber) of one subscriber (of given topic/pattern) is full, flow control kicks in on the particular topic. This means the slowest subscriber defines/limits the message delivery rate to all subscribers.

To be clear: at that point all the other subscribers on that topic no longer get messages and the publisher is blocked.

(In case you were wondering: yes, detecting this slow subscriber is key to preventing flow control. We will get there soon.)

 

Can I avoid flow control?

Now that you know that there are limiting factors, questions might be:

 

     "How to avoid such situations?

     Or how can flow control be avoided at all?

     But is it really a bad thing?

     Does it even help in your architecture?"

 

The CX Messenger JMS API allows you to disable it, which will then cause an exception on the message producer side whenever flow control would otherwise kick in.

In most architectures though you would not want to do that, but rather get to the bottom of the cause and act accordingly.
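
How flow control is disabled is a vendor-specific CX Messenger setting that I will not quote from memory here; the point is what your producer code then has to deal with. A minimal sketch, assuming flow control has already been turned off for this connection, and using a hypothetical helper method:

    import javax.jms.*;

    // Sketch only: assumes flow control was disabled via the (vendor-specific)
    // CX Messenger connection/connection-factory settings, which are not shown here.
    public class NonBlockingSendSketch {

        static void sendOrHandleBacklog(Session session, MessageProducer producer, String payload) {
            try {
                producer.send(session.createTextMessage(payload));
            } catch (JMSException e) {
                // Instead of blocking, the send now fails when the broker cannot take
                // the message. The application must decide what to do: retry later,
                // buffer locally, shed load, raise an alert, ...
                System.err.println("Broker could not accept message: " + e.getMessage());
            }
        }
    }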

 

So how can you avoid/reduce flow control? As you might guess there is no simple answer to it. It all depends on the cause and is very specific to each implementation.

There are buffers and there is the pace at which messages are produced and consumed. These are the key factors that you have to look at.

 

e.g.

  • For PTP you can increase the number of consumers to ensure messages are consumed faster (see the sketch after this list). A larger maximum queue size will help with spikes in messaging load, but will increase latency (messages might stay longer in the queue).
  • Similar to PTP you can increase the buffers for PubSub, but again there is a latency impact and also a memory impact. In addition there is this magic switch called “Flow To Disk” which allows you to use the whole hard disk as a buffer.
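
As referenced in the first bullet, here is a minimal sketch of the “more consumers” idea in plain JMS (placeholder JNDI names, generic API rather than Sonic-specific classes). Several sessions compete on the same queue, so each message is still processed only once, but overall consumption speeds up:

    import javax.jms.*;
    import javax.naming.InitialContext;

    public class CompetingConsumersSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
            Queue queue = (Queue) ctx.lookup("SampleQ1");

            Connection con = cf.createConnection();
            int consumerCount = 4; // tune this based on proper load tests
            for (int i = 0; i < consumerCount; i++) {
                // One session per consumer, because a JMS session is single-threaded.
                Session session = con.createSession(false, Session.AUTO_ACKNOWLEDGE);
                MessageConsumer consumer = session.createConsumer(queue);
                consumer.setMessageListener(message -> {
                    // ... the actual (possibly slow) message processing goes here ...
                });
            }
            con.start(); // begin delivery to all listeners
        }
    }

The same effect can of course be achieved by running several consumer processes instead of several threads in one process.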

 

     “So I just enable that magic switch and all good, great!”

 

Wrong, let me stop your enthusiasm here for a moment.

I personally think Flow To Disk is the worst feature we have.

You wonder why?

The feature itself is great, but the way it is often used causes issues. It simply hides bad architecture and bad configuration. People tend to enable it by default and do not want to invest in proper load tests or architectural/configuration changes. Then, once everything is stuck (e.g. the disk is full or the in-memory reference buffer is full), Aurea Support is pulled in and is supposed to fix it.

At this stage though most projects are already live and cannot easily make major changes.

Hopefully this blog post helps you to not make the same mistake.

 

FlowToDisk notification:

 

Back to PubSub: Another option to avoid/reduce flow control is to use shared/grouped subscribers.

It ensures that each message is consumed only once per shared group.

This gives you parallel processing of messages within the group while each message is still handled only once (see the sketch below).
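
CX Messenger configures grouped/shared subscriptions through its own, vendor-specific API and subscription naming, which I will not reproduce from memory here. Purely to illustrate the concept, this is what the equivalent looks like with the standard JMS 2.0 shared-consumer API (placeholder names):

    import javax.jms.*;
    import javax.naming.InitialContext;

    // Conceptual sketch: JMS 2.0 shared consumers as an illustration of a consumer
    // group on a topic. Each message published to the topic is delivered to only
    // ONE member of the "orders-group" subscription, so members can work in parallel.
    public class SharedSubscriberSketch {
        public static void main(String[] args) throws Exception {
            InitialContext ctx = new InitialContext();
            ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
            Topic topic = (Topic) ctx.lookup("SampleTopic");

            try (JMSContext jms = cf.createContext()) {
                JMSConsumer member1 = jms.createSharedConsumer(topic, "orders-group");
                JMSConsumer member2 = jms.createSharedConsumer(topic, "orders-group");
                member1.setMessageListener(m -> { /* process in parallel */ });
                member2.setMessageListener(m -> { /* process in parallel */ });
                Thread.sleep(60_000); // keep the demo alive long enough to receive messages
            }
        }
    }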

 

How do I know what the cause of flow control is in my architecture?

I hope by now you are convinced that flow control is great and Flow to Disk has to be used with caution.

So the question is: how do you even know that you run into flow control?

 

To detect whether your current deployment is stuck due to flow control, the quickest way is to take a Java thread dump using "jstack -l <pid>".

Look for threads blocked within a 'Job.join' call inside a send or publish. This indicates that the client is waiting to send a message to the broker, which is most commonly due to flow control.

 

For example:

 

"JMS Session Delivery Thread" (TID:0x101E7D30, sys_thread_t:0x3DDDBE8, state:CW, native ID:0x1F9C) prio=5

    at java.lang.Object.wait(Native Method)

    at java.lang.Object.wait(Object.java(Compiled Code))

    at progress.message.zclient.Job.join((Compiled Code))

    at progress.message.zclient.Publication.join((Compiled Code))

    at progress.message.zclient.Session.publishInternal((Compiled Code))

    at progress.message.zclient.Session.publishInternal((Compiled Code))

    at progress.message.zclient.Session.publish((Compiled Code))

    at progress.message.zclient.Session.publish((Compiled Code))

    at progress.message.jimpl.MessageProducer.internalSend((Compiled Code))

    …
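
If jstack is not available on the machine (no JDK installed, restricted access), the same information can be collected from inside the JVM with the standard java.lang.management API. This is stock JDK functionality, not a Sonic feature; a small sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Sketch: dump all thread stacks of the current JVM, e.g. from a diagnostic
    // servlet or a scheduled task inside the client application.
    public class InProcessThreadDump {
        public static void dump() {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
                System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
                // Threads sitting in progress.message.zclient.Job.join during a
                // send/publish are the flow-controlled candidates to look for.
            }
        }
    }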

 

 

From a proactive monitoring perspective there are several options that the product offers.

Which of the options is best for you depends on product usage.

 

You can set up flow-control-related broker notifications. The PubPause/SendPause notifications are the starting point.

There are additional notifications as well (e.g. interbroker flow control) with which you should familiarize yourself.

These notifications can generate a lot of noise, and operations teams rarely investigate them thoroughly.

Some advanced teams offload them to ElasticSearch for analytics. Of course, the better the system is configured, the less noise there is.

These notifications allow you to identify which consumer is causing flow control. The details are available in the PubPause notification:

 

 

 

Note: PubPause/PubResume does not apply/work if you use a shared/group subscription!

     (SlowSubscriber and BackloggedSessionSkip are key here, see below)

 

Especially for PubSub, flow control monitoring offers more options. If you have enabled Flow To Disk, both the disk usage of the PubSub store and the memory usage of Flow To Disk can be monitored.

There is another notification that helps identify slow subscribers, and it is especially (but not only) helpful for shared subscribers: application.session.SlowSubscriber

 

 

If a message is stuck for a defined number of milliseconds at the front of the subscriber's buffer, a notification is generated.

This does not replace PubPause but it allows you to detect stuck messages even if no flow control kicked in (yet).

(For PTP, the queue.messages.TimeInQueue notification is the closest equivalent. It lets you get notified if a message has been pending in a queue for too long.)

 

Related to the slow subscriber monitoring there is another corner case where a shared subscriber might back up on one member of the group. Normally this would cause the whole group to be slowed down, but might not even cause flow control. In more recent releases this has been improved to favor the faster clients while distributing messages in a group.

 

A new notification application.session.BackloggedSessionSkip is raised to identify clients that are backing up.

 

 

Once you have identified the consumer(s) causing this, the next question is: Why is the consumer application not fast enough?

 

The answer to that will be given in my next blog post.

 

 

 

References:

How can a thread dump be generated from a Sonic Container or Client?

Assessing Flow Control condition.

How to monitor subscribers to identify slow message consumption?

Slow shared subscriber impacts other subscribers in the group

Monitoring for flow control using the Sonic Management Console

What is Flow to Disk?

Under what condition a publisher might get flow controlled even though flow to disk is enabled?

Publisher flow controlled even though FlowToDisk is enabled.