4 Replies Latest reply on Apr 3, 2017 11:35 AM by leona.campbell

    On-Premises Cluster Docs Wrong?

      The documentation and my peers describe webapp-to-webapp communication in the cluster and the concept of a "senior node" that runs a JMS broker and handles distributed locks.


      So imagine my surprise when, while troubleshooting a likely infrastructure issue that caused nodes to exit the cluster, I found no evidence of healthy clustered webapp nodes communicating with each other. According to netstat, they talk to the database, the cache nodes, search (on-prem), EAE and the document converters, but they have no active TCP connections between them and aren't even in each other's ARP tables.


      I'm not sure what port a JMS broker would listen on, but I can find no evidence of Jive opening any ports besides the httpd port and the Tomcat port. This holds true for both our existing dev environment and a newly created dev cluster.
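
      For anyone who wants to repeat the netstat check on their own nodes, here is a rough Python sketch of what I mean. It only scans netstat text for established TCP connections whose remote end is one of the other webapp nodes. The function name and the peer IP set are my own placeholders, and it assumes the column layout that Linux `netstat -tn` prints.

```python
def peer_connections(netstat_output, peer_ips):
    """Return (local, remote) pairs for ESTABLISHED TCP connections to peer nodes.

    netstat_output: text as printed by Linux `netstat -tn` (assumed layout:
                    Proto Recv-Q Send-Q Local-Address Foreign-Address State).
    peer_ips:       set of IP strings for the other webapp nodes (placeholder input).
    """
    hits = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a tcp/tcp6 row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, remote, state = fields[3], fields[4], fields[5]
        if state != "ESTABLISHED":
            continue
        # Drop the port; the part before the last colon is the address.
        remote_ip = remote.rsplit(":", 1)[0]
        # Java processes often show IPv4 peers as IPv4-mapped IPv6 addresses.
        if remote_ip.startswith("::ffff:"):
            remote_ip = remote_ip[len("::ffff:"):]
        if remote_ip in peer_ips:
            hits.append((local, remote))
    return hits
```

      Feed it the output of `netstat -tn` from one webapp node plus the IPs of the other webapp nodes. An empty list matches what I saw: no node-to-node connections at all.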


      We are still running Jive 7, but the cluster overview and configuration documentation for both Jive 7 and Jive 8 describes this webapp intercommunication and the "senior node" starting a JMS broker. In older conversations in this community I see screenshots of older versions of Jive showing local node contact info for each node, but that info is no longer in the admin panel, and I can't find it in the database. The only cluster info I can find in the database is the JiveCluster table, which has one row per node with the node ID, host name and last touch timestamp.





      Are the docs wrong, or am I just missing this JMS broker, senior node activities and inter-webapp communication somehow?


      From my observations, this seems to be how it really works:

      • Webapps update their touch times, usually once per minute (I have seen it happen more frequently if one of the nodes is late or missing)
      • If a webapp's last touch time is more than roughly two minutes old, the admin panel shows that cluster node as offline
        • If the webapp reconnects, I have seen it rejoin the cluster, and I have also seen it stay out. I'm not yet sure whether that comes down to timing or something else
      • If a webapp is in the cluster it talks to the cache node(s)
      • If a webapp is out of the cluster, it no longer talks to the cache node(s) but continues to serve requests, and its out-of-sync "near cache" may cause it to serve stale data, especially in a busy cluster
      • If webapp nodes are in the cluster and the cache node(s) go offline, the webapps continue to say they're in the cluster and do not seem to diverge, and they log complaints in sbs.log about the Voldemort server(s) being unreachable. I presume they skip their near caches in this case and go straight to db/disk, because they don't seem to serve stale data in this scenario
      • As far as I can tell, there is no inter-webapp communication. Each node decides on its own whether or not it's in the cluster based on its own last touch time. When it decides it is out, it apparently disconnects from the cache node(s) voluntarily, yet continues to serve requests from its near cache as if it were a singleton
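
      To make the model above concrete, here is a minimal Python sketch of the last-touch rule as I understand it from the outside. It is not Jive's code. The two-minute threshold is only my observed "roughly two minutes", and the row shape (node ID, host name, last touch) simply mirrors what I described for the JiveCluster table, without using real column names.

```python
from datetime import datetime, timedelta

# Assumption: taken from my observations, not from any documented Jive setting.
STALE_AFTER = timedelta(minutes=2)


def classify_nodes(rows, now, stale_after=STALE_AFTER):
    """Map node_id -> "online" or "offline" based only on last touch time.

    rows: iterable of (node_id, host, last_touch) tuples, where last_touch
          is a datetime (hypothetical shape, not the real table schema).
    now:  the datetime to compare against.
    """
    status = {}
    for node_id, host, last_touch in rows:
        age = now - last_touch
        # A node is treated as offline once its last touch is older than the threshold.
        status[node_id] = "offline" if age > stale_after else "online"
    return status
```

      The point of the sketch is that nothing in it needs the nodes to talk to each other: every node (and the admin panel) could reach the same verdict just by reading the shared table.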