With Jira Data Center and auto scaling becoming the most reliable way to run Jira in large organizations, node status and cluster monitoring have become an integral DevOps engineering duty. If you have been running Jira Data Center for any amount of time, you’ve likely seen your system info page list cluster nodes that no longer exist.

These stale entries are of no use to anyone. They are old nodes that have been replaced.

…and if you use a staging instance (and who in their right mind wouldn’t?), you might see nodes from your production cluster. Cue the unnecessary panic.

Furthermore, your logs are likely being spammed with messages similar to this:

Error: java.rmi.UnknownHostException: Unknown host: atlassian-jira-dev-zjml; nested exception is:
        java.net.UnknownHostException: atlassian-jira-dev-zjml

So, how do we clean this up? Surprisingly, as of June 28, 2018, Atlassian has not addressed this. If you came here for a hacky workaround…well, here you go! (Vote for this issue if you’d rather not depend on a hacky workaround.)

WARNING: Always back up your database before running any variation of update, insert, or delete queries against it. It is also best practice to test something like this in a staging instance first. If you need help creating a staging instance, let me know!

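This operation does not require the services to be restarted. In fact, stopping the service gracefully will put your valid nodes into an “OFFLINE” state (as it should), so run the cleanup while the nodes you want to keep are up; otherwise the queries below would remove them as well.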

These offline/inactive nodes are stored in the database tables “clusternode” and “clusternodeheartbeat”.

Identify the nodes that are offline:

select node_id from clusternode where node_state = 'OFFLINE';
 
         node_id
-------------------------
 atlassian-jira-dev-c6s5
 atlassian-jira-dev-qvb8
 atlassian-jira-dev-c3gw
 atlassian-jira-dev-p342
(4 rows)

Remove the nodes:

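-- Remove heartbeat rows first, while the clusternode rows the subquery matches against still exist.
delete from clusternodeheartbeat where node_id in (select node_id from clusternode where node_state = 'OFFLINE');
-- Then remove the offline node rows themselves.
delete from clusternode where node_state = 'OFFLINE';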
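The order of the two deletes matters, because the heartbeat delete looks up its node IDs in clusternode. Below is a minimal sketch of running both inside a single transaction, so that a failure partway through leaves both tables untouched. It uses Python's built-in sqlite3 module with an in-memory database purely so the example is self-contained. The table names and the node_id and node_state columns come from the queries above. The function names, the heartbeat_time column, the 'ACTIVE' state value, and the sample rows are assumptions for illustration. Against a real Jira database you would connect with your own database's driver instead.

```python
import sqlite3


def purge_offline_nodes(conn):
    """Delete OFFLINE nodes and their heartbeats in one transaction.

    Returns the sorted list of node IDs that were removed.
    """
    # "with conn" commits on success and rolls back if a statement raises.
    with conn:
        offline = [
            row[0]
            for row in conn.execute(
                "select node_id from clusternode where node_state = 'OFFLINE'"
            )
        ]
        # Heartbeats first: the subquery needs the clusternode rows to exist.
        conn.execute(
            "delete from clusternodeheartbeat where node_id in "
            "(select node_id from clusternode where node_state = 'OFFLINE')"
        )
        conn.execute("delete from clusternode where node_state = 'OFFLINE'")
    return sorted(offline)


def demo():
    """Run the purge against a hypothetical stand-in schema (not Jira's real DDL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table clusternode (node_id text primary key, node_state text)"
    )
    conn.execute(
        "create table clusternodeheartbeat "
        "(node_id text primary key, heartbeat_time integer)"
    )
    # 'ACTIVE' and the node ID ending in -aaaa are assumed sample values.
    nodes = [
        ("atlassian-jira-dev-c6s5", "OFFLINE"),
        ("atlassian-jira-dev-qvb8", "OFFLINE"),
        ("atlassian-jira-dev-aaaa", "ACTIVE"),
    ]
    conn.executemany("insert into clusternode values (?, ?)", nodes)
    conn.executemany(
        "insert into clusternodeheartbeat values (?, 0)",
        [(node_id,) for node_id, _ in nodes],
    )
    conn.commit()

    removed = purge_offline_nodes(conn)
    remaining_nodes = [
        r[0] for r in conn.execute("select node_id from clusternode order by node_id")
    ]
    remaining_heartbeats = [
        r[0]
        for r in conn.execute(
            "select node_id from clusternodeheartbeat order by node_id"
        )
    ]
    conn.close()
    return removed, remaining_nodes, remaining_heartbeats


if __name__ == "__main__":
    print(demo())
```

Returning the removed node IDs gives you a record of what was deleted, which is handy when cleaning up the matching files in the shared home.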

There is also a file system component in the shared home directory. Delete the files for the invalid nodes from the “/<SHARED_HOME>/node-status” directory.
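As a rough sketch of that file system step, the helper below lists (and optionally deletes) entries in the node-status directory that do not belong to a node you want to keep. It assumes each entry is a regular file named exactly after its node ID, which you should verify against your own shared home before deleting anything. The function name, its parameters, and the dry-run default are illustrative choices, not part of Jira.

```python
from pathlib import Path


def remove_stale_node_status(node_status_dir, valid_node_ids, dry_run=True):
    """Return the names of files in node_status_dir not in valid_node_ids.

    Assumption: each file is named exactly after its node ID.
    Files are only deleted when dry_run is False.
    """
    valid = set(valid_node_ids)
    stale = []
    for entry in sorted(Path(node_status_dir).iterdir()):
        # Skip subdirectories and anything belonging to a node we keep.
        if entry.is_file() and entry.name not in valid:
            stale.append(entry.name)
            if not dry_run:
                entry.unlink()
    return stale
```

Calling it with the default dry_run=True first shows what would be removed. Passing the node IDs that remain in clusternode as valid_node_ids keeps the database and the shared home consistent.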

Once this is done, refresh your system info page. The invalid nodes should be gone, and the error messages should stop appearing in the logs.
