Back in July, I wrote about the node status saga.

One year and four months and a major version jump later, we finally have a self-service method to fix this problem.

Atlassian introduced an API for deleting old nodes!

This was fixed in 8.1.0, so this post is behind the times. To be fair (tongue), Atlassian addressed this issue and produced a solution within 5 years... that’s pretty good, right? (thumbs down)

I’m writing this post in November, and they put this fix out in July, so I suppose it must not have been terribly urgent.

Looking at this API, maybe there are useful features here! This is what it returns for a node:

[ 
   { 
      "nodeId":"\"ip-10-0-3-7\"",
      "state":"ACTIVE",
      "lastStateChangeTimestamp":1574648449656,
      "ip":"10.0.3.7",
      "cacheListenerPort":40001,
      "nodeBuildNumber":805001,
      "nodeVersion":"8.5.1",
      "alive":true
   }
]

Since we know when the node's state last changed (lastStateChangeTimestamp, in epoch milliseconds), we can compare that with the current time and use the difference to decide whether the node should be removed.
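To make the arithmetic concrete, here is a minimal sketch of the time diff check on its own. The minutes_since helper and the "now" value are made up for illustration (they are not part of the Jira API); only the timestamp comes from the sample response above.

```shell
#!/bin/bash
# minutes_since is a hypothetical helper, not part of the Jira API.
# Both arguments are epoch timestamps in milliseconds.
minutes_since() {
  local then_ms=$1 now_ms=$2
  echo $(( (now_ms - then_ms) / 60000 ))
}

last_change_ms=1574648449656                    # from the sample response
now_ms=$(( last_change_ms + 25 * 60 * 1000 ))   # made-up "now", 25 minutes later

age=$(minutes_since "$last_change_ms" "$now_ms")
if [ "$age" -gt 20 ]; then
  echo "offline for ${age} minutes: safe to remove"
else
  echo "offline for ${age} minutes: give it more time"
fi
```

In a real run, "now" would come from the system clock instead of a fixed offset.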

#!/bin/bash
# Deletes cluster nodes that went offline more than 20 minutes ago.
# Arguments: <username> <password> <base-url>

USRNM=${1}
PSSWD=${2}
BASEURL=${3}

# Current time in epoch milliseconds, to match lastStateChangeTimestamp.
currentTime=$(( $(date +%s) * 1000 ))

# Fetch the node list once and emit "<nodeId> <lastStateChangeTimestamp>" per stale node.
# .alive is a JSON boolean, so compare it with false rather than the string "false".
curl --user "${USRNM}:${PSSWD}" -s "${BASEURL}/rest/api/2/cluster/nodes" \
  | jq -r '.[] | select(.alive == false or .state == "OFFLINE") | "\(.nodeId) \(.lastStateChangeTimestamp)"' \
  | while read -r i lastStateChangeTimestamp
do
  # Minutes since this particular node last changed state.
  timeDiff=$(( (currentTime - lastStateChangeTimestamp) / 60000 ))
  echo "Time diff = ${timeDiff} minutes"
  if [ "$timeDiff" -gt 20 ]; then
    printf "\n\n Node ID: %s is being deleted because it is idle or invalid in the cluster, and changed to inactive more than 20 mins ago." "$i"
    curl -X "DELETE" --user "${USRNM}:${PSSWD}" -s "${BASEURL}/rest/api/2/cluster/node/${i}"
  else
    printf "\n\n Node ID: %s is inactive...let's give it a few more mins to re-join the cluster" "$i"
  fi
done

Our clients who use Click2Clone on Data Center would benefit from a new job that periodically checks for this and removes the stale nodes!