Stale node ids should automatically be removed in Jira Data Center

Finally, finally … Atlassian has taken action on this one, five years after it was reported. As stated in the 8.10.x release notes, Jira 8.10.0 and later can remove stale nodes automatically:

We’ve introduced automation to your cluster maintenance. Now, you do not have to remove the offline nodes from your cluster manually, or move the nodes that report no heartbeat offline. After two days of a node reporting no heartbeat it’s automatically moved offline, and after two days of remaining in the offline state, it’s automatically removed from the cluster.
Additionally, we give you a full overview of what is going on in your cluster in your Advanced audit log. This way, if a node goes offline or is removed from cluster you know it in an instant.
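
To make that timeline concrete, here is a minimal Python sketch of the two-stage rule as I read it in the release notes. It is not Atlassian's implementation: the function name `node_status`, its parameters and the treatment of the exact two-day boundary are my own assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional

# Both windows are two days, as stated in the release notes quoted above.
NO_HEARTBEAT_LIMIT = timedelta(days=2)
OFFLINE_LIMIT = timedelta(days=2)


def node_status(now: datetime, last_heartbeat: datetime,
                offline_since: Optional[datetime] = None) -> str:
    """Return 'active', 'offline' or 'removed' for one node (hypothetical helper).

    Assumption: reaching exactly two days counts as crossing the limit.
    """
    if offline_since is not None:
        # The node is already offline; it is removed after two more days.
        if now - offline_since >= OFFLINE_LIMIT:
            return "removed"
        return "offline"
    # The node is still registered as active; check how old its heartbeat is.
    if now - last_heartbeat >= NO_HEARTBEAT_LIMIT:
        return "offline"
    return "active"
```

For example, `node_status(datetime(2020, 7, 10), datetime(2020, 7, 7))` returns `"offline"`, because the last heartbeat is three days old. So a node that dies silently takes about four days in total to disappear from the cluster.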

I think one of the reasons behind the change is the adoption of the Cloud. Compared to a traditional self-hosted data center, a Cloud environment is more dynamic and everything is on demand. Take ourselves as an example: we can build a brand new Jira Data Center stack from scratch within 30 minutes, which includes baking the AMI and creating the RDS instance, the EFS file system and the Jira cluster auto-scaling group. Jira, however, is a stateful application and prefers a static environment. I guess that's why it originally did not come with the capability to remove stale nodes automatically.

When we ran Jira versions older than 8.10.0 in AWS, we encountered exactly this issue of stale nodes piling up in the database. Instead of making in-place changes, we replace the Jira nodes with new ones for tasks like patching and upgrades, because we want to take advantage of Cloud-native techniques such as rolling updates. Every replacement leaves the record of the old node behind.

This issue can actually degrade performance, as the live nodes keep trying to sync their caches with nodes that are already gone (terminated). Our solution is an Ansible playbook that does the housekeeping for us. If you use a Jira version prior to 8.10.0, you may be interested in checking it out:

defaults/main.yml

---

atl_db_root_password: "{{ (lookup('aws_secret', '/jira/database/master', region='ap-southeast-2')| from_json).get('password') }}"
current_epoch_time_sec: "{{ ansible_date_time.epoch }}"
delete_threshold_hours: '24'

tasks/main.yml

---

- name: Remove offline Jira nodes from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternode WHERE node_state='OFFLINE'

- name: Remove offline Jira nodes from clusternodeheartbeat table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternodeheartbeat WHERE heartbeat_time=0

- name: Find Jira nodes whose heartbeat timestamp is older than delete_threshold_hours
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: SELECT node_id FROM clusternodeheartbeat WHERE %(current_epoch_time_ms)s - heartbeat_time > %(delete_threshold_ms)s
    named_args:
      current_epoch_time_ms: "{{ current_epoch_time_sec | int * 1000 }}"
      delete_threshold_ms: "{{ delete_threshold_hours | int * 3600 * 1000 }}"
  register: old_heartbeat_nodes

- name: Delete Jira nodes whose heartbeat timestamp is older than delete_threshold_hours from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM clusternode WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: old_heartbeat_nodes.rowcount > 0

- name: Delete Jira nodes whose heartbeat timestamp is older than delete_threshold_hours from clusternodeheartbeat table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM clusternodeheartbeat WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: old_heartbeat_nodes.rowcount > 0

- name: Delete Jira nodes whose heartbeat timestamp is older than delete_threshold_hours from replicatedindexoperation table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM replicatedindexoperation WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: old_heartbeat_nodes.rowcount > 0
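
If the millisecond arithmetic in the "Find" task looks opaque, the following Python sketch mirrors the same comparison outside of Ansible and SQL. It is only an illustration: the function name `stale_node_ids` and the sample node ids are made up, and, like the playbook, it assumes `heartbeat_time` holds epoch milliseconds.

```python
import time


def stale_node_ids(heartbeats, threshold_hours=24, now_sec=None):
    """Return the node ids whose heartbeat is older than the threshold.

    heartbeats maps node_id -> heartbeat_time in epoch milliseconds
    (hypothetical stand-in for the clusternodeheartbeat rows).
    """
    if now_sec is None:
        now_sec = int(time.time())
    # Same conversions as the named_args in the playbook task.
    now_ms = int(now_sec) * 1000
    threshold_ms = int(threshold_hours) * 3600 * 1000
    # Same strict comparison as the WHERE clause:
    # current_epoch_time_ms - heartbeat_time > delete_threshold_ms
    return [node_id for node_id, heartbeat_ms in heartbeats.items()
            if now_ms - heartbeat_ms > threshold_ms]


# Usage with made-up node ids: one heartbeat is 1 hour old, the other 25 hours.
now_sec = 1_600_000_000
sample = {
    "node-a": now_sec * 1000 - 1 * 3600 * 1000,
    "node-b": now_sec * 1000 - 25 * 3600 * 1000,
}
print(stale_node_ids(sample, threshold_hours=24, now_sec=now_sec))  # ['node-b']
```

Note that the comparison is strict, so a node whose heartbeat is exactly 24 hours old is kept until the next run.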


Published by Jackie Chen
