Stale node ids should automatically be removed in Jira Data Center


Finally, finally … Atlassian has taken action on this one, five years after it was reported. As stated in the 8.10.x release notes, Jira 8.10.0 and later can now remove stale nodes automatically:

We’ve introduced automation to your cluster maintenance. Now, you do not have to remove the offline nodes from your cluster manually, or move the nodes that report no heartbeat offline. After two days of a node reporting no heartbeat it’s automatically moved offline, and after two days of remaining in the offline state, it’s automatically removed from the cluster.

Additionally, we give you a full overview of what is going on in your cluster in your Advanced audit log. This way, if a node goes offline or is removed from cluster you know it in an instant.

I think one of the reasons behind the change is the adoption of the cloud. Compared to a traditional self-hosted data center, a cloud environment is more dynamic and everything is on demand. Take ourselves as an example: we can build a brand new Jira Data Center stack from scratch within 30 minutes. That includes baking the AMI and creating the RDS instance, the EFS file system, and the Jira cluster auto-scaling group. Jira, however, is a stateful application and prefers a static environment. I guess that's why it did not originally come with the ability to remove stale nodes automatically.

When we ran Jira versions older than 8.10.0 in AWS, we ran into exactly this issue of stale nodes lingering in the database. To take advantage of cloud-native techniques such as rolling updates, we replace the Jira nodes with new ones instead of changing them in place when doing tasks like patching and upgrades, so the records of the terminated nodes are left behind.

This issue can actually degrade performance, as the live nodes keep trying to sync their caches with nodes that are already gone (terminated). Our solution is an Ansible playbook that does the housekeeping for us. If you use a Jira version prior to 8.10.0, you may be interested in checking it out:

defaults/main.yml

---

atl_db_root_password: "{{ (lookup('aws_secret', '/jira/database/master', region='ap-southeast-2')| from_json).get('password') }}"
current_epoch_time_sec: "{{ ansible_date_time.epoch }}"
delete_threshold_hours: '24'

tasks/main.yml

---

- name: Remove offline jira nodes from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternode WHERE node_state='OFFLINE'

- name: Remove offline jira nodes from clusternodeheartbeat table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternodeheartbeat WHERE heartbeat_time=0

- name: Find jira nodes whose heartbeat timestamp is older than delete_threshold_hours
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: SELECT node_id FROM clusternodeheartbeat WHERE %(current_epoch_time_ms)s - heartbeat_time > %(delete_threshold_ms)s
    named_args:
      current_epoch_time_ms: "{{ current_epoch_time_sec | int * 1000 }}"
      delete_threshold_ms: "{{ delete_threshold_hours | int * 3600 * 1000 }}"
  register: old_heartbeat_nodes

- name: Delete jira nodes whose heartbeat timestamp is older than delete_threshold_hours from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM clusternode WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: old_heartbeat_nodes.rowcount > 0

- name: Delete jira nodes whose heartbeat timestamp is older than delete_threshold_hours from clusternodeheartbeat table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM clusternodeheartbeat WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: old_heartbeat_nodes.rowcount > 0

- name: Delete jira nodes whose heartbeat timestamp is older than delete_threshold_hours from replicatedindexoperation table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM replicatedindexoperation WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: old_heartbeat_nodes.rowcount > 0
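
For completeness, the role above could be wired into a small playbook along the following lines. This is only a sketch under assumptions: the role name jira_stale_nodes, the file name cleanup.yml, the host name and the connection values are placeholders, not values from our setup. It also assumes the control node can reach the database and has the Python libraries that postgresql_query (psycopg2) and the aws_secret lookup (boto3) depend on. Fact gathering is left on because ansible_date_time.epoch is only populated from gathered facts.

```yaml
---
# cleanup.yml: hypothetical playbook; role name and all values below are placeholders
- name: Remove stale Jira Data Center nodes from the database
  hosts: localhost
  connection: local
  gather_facts: true              # required for ansible_date_time.epoch
  vars:
    atl_db_host: jira-db.example.internal   # placeholder RDS endpoint
    atl_db_port: 5432
    atl_db_root_user: postgres               # placeholder database user
    atl_jdbc_db_name: jira                   # placeholder database name
    delete_threshold_hours: '48'             # optional override of the role default (24)
  roles:
    - jira_stale_nodes            # hypothetical name for the role shown above
```

It would be run with ansible-playbook cleanup.yml, for example after each rolling update. With the default of 24 hours, the threshold passed to the SELECT query works out to 24 * 3600 * 1000 = 86,400,000 ms.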

