Run Jira on AWS – Challenges vs Solutions


Overview

With growing demand and the advantages of cloud technologies, moving Jira to the cloud is becoming inevitable, whether as self-hosted Jira in the cloud (IaaS) or as Atlassian Jira Cloud (SaaS). Our current project is to migrate Jira to AWS, so this post focuses on the challenges of running Jira in AWS and the solutions we found.

Here is the typical Atlassian Data Center application infrastructure (Confluence, Jira, Crowd, Bitbucket).


When mapping it to AWS, we found a matching AWS service for each layer of the architecture. With the following services, every layer is highly scalable and highly available.

  • Load Balancer: Application Load Balancer (ALB)
  • Cluster Nodes: Auto-Scaling Group (ASG)
  • Shared Database: Relational Database Service (RDS)
  • Shared File System: Elastic File System (EFS)

With an Auto-Scaling Group (ASG), all Jira nodes are short-lived: running nodes are replaced by new ones whenever we do a cold or hot deployment, AWS decommissions hardware, or an availability zone is lost. Jira, however, was designed for a static environment in which nodes are expected to be long-lived. Most of the challenges come from this conflict.

Background Knowledge

Let’s start with how Jira tracks nodes in the cluster:

Jira records node information in the clusternode table, node heartbeats in the clusternodeheartbeat table, each node's recent index operations in the replicatedindexoperation table, and each node's latest processed operation in the nodeindexcounter table.

With these four tables in mind, we can look at how Jira manages index files:

In a multi-node Jira Data Center cluster, each node keeps its index files locally. When a change is made on one node (e.g. a new issue is created), that node indexes the change and adds an entry to the replicatedindexoperation table so that other nodes can replay the operation. The entry is removed later (after 2 days by default). This means that at any given point in time the index files may be inconsistent across nodes. This is known as eventual consistency.
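The replay mechanism can be pictured with a toy model. The sketch below is not Jira's implementation: `ReplicatedOp`, `Node`, `change` and `replay` are hypothetical names, the shared list stands in for the replicatedindexoperation table, and a single counter per node is a simplification of the nodeindexcounter table. It only illustrates why nodes can briefly disagree and then converge.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReplicatedOp:
    op_id: int        # increasing id, like a database sequence
    node_id: str      # node that made the original index change
    issue_key: str    # issue whose index entry changed


@dataclass
class Node:
    node_id: str
    index: set = field(default_factory=set)   # stand-in for local index files
    last_seen: int = 0                         # highest op_id already processed

    def change(self, log, issue_key):
        """Index a local change and publish it for the other nodes."""
        self.index.add(issue_key)
        log.append(ReplicatedOp(len(log) + 1, self.node_id, issue_key))

    def replay(self, log):
        """Apply operations from other nodes that were not processed yet."""
        for op in log:
            if op.op_id > self.last_seen:
                if op.node_id != self.node_id:
                    self.index.add(op.issue_key)
                self.last_seen = op.op_id


log = []
a, b = Node("node-a"), Node("node-b")
a.change(log, "PROJ-1")
inconsistent = a.index != b.index   # node-b has not replayed the change yet
b.replay(log)
consistent = a.index == b.index     # the indexes converge after the replay
```

Between `a.change(...)` and `b.replay(...)` the two nodes hold different indexes, which is the window of inconsistency described above.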

When a new Jira node joins the cluster, what happens depends on whether it is the first node. If it is the first node and has no valid index files locally, it needs a re-index or an index restore. If it is not the first node, it queries the database to find out which node has the latest operation, then asks that node for a copy of its index files. It does not validate the health of the index files on the responding node. The responding node takes a snapshot of its index files and copies the snapshot file to the shared home folder so that the new node can restore from it. If the new node is unable to restore the index files from another node (e.g. due to a timeout or corrupted index files), it neither restores from an index backup nor triggers a re-index automatically.

When an old Jira node re-joins the cluster, it compares its local index files with the latest operation recorded in the database. If the delta is less than two days, it replays the operations kept in the replicatedindexoperation table. If the delta is more than two days, it asks other nodes for a copy of the latest index files.
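The re-join rule can be written down as a small decision function. This is a minimal sketch under stated assumptions rather than Jira's code: `rejoin_strategy` and `RETENTION` are hypothetical names, the two-day value is the default retention described above, and a delta of exactly two days is treated as "too old" because the text does not say which way that case goes.

```python
from datetime import datetime, timedelta

# Assumption: the default two-day retention of replicatedindexoperation rows.
RETENTION = timedelta(days=2)


def rejoin_strategy(local_index_time, latest_op_time, retention=RETENTION):
    """Return how a re-joining node would catch up, following the rule above."""
    delta = latest_op_time - local_index_time
    if delta < retention:
        return "replay"      # replay rows kept in replicatedindexoperation
    return "snapshot"        # ask another node for a full index copy


now = datetime(2020, 9, 1, 12, 0)
short_gap = rejoin_strategy(now - timedelta(hours=3), now)
long_gap = rejoin_strategy(now - timedelta(days=3), now)
```

Here `short_gap` is `"replay"` and `long_gap` is `"snapshot"`. A restored index that is only a few hours old falls in the cheap "replay" branch.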

It is also worth mentioning that a Jira node pushes its index files to the other nodes after it has completed a re-index. Besides the Jira index files, some plugins, e.g. Zephyr, manage their own index files.

NOTE: We are currently running Jira 8.5.8. Newer versions of Jira (since 8.12) include DBR (Document Based Replication), which mitigates the impact of apps on indexing time and prevents index inconsistencies in a Jira Data Center cluster.

Challenges vs Solutions

Challenge 1: Before version 8.10.0, Jira does not remove stale nodes automatically. As Jira nodes come and go in the auto-scaling group, stale nodes accumulate in the database over time, and the active nodes keep trying to replicate to nodes that no longer exist. This wastes resources and can degrade Jira performance.

Solution: Use an Ansible playbook to remove the stale nodes automatically. The delete_threshold_hours variable is configurable; we currently set it to 1 hour. Sample playbook:

- name: Check if this is a new cluster that has not been initialized
  stat:
    path: "{{ atl_product_home_shared }}/export/indexsnapshots"
  register: snapshotfolder

- name: Remove offline jira nodes from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternode WHERE node_state='OFFLINE'
  when: snapshotfolder.stat.exists

- name: Remove no heartbeat record jira node from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternode AS HB WHERE HB.node_id NOT IN (SELECT node_id from clusternodeheartbeat)
  when: snapshotfolder.stat.exists

- name: Remove offline jira nodes from clusternodeheartbeat table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM clusternodeheartbeat WHERE heartbeat_time=0
  when: snapshotfolder.stat.exists

- name: Find jira nodes which heartbeat timestamp are older than delete_threshold_hours
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: SELECT node_id FROM clusternodeheartbeat WHERE %(current_epoch_time_ms)s - heartbeat_time > %(delete_threshold_ms)s
    named_args:
      current_epoch_time_ms: "{{ current_epoch_time_sec | int * 1000 }}"
      delete_threshold_ms: "{{ delete_threshold_hours | int * 3600 * 1000 }}"
  register: old_heartbeat_nodes
  when: snapshotfolder.stat.exists

- name: Delete jira nodes which heartbeat timestamp are older than delete_threshold_hours from clusternode table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM clusternode WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: snapshotfolder.stat.exists and old_heartbeat_nodes.rowcount > 0

- name: Delete jira nodes which heartbeat timestamp are older than delete_threshold_hours from clusternodeheartbeat table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM clusternodeheartbeat WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: snapshotfolder.stat.exists and old_heartbeat_nodes.rowcount > 0

- name: Delete jira nodes which heartbeat timestamp are older than delete_threshold_hours from replicatedindexoperation table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: "DELETE FROM replicatedindexoperation WHERE node_id = '{{ item.node_id }}'"
  with_items: "{{ old_heartbeat_nodes.query_result }}"
  when: snapshotfolder.stat.exists and old_heartbeat_nodes.rowcount > 0
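To make the threshold arithmetic in the playbook easier to check, here is a minimal Python sketch of the same comparison. It is an illustration only: `stale_node_ids` is a hypothetical helper, the dictionary stands in for the clusternodeheartbeat table, and the sample node ids and timestamps are made up. Like the SQL query, it works in epoch milliseconds.

```python
def stale_node_ids(heartbeats, now_ms, threshold_hours):
    """Return node ids whose last heartbeat is older than the threshold.

    heartbeats maps node_id -> heartbeat_time in epoch milliseconds.
    """
    threshold_ms = threshold_hours * 3600 * 1000
    return sorted(
        node_id
        for node_id, heartbeat_ms in heartbeats.items()
        if now_ms - heartbeat_ms > threshold_ms
    )


now_ms = 1_600_000_000_000
heartbeats = {
    "node-a": now_ms - 5 * 60 * 1000,      # 5 minutes ago: still alive
    "node-b": now_ms - 2 * 3600 * 1000,    # 2 hours ago: stale
    "node-c": 0,                            # heartbeat_time of 0
}
stale = stale_node_ids(heartbeats, now_ms, threshold_hours=1)
```

With a one-hour threshold, `stale` contains `node-b` and `node-c`. A lower threshold cleans up faster but risks deleting a node that is merely slow to report, so the value is worth tuning against how long deployments take.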

Challenge 2: A Jira node removes the operation entries that it added to the replicatedindexoperation table itself once they are 2 days old (this is configurable in System > Services > com.atlassian.jira.service.services.index.ReplicatedIndexCleaningService). As Jira nodes come and go in the auto-scaling group, a node may not get the chance to remove its old entries. Over time the replicatedindexoperation table fills up with stale records, which can degrade Jira performance.

Solution: Use an Ansible playbook to remove the old operation entries, and create a cron job to run it once a week. Sample playbook:

- name: Delete old jira node operation records from replicatedindexoperation table
  postgresql_query:
    login_host: "{{ atl_db_host }}"
    port: "{{ atl_db_port }}"
    login_user: "{{ atl_db_root_user }}"
    login_password: "{{ atl_db_root_password }}"
    db: "{{ atl_jdbc_db_name }}"
    query: DELETE FROM replicatedindexoperation AS Rep WHERE Rep.node_id NOT IN (SELECT node_id from clusternode)

Challenge 3: It is easy to run Jira in a static environment, as each node has a persistent disk for its local index files. Once Jira is moved to AWS, making the index cope with a dynamic environment is quite challenging, because a newly launched Jira node has nothing on its local disk. I encountered various problems when I first ran Jira in AWS. For example:

– The first node in the cluster did not have valid index files in the middle of a cold deployment.
– A newly launched node failed to get a copy of the index snapshot from another node in the middle of a hot deployment.
– When a node failed to get a copy of the index files from other nodes, Jira did not try any other method (such as restoring from the latest backup). This leaves the node unusable, since the index is the foundation of core Jira functions (JQL, boards, filters …).
– Zephyr manages its own index files, which can easily break in a cluster environment.

Solution: Increase the frequency of both the Jira and Zephyr index backups (e.g. every 3 hours), and use an Ansible playbook to automatically restore the index files on any newly launched Jira node from the latest index backup snapshot saved in the shared home folder. As the delta is only a couple of hours, the node can replay the missed operations recorded in the replicatedindexoperation table. By default, Jira keeps the 3 most recent index snapshots and Zephyr keeps the 5 most recent. This way every Jira node is in a usable state after startup. It also shortens the outage window of a cold deployment, as the cluster can run JQL etc. against the most recent index while one of the nodes is re-indexing. We recommend backing up both the Jira and Zephyr indexes before any deployment, so that newly launched instances get the most recent index files from the snapshot. Sample playbook:

- name: Wait for index files folder ready
  wait_for:
    path: "{{ atl_product_home }}/caches/indexesV1"
    delay: 60
    sleep: 10
    timeout: 120
  ignore_errors: yes

- name: Find the latest Jira index snapshot from backup
  shell:
    cmd: "ls -t {{ atl_product_home_shared }}/export/indexsnapshots/IndexSnapshot* | head -n1"
  register: latest_jira_index_snapshot

- name: Stop jira service
  service:
    name: jira.service
    state: stopped

- name: Recursively remove Jira index files directory
  file:
    path: "{{ atl_product_home }}/caches/indexesV1"
    state: absent

- name: Create Jira index files directory
  file:
    path: "{{ atl_product_home }}/caches/indexesV1"
    state: directory
    owner: "{{ atl_product_user }}"
    group: "{{ atl_product_user }}"

- name: Restore Jira index from latest backup snapshot (.zip)
  unarchive:
    src: "{{ latest_jira_index_snapshot.stdout }}"
    dest: "{{ atl_product_home }}/caches/indexesV1"
    owner: "{{ atl_product_user }}"
    group: "{{ atl_product_user }}"
  when: "'.zip' in latest_jira_index_snapshot.stdout"
  ignore_errors: yes

- name: Restore Jira index from latest backup snapshot (.tar.sz)
  shell:
    cmd: "/usr/local/bin/snzip -c -d {{ latest_jira_index_snapshot.stdout }} | tar xvf - -C {{ atl_product_home }}/caches/indexesV1"
  become: yes
  become_user: "{{ atl_product_user }}"
  when: "'.tar.sz' in latest_jira_index_snapshot.stdout"
  ignore_errors: yes

- name: Find the latest Zephyr index snapshot from backup
  shell:
    cmd: "ls -t {{ atl_product_home_shared }}/zfj/index_backup/IndexSnapshot* | head -n1"
  register: latest_zephyr_index_snapshot

- name: Create Zephyr index files directory
  file:
    path: "{{ atl_product_home }}/caches/indexesV1/plugins/JEEntity/schedule"
    state: directory
    owner: "{{ atl_product_user }}"
    group: "{{ atl_product_user }}"
    mode: 0755

- name: Restore Zephyr index from latest backup snapshot (.zip)
  unarchive:
    src: "{{ latest_zephyr_index_snapshot.stdout }}"
    dest: "{{ atl_product_home }}/caches/indexesV1/plugins/JEEntity/schedule"
    owner: "{{ atl_product_user }}"
    group: "{{ atl_product_user }}"
  when: "'.zip' in latest_zephyr_index_snapshot.stdout"
  ignore_errors: yes

- name: Start jira service
  service:
    name: jira.service
    state: started

- name: Check if jira is ready to serve traffics
  uri:
    url: http://localhost:8080/status
    method: GET
    use_proxy: no
  register: jira_response
  until: jira_response.status == 200
  retries: 60
  delay: 10

- name: Stop jira service
  service:
    name: jira.service
    state: stopped

- name: Start jira service
  service:
    name: jira.service
    state: started
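The snapshot selection step (`ls -t ... | head -n1` followed by a check on the file extension) can also be sketched in Python, which makes the "no snapshot found" case explicit. This is an illustration under assumptions, not part of our playbook: `latest_snapshot` and `restore_method` are hypothetical helpers, the file names are made up, and the extension check uses the end of the file name rather than the substring test in the `when` conditions.

```python
import os
import tempfile
from pathlib import Path


def latest_snapshot(directory, prefix="IndexSnapshot"):
    """Return the newest snapshot file in directory, or None if there is none."""
    candidates = [p for p in Path(directory).glob(prefix + "*") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def restore_method(snapshot):
    """Pick the extraction path from the file name, as the playbook does."""
    name = snapshot.name
    if name.endswith(".zip"):
        return "unzip"
    if name.endswith(".tar.sz"):
        return "snzip+tar"
    return "unsupported"


with tempfile.TemporaryDirectory() as tmp:
    empty_result = latest_snapshot(tmp)       # nothing to restore from yet
    old = Path(tmp, "IndexSnapshot_1.zip")
    new = Path(tmp, "IndexSnapshot_2.tar.sz")
    old.write_bytes(b"")
    new.write_bytes(b"")
    os.utime(old, (1_000, 1_000))             # force distinct modification times
    os.utime(new, (2_000, 2_000))
    picked = latest_snapshot(tmp)
    picked_name = picked.name
    method = restore_method(picked)
```

On a brand-new cluster the snapshot folder is empty, so `latest_snapshot` returns `None`. In that situation the restore should be skipped and the first node re-indexed instead.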
