Overview

Given growing demand and the advantages of cloud technologies, moving Jira to the cloud is all but inevitable, whether that means self-hosted Jira on cloud infrastructure (IaaS) or Atlassian Jira Cloud (SaaS). Our current project is to migrate Jira to AWS, so this post focuses on the challenges of running Jira on AWS and how we solved them.

A typical Atlassian Data Center application (Confluence, Jira, Crowd, Bitbucket) has four layers: a load balancer, cluster nodes, a shared database, and a shared file system.

When mapping it to AWS, we found an AWS service that fits each layer of the architecture. With the following services, every layer is highly scalable and highly available.

Load Balancer: Application Load Balancer (ALB)
Cluster Nodes: Auto Scaling Group (ASG)
Shared Database: Relational Database Service (RDS)
Shared File System: Elastic File System (EFS)

With an Auto Scaling Group (ASG), all Jira nodes are short-lived: running nodes are replaced by new ones when we do a cold or hot deployment, when AWS decommissions hardware, or when an Availability Zone is lost. However, Jira was designed for a static environment in which nodes are expected to be long-lived. Most of the challenges come from this conflict.

Background Knowledge

Let’s start with how Jira tracks nodes in the cluster:

Jira records each node's information in the clusternode table, each node's heartbeat in the clusternodeheartbeat table, each node's recent index operations in the replicatedindexoperation table, and the latest operation each node has processed in the nodeindexcounter table.

With these four tables in mind, we can look at how Jira manages the index files:

In a multi-node Jira Data Center cluster, each node keeps its index files locally. When a change is made on one node (e.g. a new issue is created), that node indexes the change and adds an entry to the replicatedindexoperation table so that the other nodes can replay the operation. The entry is removed later (after 2 days by default). This means that at any given point in time the index files can be inconsistent across nodes. This is so-called eventual consistency.

When a new Jira node joins the cluster and it is the first node in the cluster, it needs a re-index or an index restore if it does not have valid index files locally. If it is not the first node, it queries the database to find out which node has the latest operation. The new node then asks that node for a copy of its index files, but it does not validate the health of the index files on the responding node. The responding node takes a snapshot of its index files and copies the snapshot file to the shared home folder, so that the new node can restore from it. If the new node is unable to restore the index files from other nodes (e.g. due to a timeout or corrupted index files), it neither restores from the index backup nor triggers a re-index automatically.

When an old Jira node re-joins the cluster, it compares its local index files with the latest operation recorded in the database. If the delta is less than two days, it replays the operations kept in the replicatedindexoperation table. If the delta is more than two days, it asks the other nodes for a copy of the latest index files.
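
The re-join behaviour described above can be sketched as a small decision function. This is only an illustration in Python, not Jira's actual code: the function name `catch_up_strategy`, its parameters, and the return values are hypothetical, and the handling of a delta of exactly two days is an assumption because the description does not cover that case.

```python
from datetime import datetime, timedelta

# Retention window of replicatedindexoperation entries (2 days by default).
RETENTION = timedelta(days=2)


def catch_up_strategy(local_index_time: datetime,
                      latest_operation_time: datetime,
                      retention: timedelta = RETENTION) -> str:
    """Return how a re-joining node would bring its index up to date.

    'replay'   -> replay the operations still kept in replicatedindexoperation
    'snapshot' -> ask another node for a copy of its index files
    """
    delta = latest_operation_time - local_index_time
    if delta < retention:
        return "replay"
    # Assumption: a delta of exactly the retention window is treated like an
    # older index, since the needed operations may already have been removed.
    return "snapshot"


if __name__ == "__main__":
    now = datetime(2021, 1, 10, 12, 0, 0)
    print(catch_up_strategy(now - timedelta(hours=3), now))
    print(catch_up_strategy(now - timedelta(days=3), now))
```

The same reasoning is why restoring a snapshot that is only a few hours old works well: the node lands in the replay branch.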

It is also worth mentioning that a Jira node pushes its index files to the other nodes after it has completed a re-index. Besides the Jira index files, some plugins (e.g. Zephyr) manage their own index files.

NOTE: We are currently running Jira 8.5.8. Newer versions of Jira (since 8.12) have introduced DBR (Document Based Replication), which mitigates apps' impact on indexing time and prevents index inconsistencies in a Jira Data Center cluster.

Challenges vs Solutions

Challenge 1: Before version 8.10.0, Jira does not remove stale nodes automatically. As Jira nodes come and go in the Auto Scaling Group, stale node records accumulate in the database over time. The active nodes still try to replicate to those nodes, which no longer exist. This wastes resources and can degrade Jira performance.

Solution: Use an Ansible playbook to remove the stale nodes automatically. The delete_threshold_hours value is configurable; we currently set it to 1 hour. Sample playbook:

	---

	- name: Check if this is a new cluster that has not been initialized
	  stat:
	    path: "{{ atl_product_home_shared }}/export/indexsnapshots"
	  register: snapshotfolder

	- name: Remove offline jira nodes from clusternode table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: DELETE FROM clusternode WHERE node_state='OFFLINE'
	  when: snapshotfolder.stat.exists

	- name: Remove no heartbeat record jira node from clusternode table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: DELETE FROM clusternode AS HB WHERE HB.node_id NOT IN (SELECT node_id from clusternodeheartbeat)
	  when: snapshotfolder.stat.exists

	- name: Remove offline jira nodes from clusternodeheartbeat table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: DELETE FROM clusternodeheartbeat WHERE heartbeat_time=0
	  when: snapshotfolder.stat.exists

	- name: Find jira nodes which heartbeat timestamp are older than delete_threshold_hours
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: SELECT node_id FROM clusternodeheartbeat WHERE %(current_epoch_time_ms)s - heartbeat_time > %(delete_threshold_ms)s
	    named_args:
	      current_epoch_time_ms: "{{ current_epoch_time_sec | int * 1000 }}"
	      delete_threshold_ms: "{{ delete_threshold_hours | int * 3600 * 1000 }}"
	  register: old_heartbeat_nodes
	  when: snapshotfolder.stat.exists

	- name: Delete jira nodes which heartbeat timestamp are older than delete_threshold_hours from clusternode table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: "DELETE FROM clusternode WHERE node_id = '{{ item.node_id }}'"
	  with_items: "{{ old_heartbeat_nodes.query_result }}"
	  when: snapshotfolder.stat.exists and old_heartbeat_nodes.rowcount > 0

	- name: Delete jira nodes which heartbeat timestamp are older than delete_threshold_hours from clusternodeheartbeat table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: "DELETE FROM clusternodeheartbeat WHERE node_id = '{{ item.node_id }}'"
	  with_items: "{{ old_heartbeat_nodes.query_result }}"
	  when: snapshotfolder.stat.exists and old_heartbeat_nodes.rowcount > 0

	- name: Delete jira nodes which heartbeat timestamp are older than delete_threshold_hours from replicatedindexoperation table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: "DELETE FROM replicatedindexoperation WHERE node_id = '{{ item.node_id }}'"
	  with_items: "{{ old_heartbeat_nodes.query_result }}"
	  when: snapshotfolder.stat.exists and old_heartbeat_nodes.rowcount > 0

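
The heartbeat threshold check in the playbook above can be tried out offline with a few lines of Python. This is a minimal sketch that only mirrors the arithmetic of the SELECT query (current time minus heartbeat_time, both in epoch milliseconds, compared with the threshold). The function name `find_stale_nodes` and the node IDs are made up for illustration.

```python
from typing import Dict, List

MS_PER_HOUR = 3600 * 1000


def find_stale_nodes(heartbeats_ms: Dict[str, int],
                     now_ms: int,
                     delete_threshold_hours: int = 1) -> List[str]:
    """Return the node IDs whose last heartbeat is older than the threshold.

    heartbeats_ms maps node_id -> heartbeat_time in epoch milliseconds,
    the unit the playbook converts the current time and threshold into.
    """
    threshold_ms = delete_threshold_hours * MS_PER_HOUR
    return sorted(
        node_id
        for node_id, heartbeat_ms in heartbeats_ms.items()
        if now_ms - heartbeat_ms > threshold_ms
    )


if __name__ == "__main__":
    now_ms = 1_600_000_000_000
    heartbeats = {
        "node-a": now_ms - 5 * 60 * 1000,   # 5 minutes ago: still alive
        "node-b": now_ms - 2 * MS_PER_HOUR,  # 2 hours ago: stale
    }
    print(find_stale_nodes(heartbeats, now_ms))
```

A node whose heartbeat is exactly at the threshold is kept, because the query uses a strict greater-than comparison.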

Challenge 2: Each Jira node removes the operation entries that it added to the replicatedindexoperation table itself once they are 2 days old (this is configurable in System > Services > com.atlassian.jira.service.services.index.ReplicatedIndexCleaningService). As Jira nodes come and go in the Auto Scaling Group, a node may be terminated before it has a chance to remove its old entries. Over time the replicatedindexoperation table fills up with stale records, which can degrade Jira performance.

Solution: Use an Ansible playbook to remove the old operation entries, and create a cron job to run it once a week. Sample playbook:

	---

	- name: Delete old jira node operation records from replicatedindexoperation table
	  postgresql_query:
	    login_host: "{{ atl_db_host }}"
	    port: "{{ atl_db_port }}"
	    login_user: "{{ atl_db_root_user }}"
	    login_password: "{{ atl_db_root_password }}"
	    db: "{{ atl_jdbc_db_name }}"
	    query: DELETE FROM replicatedindexoperation AS Rep WHERE Rep.node_id NOT IN (SELECT node_id from clusternode)

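
To see what the cleanup query does without touching a real Jira database, the same DELETE can be run against an in-memory SQLite database. This is only a sketch: the two tables are reduced to the columns the query needs (the real Jira schema has more), the node IDs are invented, and SQLite stands in for PostgreSQL.

```python
import sqlite3


def delete_orphaned_operations(conn: sqlite3.Connection) -> int:
    """Delete operations whose node is no longer in clusternode.

    Returns the number of deleted rows.
    """
    cursor = conn.execute(
        "DELETE FROM replicatedindexoperation "
        "WHERE node_id NOT IN (SELECT node_id FROM clusternode)"
    )
    conn.commit()
    return cursor.rowcount


def demo() -> int:
    """Build a tiny, simplified copy of the two tables and clean it up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clusternode (node_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE replicatedindexoperation "
        "(id INTEGER PRIMARY KEY, node_id TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO clusternode (node_id) VALUES ('node-a')")
    conn.executemany(
        "INSERT INTO replicatedindexoperation (node_id) VALUES (?)",
        [("node-a",), ("node-gone",), ("node-gone",)],
    )
    deleted = delete_orphaned_operations(conn)
    conn.close()
    return deleted


if __name__ == "__main__":
    print(demo())
```

Only the rows written by the node that no longer exists are removed; the entries of the live node stay in place, so its peers can still replay them.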

Challenge 3: It is easy to run Jira in a static environment, as each node has a persistent disk for keeping the local index files. Once Jira is moved to AWS, it is quite challenging to make the index cope with a dynamic environment, as a newly launched Jira node has nothing on its local disk. I encountered various problems when I first ran Jira in AWS. For example:

– The first node in the cluster did not have valid index files during a cold deployment.
– A newly launched node failed to get a copy of the index snapshot from another node during a hot deployment.
– When a node failed to get a copy of the index files from other nodes, Jira did not try any other method (such as restoring from the latest backup). This leaves the node effectively unusable, since the index underpins core Jira functions (JQL, boards, filters …).
– Zephyr manages its own index files, which can easily break in a clustered environment.

Solution: Increase the frequency of both the Jira and Zephyr index backups (e.g. every 3 hours), and use an Ansible playbook to automatically restore the index files on every newly launched Jira node from the latest index backup snapshot saved in the shared home folder. As the delta is then only a couple of hours, the node can replay the missed operations recorded in the replicatedindexoperation table. By default, Jira keeps the 3 most recent index snapshots and Zephyr keeps the 5 most recent. This way every Jira node is in a usable state after startup. It also shortens the cold deployment outage window, as the cluster can serve JQL and similar queries from the most recent index while one of the nodes is re-indexing. We recommend backing up both the Jira and Zephyr indexes before any deployment, so that newly launched instances get the most recent index files from the snapshot. Sample playbook:

	---

	- name: Wait for index files folder ready
	  wait_for:
	    path: "{{ atl_product_home }}/caches/indexesV1"
	    delay: 60
	    sleep: 10
	    timeout: 120
	  ignore_errors: yes

	- name: Find the latest Jira index snapshot from backup
	  shell:
	    cmd: "ls -t {{ atl_product_home_shared }}/export/indexsnapshots/IndexSnapshot* | head -n1"
	  register: latest_jira_index_snapshot

	- name: Stop jira service
	  service:
	    name: jira.service
	    state: stopped

	- name: Recursively remove Jira index files directory
	  file:
	    path: "{{ atl_product_home }}/caches/indexesV1"
	    state: absent

	- name: Create Jira index files directory
	  file:
	    path: "{{ atl_product_home }}/caches/indexesV1"
	    state: directory
	    owner: "{{ atl_product_user }}"
	    group: "{{ atl_product_user }}"

	- name: Restore Jira index from latest backup snapshot (.zip)
	  unarchive:
	    src: "{{ latest_jira_index_snapshot.stdout }}"
	    dest: "{{ atl_product_home }}/caches/indexesV1"
	    owner: "{{ atl_product_user }}"
	    group: "{{ atl_product_user }}"
	  when: "'.zip' in latest_jira_index_snapshot.stdout"
	  ignore_errors: yes

	- name: Restore Jira index from latest backup snapshot (.tar.sz)
	  shell:
	    cmd: "/usr/local/bin/snzip -c -d {{ latest_jira_index_snapshot.stdout }} | tar xvf - -C {{ atl_product_home }}/caches/indexesV1"
	  become: yes
	  become_user: "{{ atl_product_user }}"
	  when: "'.tar.sz' in latest_jira_index_snapshot.stdout"
	  ignore_errors: yes

	- name: Find the latest Zephyr index snapshot from backup
	  shell:
	    cmd: "ls -t {{ atl_product_home_shared }}/zfj/index_backup/IndexSnapshot* | head -n1"
	  register: latest_zephyr_index_snapshot

	- name: Create Zephyr index files directory
	  file:
	    path: "{{ atl_product_home }}/caches/indexesV1/plugins/JEEntity/schedule"
	    state: directory
	    owner: "{{ atl_product_user }}"
	    group: "{{ atl_product_user }}"
	    mode: 0755

	- name: Restore Zephyr index from latest backup snapshot (.zip)
	  unarchive:
	    src: "{{ latest_zephyr_index_snapshot.stdout }}"
	    dest: "{{ atl_product_home }}/caches/indexesV1/plugins/JEEntity/schedule"
	    owner: "{{ atl_product_user }}"
	    group: "{{ atl_product_user }}"
	  when: "'.zip' in latest_zephyr_index_snapshot.stdout"
	  ignore_errors: yes

	- name: Start jira service
	  service:
	    name: jira.service
	    state: started

	- name: Check if jira is ready to serve traffics
	  uri:
	    url: http://localhost:8080/status
	    method: GET
	    use_proxy: no
	  register: jira_response
	  until: jira_response.status == 200
	  retries: 60
	  delay: 10

	- name: Stop jira service
	  service:
	    name: jira.service
	    state: stopped

	- name: Start jira service
	  service:
	    name: jira.service
	    state: started

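
The two decisions the restore playbook makes (which snapshot is the newest, and how to unpack it based on its extension) can be sketched in Python for local testing. The function names and the sample file names are hypothetical. The sketch picks the newest file by modification time, which is what `ls -t | head -n1` does, and it only names the extraction method rather than running it.

```python
from pathlib import Path
from typing import Optional


def latest_snapshot(directory: Path,
                    prefix: str = "IndexSnapshot") -> Optional[Path]:
    """Return the most recently modified snapshot file, or None if there is none."""
    candidates = [p for p in directory.glob(prefix + "*") if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def restore_method(snapshot: Path) -> str:
    """Map a snapshot file to the extraction approach used in the playbook."""
    name = snapshot.name
    if name.endswith(".zip"):
        return "unarchive"   # handled by the Ansible unarchive module
    if name.endswith(".tar.sz"):
        return "snzip+tar"   # snzip -c -d <file> | tar xvf - -C <dest>
    raise ValueError("unsupported snapshot format: " + name)


if __name__ == "__main__":
    # Hypothetical location; substitute the shared home of your own cluster.
    snapshot = latest_snapshot(Path("/tmp/indexsnapshots"))
    if snapshot is None:
        print("no snapshot found; a full re-index would be needed")
    else:
        print(snapshot, restore_method(snapshot))
```

Handling the "no snapshot" case explicitly matters on a brand-new cluster, where the snapshot folder is still empty.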


Run Jira on AWS – Challenges vs Solutions


Published by Jackie Chen
