Overview
With the increasing demands on Jira and the advantages of cloud technologies, moving Jira to the cloud is inevitable, whether that means self-hosted Jira in the cloud (IaaS) or Atlassian Jira Cloud (SaaS). As our current project is to migrate Jira to the AWS cloud, I will focus on the challenges and solutions of running Jira in AWS.
Here is the typical Atlassian Data Center application infrastructure (Confluence, Jira, Crowd, Bitbucket).
When mapping it to AWS, we found a well-suited AWS service for each layer of the architecture. By adopting the following services, every layer becomes highly scalable and highly available.
- Load Balancer: Application Load Balancer (ALB)
- Cluster Nodes: Auto-Scaling Group (ASG)
- Shared Database: Relational Database Service (RDS)
- Shared File System: Elastic File System (EFS)
With an ASG (Auto-Scaling Group), all Jira nodes are short-lived: running nodes are replaced by new nodes whenever we do a cold or hot deployment, AWS decommissions hardware, or an Availability Zone is lost. However, Jira was designed for a static environment in which nodes are expected to be long-lived. Most of the challenges come from this conflict.
Background Knowledge
Let’s start with how Jira tracks nodes in the cluster:
Jira records cluster state in four tables:
- clusternode: the nodes' information
- clusternodeheartbeat: each node's heartbeat
- replicatedindexoperation: each node's recent index operations
- nodeindexcounter: each node's latest processed operation
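To get a feel for what these tables hold, here is a minimal sketch (not part of our tooling) that lists the registered nodes and their last heartbeat. It assumes a PostgreSQL backend, the community.postgresql Ansible collection, and Jira 8.x column names (heartbeat_time stored as epoch milliseconds); the connection variables are placeholders.

```yaml
- name: Show which Jira nodes the cluster currently knows about
  hosts: localhost
  gather_facts: false
  tasks:
    - name: List registered nodes and their last heartbeat
      community.postgresql.postgresql_query:
        db: jira
        login_host: "{{ jira_db_host }}"          # placeholder connection details
        login_user: "{{ jira_db_user }}"
        login_password: "{{ jira_db_password }}"
        query: >
          SELECT n.node_id,
                 n.node_state,
                 to_timestamp(h.heartbeat_time / 1000) AS last_heartbeat
          FROM clusternode n
          LEFT JOIN clusternodeheartbeat h ON h.node_id = n.node_id
          ORDER BY last_heartbeat DESC NULLS LAST
      register: cluster_nodes

    - name: Print the result
      ansible.builtin.debug:
        var: cluster_nodes.query_result
```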
With an understanding of these four tables, we can move on to how Jira manages the index files:
In a multi-node Jira Data Center cluster, each node keeps its index files locally. When a change is made on one of the nodes (e.g. a new issue is created), that node indexes the change and also adds an entry to the replicatedindexoperation table so that the other nodes can replay the operation; the entry is removed later (after 2 days by default). That means at any given point in time the index files could be inconsistent across the nodes. This is so-called eventual consistency.
When a new Jira node joins the cluster and it is the first node in the cluster, it needs a re-indexing or an index restore if it does not have valid index files locally. If it is not the first node, it queries the database to find out which node has the latest operation. The new node then sends a request to that node asking for a copy of its index files, but it does not validate the health of the index files on the responding node. The responding node takes a snapshot of its index files and copies the snapshot file to the shared home folder, so the new node can restore from it. If the new node is unable to restore the index files from the other nodes (e.g. due to a timeout or corrupted index files), it will neither restore from the index backup nor trigger a re-indexing automatically.
When an old Jira node re-joins the cluster, it compares its local index files with the latest operation recorded in the database. If the delta is less than two days, it replays the operations kept in the replicatedindexoperation table. If the delta is more than two days, it sends a request to the other nodes to get a copy of the latest index files.
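To see how far apart the nodes' indexes are, one can look at the newest operation each node has recorded. Another hedged sketch, under the same assumptions as above (PostgreSQL backend, Jira 8.x column names such as index_time, placeholder connection variables):

```yaml
- name: Show the latest replicated index operation per node
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Query the newest entry each node has written
      community.postgresql.postgresql_query:
        db: jira
        login_host: "{{ jira_db_host }}"          # placeholder connection details
        login_user: "{{ jira_db_user }}"
        login_password: "{{ jira_db_password }}"
        query: >
          SELECT node_id,
                 MAX(index_time) AS latest_operation,
                 COUNT(*)        AS recorded_operations
          FROM replicatedindexoperation
          GROUP BY node_id
          ORDER BY latest_operation DESC
      register: index_ops

    - name: Print the result
      ansible.builtin.debug:
        var: index_ops.query_result
```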
It is also worth mentioning that a Jira node pushes its index files to the other nodes after it has completed a re-indexing. Besides the Jira index files, some plugins (e.g. Zephyr) manage their own index files.
NOTE: We are currently running Jira 8.5.8. Newer versions of Jira (since 8.12) have introduced DBR (Document Based Replication), which mitigates apps' impact on indexing time and prevents index inconsistencies in a Jira Data Center cluster.
Challenges vs Solutions
Challenge 1: Before version 8.10.0, Jira does not remove stale nodes automatically. As Jira nodes come and go in the auto-scaling group, over time there will be lots of stale nodes in the database, and the active nodes will keep trying to replicate to nodes that no longer exist. This wastes resources and can degrade Jira performance.
Solution: Use an Ansible playbook to remove the stale nodes automatically. The delete_threshold_hours is configurable; we currently set it to 1 hour. Sample playbook:
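The sketch below is a minimal illustration rather than the production playbook: it assumes a PostgreSQL backend, the community.postgresql collection, Jira 8.x table names, placeholder connection variables, and heartbeat_time stored as epoch milliseconds.

```yaml
- name: Remove stale Jira cluster nodes
  hosts: localhost
  gather_facts: false
  vars:
    delete_threshold_hours: 1      # nodes silent for longer than this are treated as stale
  tasks:
    - name: Find nodes whose last heartbeat is older than the threshold
      community.postgresql.postgresql_query:
        db: jira
        login_host: "{{ jira_db_host }}"          # placeholder connection details
        login_user: "{{ jira_db_user }}"
        login_password: "{{ jira_db_password }}"
        query: >
          SELECT node_id
          FROM clusternodeheartbeat
          WHERE heartbeat_time <
                EXTRACT(EPOCH FROM now() - INTERVAL '{{ delete_threshold_hours }} hours') * 1000
      register: stale_nodes

    - name: Remove each stale node and its leftover records
      community.postgresql.postgresql_query:
        db: jira
        login_host: "{{ jira_db_host }}"
        login_user: "{{ jira_db_user }}"
        login_password: "{{ jira_db_password }}"
        query: >
          DELETE FROM replicatedindexoperation WHERE node_id = %(id)s;
          DELETE FROM clusternodeheartbeat     WHERE node_id = %(id)s;
          DELETE FROM clusternode              WHERE node_id = %(id)s;
        named_args:
          id: "{{ item.node_id }}"
      loop: "{{ stale_nodes.query_result }}"
```

Since Jira 8.10.0 removes stale nodes by itself, a playbook like this is only needed on older versions.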
Challenge 2: A Jira node removes the operation entries that it added to the replicatedindexoperation table once they are 2 days old (configurable in System > Services > com.atlassian.jira.service.services.index.ReplicatedIndexCleaningService). As Jira nodes come and go in the auto-scaling group, a node may never get the chance to remove its old operation entries. Over time, the replicatedindexoperation table fills up with stale records, which can degrade Jira performance.
Solution: Use an Ansible playbook to remove the old operation entries, and create a cron job to run it once a week. Sample playbook:
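A minimal sketch of that cleanup, under the same assumptions (PostgreSQL backend, community.postgresql collection, a Jira 8.x schema where replicatedindexoperation has an index_time column, placeholder connection variables):

```yaml
- name: Clean up old entries in replicatedindexoperation
  hosts: localhost
  gather_facts: false
  vars:
    retention_days: 2      # matches the ReplicatedIndexCleaningService default
  tasks:
    - name: Delete operation entries older than the retention period
      community.postgresql.postgresql_query:
        db: jira
        login_host: "{{ jira_db_host }}"          # placeholder connection details
        login_user: "{{ jira_db_user }}"
        login_password: "{{ jira_db_password }}"
        query: >
          DELETE FROM replicatedindexoperation
          WHERE index_time < now() - INTERVAL '{{ retention_days }} days'
      register: cleanup

    - name: Report how many rows were removed
      ansible.builtin.debug:
        msg: "Removed {{ cleanup.rowcount }} stale index operation entries"
```

The weekly schedule is then just a crontab entry (or an ansible.builtin.cron task) on an ops host that runs this playbook.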
Challenge 3: It is easy to run Jira in a static environment, as each node has a persistent disk for keeping the local index files. Once Jira is moved to AWS, it is quite challenging to make the index cope with a dynamic environment, as a newly launched Jira node has nothing on its local disk. I encountered various problems when I first ran Jira in AWS. For example:
- The first node in the cluster did not have valid index files when it was in the middle of a cold deployment.
- A newly launched node failed to get a copy of the index snapshot from another node when it was in the middle of a hot deployment.
- When a node failed to get a copy of the index files from other nodes, Jira did not try any other method (like restoring from the latest backup). This basically leaves that node unusable, given that the index is the foundation of core Jira functions (JQL, boards, filters …).
- Zephyr manages its own index files, and they can easily be broken in a cluster environment.
Solution: Increase the frequency of both the Jira and Zephyr index backups (e.g. every 3 hours), and use an Ansible playbook to automatically restore the index files for any newly launched Jira node from the latest index snapshot saved in the shared home folder. As the delta is only a couple of hours, the node can then replay the missed operations recorded in the replicatedindexoperation table. By default, Jira keeps the 3 most recent index snapshots and Zephyr keeps the 5 most recent. In this way, every Jira node is guaranteed to be in a usable state after startup. It also keeps the cold deployment outage window shorter, as the cluster can still serve JQL etc. from the most recent index while one of the nodes is re-indexing. The recommendation is to back up both the Jira and Zephyr indexes before any deployment, so that a newly launched instance can restore the most recent index files from a snapshot. Sample playbook:
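The restore step is sketched below. Again this is an illustration rather than the production playbook: the shared/local home paths are placeholders, the snapshot location and archive format differ between Jira versions (newer versions write .tar.sz snapshots that ansible.builtin.unarchive cannot unpack directly), and the node must not be running while the index files are put in place.

```yaml
- name: Restore the Jira index from the latest snapshot in the shared home
  hosts: jira_nodes
  become: true
  vars:
    shared_home: /mnt/jira-shared                      # placeholder EFS mount point
    local_home: /var/atlassian/application-data/jira   # placeholder Jira local home
  tasks:
    - name: Find index snapshots in the shared home
      ansible.builtin.find:
        paths: "{{ shared_home }}/export/indexsnapshots"
        patterns: "IndexSnapshot_*.zip"
      register: snapshots

    - name: Pick the most recent snapshot
      ansible.builtin.set_fact:
        latest_snapshot: "{{ (snapshots.files | sort(attribute='mtime') | last).path }}"
      when: snapshots.files | length > 0

    - name: Unpack the snapshot into the local index directory
      ansible.builtin.unarchive:
        src: "{{ latest_snapshot }}"
        dest: "{{ local_home }}/caches/indexesV1"      # Jira's default local index directory
        remote_src: true
        owner: jira
        group: jira
      when: snapshots.files | length > 0
```

The same pattern applies to the Zephyr index, using Zephyr's own snapshot location and index directory.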
References
- Jira Data Center search indexing
- Remove abandoned or offline nodes in JIRA Data Center
- JRASERVER-42916 Stale node ids should automatically be removed in Jira Data Center
- JRASERVER-62669 Automatic restore of indexes will fail if the node that registered the latest index operation is unavailable
- JRASERVER-66550 JIRA Datacenter – Add additional Lucene index checks before propagating index to other nodes
- JRASERVER-68437 Perform an integrity check when creating an index backup
- Document-based replication in Jira Data Center