1. Introduction

Continuous Integration (CI) is an important part of modern software development. Introducing it into an organisation is a real game changer, since it radically alters the way teams think about the whole development process. A good CI infrastructure can streamline development all the way from build integration through to deployment, help detect and fix bugs faster, provide a useful project dashboard for both developers and non-developers, and ultimately help teams deliver more real business value to the end user. Jenkins is a widely used CI server. In short, it monitors your version control system for changes. Whenever a change is detected, it automatically compiles and builds the application. If something goes wrong, it notifies the developers immediately so that they can fix the issue quickly.

2. Why Jenkins in the cloud?

Traditionally, the Jenkins master and agents run on a dedicated server and are available only on the company intranet. This setup has a fixed number of agents and therefore does not scale. The diagram below shows the traditional setup of Jenkins in an organisation. Jenkins talks to the GitLab server, which is also hosted internally and available on the same intranet. Now, if Jenkins is running on a dedicated server, what happens when that server fails? (Remember, everything fails sooner or later.)

The picture below shows the traditional Jenkins setup in an organisation.

2.1 Jenkins Architecture

Jenkins is a monolithic application based on a combination of a master and agents (also called slaves).

The Jenkins master monitors sources, triggers jobs when predefined conditions are met, and publishes logs and artifacts. It does not run the actual tasks but makes sure that they are executed. The Jenkins agents (slaves), on the other hand, do the actual work. When the master triggers a job execution, the work itself is performed by an agent.

A Jenkins master cannot be scaled horizontally. We can create multiple Jenkins masters, but they cannot share the same file system. Since Jenkins stores its state in files, multiple instances would be completely separate applications. The main reasons for scaling are fault tolerance and performance, and neither goal would be achieved by running several masters.

But if Jenkins cannot be scaled, how do we meet performance requirements? Well, we can increase the capacity by adding agents. A single master can handle many agents. In most cases, an agent is a whole server (physical or virtual). It is not uncommon for a single master to have tens or even hundreds of agents (servers). In turn, each of those agents runs multiple executors that run tasks.
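
The split of work between master and agents can be sketched in a few lines of Python. This is a toy model for illustration only, not Jenkins code: the Agent and Master classes and their method names are made up here. It shows that the master only assigns builds to free executors and queues anything it cannot place.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Agent:
    """A build server with a fixed number of executors (hypothetical model)."""
    name: str
    executors: int
    running: list = field(default_factory=list)

    def has_free_executor(self) -> bool:
        return len(self.running) < self.executors


@dataclass
class Master:
    """Schedules builds but never runs them itself."""
    agents: list
    queue: deque = field(default_factory=deque)

    def trigger(self, job: str) -> str:
        # Hand the job to the first agent with a free executor.
        for agent in self.agents:
            if agent.has_free_executor():
                agent.running.append(job)
                return agent.name
        # No capacity left: the build waits in the queue.
        self.queue.append(job)
        return "queued"

    def capacity(self) -> int:
        # Total number of builds that can run at the same time.
        return sum(agent.executors for agent in self.agents)


master = Master(agents=[Agent("agent-1", 2), Agent("agent-2", 1)])
placements = [master.trigger(f"build-{i}") for i in range(4)]
print(master.capacity(), placements)
```

With three executors in total, the fourth build has to wait. Adding an agent raises the capacity without touching the master.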

2.2 How is this limiting your development process?

Jenkins is mission critical in many development projects. Release management is blocked if the Jenkins server is down. The pace of development is unpredictable, and provisioning slaves for the peak load wastes a lot of resources: for a team that is not globally spread, these Jenkins slaves are utilized only during office working hours, typically 8-10 hours a day.

On-demand AWS instances are cheap and well suited to the role of slaves. The only workload for which on-demand slaves are not recommended is performance testing. In the traditional setup, by contrast, the cost of provisioning a new slave is high: someone has to prepare a box by installing all the tools and software needed.
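
A quick back-of-the-envelope calculation shows how much of a fixed fleet sits idle. The sketch below assumes that slaves are needed only on five working days a week; the function name and the five-day assumption are ours, not measurements.

```python
def fleet_utilization(busy_hours_per_day: float, working_days_per_week: int = 5) -> float:
    """Fraction of the week during which always-on slaves do useful work."""
    hours_in_week = 24 * 7
    busy_hours = busy_hours_per_day * working_days_per_week
    return busy_hours / hours_in_week


for hours in (8, 10):
    print(f"{hours} h/day -> {fleet_utilization(hours):.0%} utilized")
```

Under these assumptions an always-on fleet is idle for roughly 70 to 76 percent of the week, which is the capacity that on-demand slaves avoid paying for.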

2.3 What can fail?

Everything fails sooner or later, so we should be ready to recover from failure. Jenkins agents can be spun up when needed and terminated when job execution is completed. The Jenkins master, however, needs to run all the time to keep Jenkins up and running. Hence, the Jenkins master is a single point of failure.

Let’s discuss some failure scenarios:

  1. Regular backups of the master can be taken, but there is no guarantee that a backup will succeed. A backup that has not been verified is unreliable.
  2. The cost of recovery is high. Even when backups exist, someone needs to restore the backup manually and set up a new Jenkins master. This is a time-consuming task, and the whole development team is blocked while it happens.
  3. You need to over-provision heavily for a given throughput. Any spike in use degrades the quality of service and can make Jenkins unusable.

3. How can a reliable and self-healing Jenkins environment be set up?

As discussed in the previous section, the traditional Jenkins environment has many disadvantages and carries the risk of data loss. In the following sections, we discuss the guiding principles of a self-healing, reliable and scalable Jenkins environment that runs in the AWS cloud.

3.1 Advantages of going to the cloud

The following are some of the many benefits of running a Jenkins environment in the cloud.

  • On-demand Jenkins slaves
  • Reliable managed environment
  • No manual management of slaves
  • Highly (if not infinitely) scalable
  • Reliable, encrypted EBS volume
  • Automatic restore from EBS
  • AWS provides a variety of instance types. Use instance types to define slave labels, and pick the slave best suited to each job without over- or under-provisioning computing resources.
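
As an illustration of the last point, slave labels can be mapped to instance types so that each job lands on a machine of the right size. The labels and their pairing with instance types below are hypothetical examples, not part of the template.

```python
# Hypothetical mapping from Jenkins slave label to EC2 instance type.
LABEL_TO_INSTANCE_TYPE = {
    "small": "t2.medium",     # lightweight jobs such as linting
    "build": "m4.large",      # regular compile-and-test jobs
    "heavy": "c4.2xlarge",    # CPU-intensive jobs
}

DEFAULT_LABEL = "build"


def instance_type_for(label: str) -> str:
    """Return the instance type for a label, falling back to the default."""
    return LABEL_TO_INSTANCE_TYPE.get(label, LABEL_TO_INSTANCE_TYPE[DEFAULT_LABEL])


print(instance_type_for("heavy"))
print(instance_type_for("unknown-label"))
```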

3.2 Architecture

The picture below shows the architecture of self-healing Jenkins in the cloud. Unlike traditional Jenkins, slaves are started on demand whenever new build requests are waiting in the queue. This makes Jenkins highly scalable.
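
The scaling decision behind on-demand slaves can be sketched as a small function. This is an assumption about how such a rule might look, not the logic of any particular Jenkins plugin; the parameter names and the cap on the number of slaves are ours.

```python
import math


def slaves_to_launch(queued_builds: int, idle_executors: int,
                     executors_per_slave: int, running_slaves: int,
                     max_slaves: int) -> int:
    """How many new slaves to start so that queued builds stop waiting."""
    # Builds that idle executors on existing slaves cannot absorb.
    unserved = max(0, queued_builds - idle_executors)
    wanted = math.ceil(unserved / executors_per_slave)
    # Never exceed the configured ceiling on the fleet size.
    headroom = max(0, max_slaves - running_slaves)
    return min(wanted, headroom)


print(slaves_to_launch(queued_builds=7, idle_executors=1,
                       executors_per_slave=2, running_slaves=3, max_slaves=10))
```

The cap keeps a burst of builds from starting an unbounded number of instances; builds beyond the cap simply stay in the queue until a slave frees up.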

3.3 Self-healing

Self-healing Jenkins means that if the Jenkins master is terminated for any unforeseen reason, a new master comes up automatically and all the global configuration, build jobs and state are restored without losing any data.

In self-healing Jenkins, we advise having a two-tier recovery plan:

  • Encrypted AWS EBS volume
  • Backup in an S3 bucket
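
The two tiers can be read as an ordered fallback. A minimal sketch, assuming the new master can tell whether the EBS volume is usable and whether a backup exists in S3 (both flags and the function name are hypothetical):

```python
def choose_recovery_source(ebs_volume_usable: bool, s3_backup_exists: bool) -> str:
    """Pick where a freshly started master gets its Jenkins home from."""
    if ebs_volume_usable:
        # Tier 1: re-attach the surviving encrypted EBS volume.
        return "ebs"
    if s3_backup_exists:
        # Tier 2: the volume is gone, so restore the latest backup from S3.
        return "s3"
    # Nothing to restore: start with an empty Jenkins home.
    return "fresh"


for state in [(True, True), (False, True), (False, False)]:
    print(state, "->", choose_recovery_source(*state))
```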

3.4 Implementation Details

Self-healing Jenkins is automated using a CloudFormation template. The template consists of four main sections: parameters, conditions, resources and outputs. I will walk you through the important snippets of the code, which explain how we automated self-healing Jenkins.


When you spin up Jenkins for the first time, leave “EBSVolumeID” at its default value, “create”, so that a new EBS volume is created. Set “EBSVolumeSize” according to your EBS volume size requirement.

"EBSVolumeID": {
  "Default": "create",
  "Description": "'create', or an existing EBS volume ID to use instead of creating a new one.",
  "AllowedPattern": "(create|vol-[a-z0-9]+)",
  "Type": "String"
},
"EBSVolumeSize": {
  "Default": "120",
  "Description": "The size of the EBS volume to create (in GB), if EBSVolumeID is set to 'create'.",
  "AllowedPattern": "[0-9]+",
  "Type": "String"
}


The condition below is evaluated against EBSVolumeID. On the very first launch it must be true, which is the case when “create” is passed as EBSVolumeID.

"Conditions": {
  "CreateEBSVolume": {"Fn::Equals": [{"Ref": "EBSVolumeID"}, "create"]}
}

If you already have an EBS volume and want to spin up the Jenkins master from it, pass the ID of that volume as “EBSVolumeID”.
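
To make the parameter and condition logic concrete, here is a small Python sketch that mimics what CloudFormation does with EBSVolumeID. It is only an illustration: the helper names are ours, and it relies on the fact that an AllowedPattern has to match the whole parameter value.

```python
import re

# Same pattern as the AllowedPattern of the EBSVolumeID parameter.
EBS_VOLUME_ID_PATTERN = re.compile(r"(create|vol-[a-z0-9]+)")


def validate_ebs_volume_id(value: str) -> str:
    """Reject values that CloudFormation would refuse for EBSVolumeID."""
    if EBS_VOLUME_ID_PATTERN.fullmatch(value) is None:
        raise ValueError(f"invalid EBSVolumeID: {value!r}")
    return value


def create_ebs_volume(value: str) -> bool:
    """Mirror of the CreateEBSVolume condition: Fn::Equals [EBSVolumeID, 'create']."""
    return validate_ebs_volume_id(value) == "create"


print(create_ebs_volume("create"))                 # first launch: new volume
print(create_ebs_volume("vol-0123456789abcdef0"))  # relaunch: reuse the volume
```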


The encrypted EBS volume is created using the snippet below:

"EBSSSDVolume": {
  "Type": "AWS::EC2::Volume",
  "Properties": {
    "VolumeType": "gp2",
    "Encrypted": true,
    "Size": {"Ref": "EBSVolumeSize"},
    "AvailabilityZone": {"Ref": "AvailabilityZone"},
    "Tags": [{"Key": "Name", "Value": {"Fn::Join": ["", [{"Ref": "ENV"}, "-jenkins-volume"]]}}]
  },
  "Condition": "CreateEBSVolume"
}

Userdata for Launch Configuration

This section explains the actual logic of spinning up a Jenkins master machine. What we do here:

  • Bootstrapping in user data
  • Software and plugins are installed automatically during bootstrap
  • Dockerized Jenkins master
  • EBS volume for persistent data storage

"UserData": {
  "Fn::Base64": {
    "Fn::Join": ["", [
      "#!/bin/bash -xe\n",
      "# The docker file\n",
      "cat <<-EOF >/home/ubuntu/Dockerfile\n",
      "FROM jenkins/jenkins:2.73.1\n",
      "# Install useful software.\n",
      "USER root\n",
      "RUN apt-get update && apt-get install -y duplicity python-boto git-review jq && rm -rf /var/lib/apt/lists/*\n",
      "\n",
      ...
      "EOF\n",
      ...
      "EBS_VOLUME_ID=", {"Fn::If": ["CreateEBSVolume", {"Ref": "EBSSSDVolume"}, {"Ref": "EBSVolumeID"}]}, "\n",
      "INSTANCE_ID=$(ec2metadata --instance-id)\n",
      "echo \"Attaching EBS volume ${EBS_VOLUME_ID} to device ${EBS_VOLUME_DEVICE} on instance ${INSTANCE_ID}\"\n",
      "aws --region ", {"Ref": "AWS::Region"}, " ec2 attach-volume --volume-id ${EBS_VOLUME_ID} --instance-id ${INSTANCE_ID} --device ${EBS_VOLUME_DEVICE}\n",
      "echo \"Waiting for attach to complete...\"\n",
      "let RETRIES_LEFT=60\n",
      "while [[ \"$RETRIES_LEFT\" -gt \"0\" ]]; do\n",
      "  sleep 1s\n",
      "  ATTACH_STATUS=$(aws --region ", {"Ref": "AWS::Region"}, " ec2 describe-volumes --volume-ids ${EBS_VOLUME_ID} | jq '.Volumes[0].Attachments[0].State')\n",
      "  if [ \"${ATTACH_STATUS}\" == '\"attached\"' ]; then\n",
      "    echo \"Attach completed.\"\n",
      "    break\n",
      "  fi\n",
      "  echo \"Current status: ${ATTACH_STATUS}.\"\n",
      "  RETRIES_LEFT=$((RETRIES_LEFT - 1))\n",
      "done\n",
      "set +e\n",
      "sudo file -s ${EBS_VOLUME_DEVICE} | cut -d , -f1 | grep -q \"ext4\"\n",
      "if [ $? -eq 0 ]; then\n",
      "  echo \"Already data in the EBS volume ...\"\n",
      "else\n",
      "  sudo mkfs -t ext4 ${EBS_VOLUME_DEVICE}\n",
      "fi\n",
      "set -e\n",
      "sudo mkdir ${MOUNT_POINT}\n",
      "sudo mount ${EBS_VOLUME_DEVICE} ${MOUNT_POINT}\n",
      "# The plugins.txt file\n",
      "cat <<-EOF >/home/ubuntu/plugins.txt\n",
      ...
      "EOF\n",
      "sudo apt-get update\n",
      "sudo apt-get install -qy python-pip apt-transport-https curl ca-certificates software-properties-common\n",
      "curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n",
      "sudo add-apt-repository \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n",
      "sudo apt-get update\n",
      "sudo apt-get install -qy docker-ce\n",
      "sudo usermod -aG docker $USER\n",
      "sudo docker build -t jenkins_custom_made:v1 /home/ubuntu\n",
      "sudo mkdir -p ${MOUNT_POINT}/jenkins_home\n",
      "sudo chown ubuntu:ubuntu ${MOUNT_POINT}/jenkins_home\n",
      "sudo docker run --restart=always -d -p 8080:8080 -p 50000:50000 -v ${MOUNT_POINT}/jenkins_home:/var/jenkins_home --name=jenkins jenkins_custom_made:v1\n",
      "sudo add-apt-repository -y ppa:duplicity-team/ppa\n",
      ...
    ]]
  }
}

The complete, automated CloudFormation template for self-healing Jenkins is available at https://github.com/bsarbhukan/self-healing-jenkins
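
The wait-for-attach loop in the user data above is an instance of polling with a bounded number of retries. The same pattern is sketched here in Python for readability; wait_until and the fake status source are hypothetical names used only for this illustration, and no AWS call is made.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], retries: int = 60,
               delay_seconds: float = 1.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or the retries run out."""
    for _ in range(retries):
        sleep(delay_seconds)
        if check():
            return True
    return False


# Stand-in for `aws ec2 describe-volumes`: reports "attached" on the third poll.
states = iter(["attaching", "attaching", "attached"])
attached = wait_until(lambda: next(states) == "attached",
                      retries=5, sleep=lambda _seconds: None)
print("Attach completed." if attached else "Gave up waiting for the volume.")
```

Returning False when the retries are exhausted lets the caller stop the bootstrap instead of trying to mount a device that never appeared.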

3.5 Security

Security matters more than ever these days, and it is a particular concern when intellectual property is hosted in the cloud. The following can be used as guiding principles when designing for security:

  • The principle of least privilege is followed. Using security groups, only the required sources are granted access. Access control is the primary mechanism for securing a Jenkins environment against unauthorized usage.
  • Identity and permissions are managed using IAM Roles and IAM policies
  • Network ACLs
  • Jenkins master is a private node and it doesn’t have a public IP address
  • Integration with LDAP or Active Directory
  • Access levels based on matrix-based security
  • Encrypted volume
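
One way to check the first principle is to scan security group rules for sources that are open to the whole internet. The sketch below works on plain dictionaries; the rule layout and the sample addresses are hypothetical, not the output of an AWS API.

```python
import ipaddress


def open_to_world(rules: list) -> list:
    """Return the rules whose source CIDR covers every address."""
    offenders = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"])
        if network.prefixlen == 0:  # 0.0.0.0/0 or ::/0
            offenders.append(rule)
    return offenders


# Hypothetical ingress rules for the Jenkins master.
rules = [
    {"port": 8080, "cidr": "10.0.0.0/16"},   # web UI, internal network only
    {"port": 50000, "cidr": "10.0.1.0/24"},  # agent connections
    {"port": 22, "cidr": "0.0.0.0/0"},       # violates least privilege
]
print([rule["port"] for rule in open_to_world(rules)])
```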

4. Way Forward

Currently, an initial version of self-healing Jenkins is implemented, but we see the possibility of extending it with more advanced features. Some of the features we would like to have are:

  • Using Jenkins 2. Job configuration is part of the code. Jobs and infrastructure are managed as code and go through the same review process as normal code.
  • Security can be further strengthened by encrypting at the instance level as well (in addition to EBS-provided encryption) using LUKS.
  • Enhancing security at the instance level through an iptables-based firewall. This can be useful in case AWS security groups are bypassed by an attacker. Multiple levels of security defence help.
  • Monitoring Jenkins, possibly with Nagios or a similar tool, to verify that the whole Jenkins infrastructure is running smoothly.

5. Conclusion

This is only a reference design and implementation. Every project has its own requirements and customizations. Reader's discretion is advised 🙂

These principles can be extended to other cloud infrastructure providers such as Azure and GCP, and also to an internally managed company cloud based on OpenStack.