Fall Damage Blog

Modern Jenkins for Unreal Engine

Posted at — Oct 10, 2021

Game engines provide a lot of things out of the box - but CI systems are typically not one of them. I have always found this annoying. Setting up a satisfactory CI solution for game projects tends to take a lot of effort.

We have used Jenkins as the CI system for our Unreal Engine project for two years now. It is not fancy, but it does the job. It started out as a single Windows desktop, but has grown to four machines plus a VM, each with slightly different versions of the installed tools, etc. In short, it is a mess. It has taken us this far, but it is not a viable strategy to ship a 50+ person, 2+ year game on.

What options are there? How should a reasonable CI system work? We have spent the past 1.5 years investigating a lot of different options, and we have finally ended up with something useful. Check it out on GitHub.

Important criteria

No point & click configuration of the CI system. Everything comes from config files in repos. Machine configuration comes from config files / Docker images / VM images.

Build machines can preserve hundreds of GB of state, and do incremental builds.

If run in the cloud, we don’t want to pay to have machines running 24h/day since there is only activity 40h/week (= 25% of the week).

Developers should get quick feedback when they have made a bad commit.

Running physical machinery is not our core business. If there is a cost-effective cloud solution, we’ll take it. We can spend lots of time writing & improving automation though.

Why Jenkins?

A lot has happened in the CI space during the past decade. Unfortunately, many newer services are either Linux-only, container-only, Git-only, Kubernetes-centric, or pricey closed-source software. None of them offer the flexibility of Jenkins.

TeamCity and Buildkite are worthy contenders. They offer a higher-quality experience than Jenkins, but they are not enough of an improvement for us to motivate moving away from open-source software.

An aside: GitHub Actions with self-hosted runners

We kept UE in GitHub, and our game in Plastic SCM. We used GitHub Actions with self-hosted runners in GCE for creating installed Engine builds. This worked well… except for a number of problems:

GHA does not officially support on-demand runners: runners are de-registered after 30 days of inactivity, and we had to rely on undocumented behaviour (which regularly broke) to make on-demand provisioning work. This also meant that the dev team needed to be comfortable with two different build systems.

GHA was helpful, but it became apparent that we wanted to build UE using the same CI that we used for the game.

Jenkins controller on GKE: yes!

Google Cloud provides not only GCE, but also GKE. There is a Helm chart for Jenkins. This turned out to be a very good starting point: it is a well-designed way to configure Jenkins from files. It also offers a standard way to deploy it onto Kubernetes.
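Deploying the chart is then a matter of a couple of commands (the values file name below is our own convention, not part of the chart):

    # Add the official Jenkins chart repository, then deploy/upgrade the
    # controller with all configuration taken from a version-controlled file.
    helm repo add jenkins https://charts.jenkins.io
    helm repo update
    helm upgrade --install jenkins jenkins/jenkins \
        --namespace jenkins --create-namespace \
        --values jenkins-values.yaml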

Combine this with a home-built Docker image with the appropriate plugins (just make sure to version-lock all plugin dependencies, as sketched below), and you have a complete declarative configuration of the controller. Excellent!
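A minimal sketch of what such an image can look like (base image tag and plugin versions are illustrative, not our actual ones):

    # Dockerfile: controller image with all plugins baked in.
    FROM jenkins/jenkins:2.303.2-lts
    COPY plugins.txt /usr/share/jenkins/ref/plugins.txt
    RUN jenkins-plugin-cli --plugin-file /usr/share/jenkins/ref/plugins.txt

where plugins.txt pins every plugin (including its transitive dependencies) to an exact version:

    configuration-as-code:1.55
    google-compute-engine:4.3.3
    workflow-aggregator:2.6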

With the controller sorted, there’s just the problem of how to run the agents. Hmm…

Jenkins agents on GKE: nope

While it is possible to run agents on GKE, it is not a good choice for UE / games. Why is that?

The obvious way to run Jenkins on GKE in a cost-effective manner is to have one pool with Windows nodes, another pool with Linux nodes, etc, and dynamically scale the number of nodes up and down depending on demand. This strategy falls apart for several reasons.

The first is that we use large Docker images for Windows jobs. When a new GKE node is provisioned, its Docker image cache will be empty. Pulling the appropriate Docker agent image + build tools image takes 15-25 minutes! Most of this time is not network transfer delay; it is the OS or Docker daemon doing various preparations. There is no good way to hook into this part of the GKE node provisioning either.

The second is that Jenkins doesn’t quite understand that a GKE node needs time between launch and the start of job processing; it considers a node that is still pulling Docker images to be idle. To make Jenkins wait up to 30 minutes for the node to get provisioned and the job to start, the idle timeout must be set to 30+ minutes. This means that the node will always remain active for at least 30 minutes after the last build has completed.
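For reference, that workaround would look something like this in the chart’s values (a sketch; agent.idleMinutes is the kubernetes-plugin’s pod retention setting):

    # jenkins-values.yaml (excerpt): keep agent pods alive for 35 minutes,
    # so that slow node provisioning is not mistaken for idleness.
    agent:
      idleMinutes: 35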

The third is that we need to bind each job’s workspace to a PVC to retain state between runs. Handling these PVCs is something we’d need to do outside of Jenkins, via Terraform (well, unless we use CRDs, but we didn’t know about that back then). We’d need to design a systematic solution for PVC life-cycle handling; it doesn’t come out of the box with either Jenkins or GKE.
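Each such PVC would be a hand-managed object along these lines (name, size and storage class hypothetical):

    # One PVC per job workspace, created outside Jenkins (e.g. via Terraform).
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: workspace-game-win64
      namespace: jenkins
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: standard
      resources:
        requests:
          storage: 500Gi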

Jenkins agents on GCE, in Docker: nope

How about running the agent within Docker containers, on VMs? It seems like a good idea at first, but in practice it becomes unwieldy.

Everything becomes a bit harder to debug. Windows consoles are not full-blown TTYs; if you connect via WinRM to a Windows machine, you can’t run a Docker container with an interactive terminal without magic incantations. If a program within the container tries to access the GCP metadata IP address (169.254.169.254) and you haven’t explicitly forwarded it, the command will hang indefinitely. On Linux, you will run into UID/GID pains when sharing files between the host and the container.

The start delay for agents is unpredictable: there’s a 15-20 minute image fetch delay when the Docker image cache is empty, just like on GKE. There are more opportunities for caching these results than with GKE, but still – if we provision VMs on-demand, a job will occasionally take 15+ minutes to get started, with no indication in the Jenkins GUI of what’s happening – it looks like something has hung or gone wrong.

Jenkins agents on GCE, in raw VMs: yes!

If we build VM images that contain all necessary software, then we eliminate most pain points of the Docker VMs. These provision and boot in 60-90 seconds. Build jobs on these VMs behave similarly to how they do on developers’ local workstations.
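We won’t describe our exact image pipeline here, but as a sketch: baking such an image with Packer’s googlecompute builder (all names and values below are illustrative) looks roughly like this:

    # agent-image.pkr.hcl: bake a Windows build-agent image for GCE.
    source "googlecompute" "windows_agent" {
      project_id          = "my-ci-project"
      zone                = "europe-west1-b"
      source_image_family = "windows-2019"
      machine_type        = "n1-standard-8"
      disk_size           = 500
      image_family        = "jenkins-agent-windows"
      communicator        = "winrm"
      winrm_username      = "packer_user"
      winrm_use_ssl       = true
      winrm_insecure      = true
    }

    build {
      sources = ["source.googlecompute.windows_agent"]

      # Install compilers, SDKs, the Jenkins agent runtime, and other build tools.
      provisioner "powershell" {
        scripts = ["install-build-tools.ps1"]
      }
    }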

There is a plugin that provisions on-demand agents on GCE. It is a good starting point, but how do we retain state between successive runs?

We extended the GCE plugin to support persistent agents: when a VM is no longer needed, it is not deleted – it is merely stopped. The CPU and RAM are released, but the disk is kept. This lowers the cost of inactive agents drastically. Starting a stopped agent takes 45-60 seconds. This is “good enough” for our users.
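In GCE terms, the plugin now performs the equivalent of this instead of an instance delete (instance name hypothetical):

    # Stopping releases CPU/RAM billing but keeps the disks, so all
    # incremental-build state survives until the agent is started again.
    gcloud compute instances stop jenkins-agent-win-3 --zone=europe-west1-b
    gcloud compute instances start jenkins-agent-win-3 --zone=europe-west1-b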

Maintainability, performance and cost

The biggest benefit so far is that we have control over our CI system. There is no magic, manual configuration on any machine.

The second-biggest benefit is that we can quickly add and remove jobs without worrying about capacity; it’s just a question of $$$.

Regarding performance – the VMs have lots of RAM, decent CPUs, and somewhat slower disks than physical machines.

When it comes to cost, this is indeed pricey. $1700/month to support a 50-person team is more expensive than self-hosting… but compared to employee salaries it is not totally out of line.

In short: This beats a poorly-maintained on-premises system. It is an important stepping stone to a well-maintained on-premises system. A well-maintained on-premises system will beat this.

The future: Jenkins Operator + CRDs

Terraform makes it easy to configure core infrastructure. The Helm chart makes it easy to deploy the Jenkins controller. Config changes and version updates are still complicated. Config Connector will allow agent VM templates to be controlled from within Kubernetes.
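As a rough sketch of where this is heading: with Config Connector, an agent VM template becomes just another Kubernetes resource (schema abbreviated; names hypothetical):

    # GCE instance template managed from within Kubernetes via Config Connector.
    apiVersion: compute.cnrm.cloud.google.com/v1beta1
    kind: ComputeInstanceTemplate
    metadata:
      name: jenkins-agent-windows
    spec:
      machineType: n1-standard-16
      disk:
        - boot: true
          sourceImageRef:
            external: projects/my-ci-project/global/images/family/jenkins-agent-windows
      networkInterface:
        - networkRef:
            name: default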

The Jenkins Operator might provide a good standard for how to apply config changes to Jenkins in a safe manner, without interrupting existing activity (like the current Helm chart does). This will hopefully simplify how to do upgrades while the CI system is active.

The future: Jenkins agents on on-premises hardware

Now that we have automated creation of Docker containers and VM images, we are in a much better position to set up on-premises machines. There’s a lot for us to figure out here. Should we virtualize the machines, or go bare metal? Do we need separate Windows vs Linux solutions? What about Windows licenses? Will we need local caches, object storage, etc to minimize network traffic charges? So many questions!

Closing words

It takes a long time to develop CI solutions. If you want your team to have a good development experience, you must work proactively on CI. Make sure the CI is a help instead of a hindrance; making games is difficult enough as it is.

Do you want to work on these sorts of problems? We are hiring! Or, reach out to @Kalmalyzer on Twitter for a chat!
