# How we cut Atlas backend deploys from 33 minutes to 102 seconds
We moved Atlas backend image builds from GitHub-hosted x64 emulation to a self-hosted native ARM runner on our VPS and cut full deploy time from 32m56s to 1m42s.
The Atlas backend deployment path went through three versions in quick succession. It started as a tightly coupled CapRover source-build flow, moved to a GitHub-hosted image build that emulated arm64 on an x64 runner, and finally landed on a self-hosted native arm64 runner living on the same VPS as production. The last version is the one that stuck, not because it is clever, but because it finally puts the right work in the right place. GitHub Actions builds the image, GHCR stores the artifact, and CapRover just pulls and runs it.

## The scrappy phase
Originally, GitHub pushed source code to CapRover, and CapRover built the app on the production server. Build and deploy were the same operation. Whether a deploy succeeded depended partly on the CapRover log stream staying alive long enough to report back. Deploys in this phase ranged from about 7m to 19m, and several ended up marked red even when parts of the rollout had actually gone through.
Speed was one problem, but trust was worse. There was no immutable artifact, no separation between a build failure and a rollout failure, and the production machine was doing compilation work during every deploy. We also ran into `socket hang up` errors that turned CI red for no real reason.
## The emulation detour
The next version moved the build into GitHub Actions, pushed the resulting image to GHCR, and had CapRover deploy from the image. That was the right architectural move. But the backend server is arm64 and GitHub's hosted runners are x64, so the workflow had to build the arm64 image under QEMU emulation.
It worked, but it was painfully slow. Run 24556328395 clocked in at 32m56s end to end, with 30m56s of that spent on the image build step alone. The deploy path was structurally cleaner, but the feedback loop was still terrible.
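For reference, the emulated build in this phase looked roughly like the workflow fragment below. This is a hedged sketch, not our actual workflow file: the job name, image name, and `OWNER` placeholder are illustrative, though the actions themselves (`setup-qemu-action`, `setup-buildx-action`, `build-push-action`) are the standard way to cross-build under QEMU.

```yaml
# Illustrative sketch of the phase-two job: an x64 runner producing an
# arm64 image under QEMU emulation. Every instruction in the Dockerfile
# runs emulated, which is where the ~31 minutes went.
jobs:
  build:
    runs-on: ubuntu-latest                 # GitHub-hosted x64 runner
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3  # install QEMU binfmt handlers
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          platforms: linux/arm64           # non-native target → emulation
          push: true
          tags: ghcr.io/OWNER/atlas-backend:${{ github.sha }}
```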
## The native runner
The fix was simple. We installed a GitHub Actions runner directly on the ARM VPS, ran it as a system service outside of CapRover, and pointed the image-build job at that runner. No more emulation.
Run 24559726174 completed in 1m42s. The image build itself took about 40s. The workflow now builds natively on arm64, pushes to GHCR, and CapRover pulls the image. That is the setup we wanted.
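On the runner side, the switch is mostly two things: registering the runner as a system service on the VPS (GitHub's runner package ships a `svc.sh` helper for exactly this: `sudo ./svc.sh install && sudo ./svc.sh start`), and retargeting the build job at it by label. A sketch, assuming the runner's default labels and an illustrative image name:

```yaml
# Hypothetical job fragment: target the self-hosted arm64 runner by its
# default labels. No QEMU or buildx emulation setup is needed; the
# docker build runs natively on the VPS's own CPU.
jobs:
  build:
    runs-on: [self-hosted, Linux, ARM64]
    steps:
      - uses: actions/checkout@v4
      - name: Build and push natively
        run: |
          docker build -t "ghcr.io/OWNER/atlas-backend:${GITHUB_SHA}" .
          docker push "ghcr.io/OWNER/atlas-backend:${GITHUB_SHA}"
```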
## The numbers
| Setup | Full deploy | Image build |
|---|---|---|
| GitHub-hosted x64 runner, emulated arm64 | 32m56s | 30m56s |
| Self-hosted native arm64 runner | 1m42s | ~40s |
That is a 31m14s reduction end to end, or about 95% faster. The image build step alone dropped by 30m16s.
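The deltas are easy to verify from the raw run times. A quick sanity check in Python, using the numbers from the two runs above (the ~40s native build time is approximate, as stated):

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes+seconds duration to total seconds."""
    return minutes * 60 + seconds

emulated_total = to_seconds(32, 56)   # run 24556328395: full deploy
native_total = to_seconds(1, 42)      # run 24559726174: full deploy
emulated_build = to_seconds(30, 56)   # image build under QEMU
native_build = 40                     # native image build, approximate

saved = emulated_total - native_total        # end-to-end reduction
pct = saved / emulated_total                 # fraction faster
build_drop = emulated_build - native_build   # image-build reduction

print(f"{saved}s saved ({pct:.0%}), build step down {build_drop}s")
```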
## What actually changed
The speedup did not come from one trick. It came from fixing the deployment contract. We stopped building application source on the rollout path and started producing an immutable image artifact first. We run that build on the same CPU architecture production uses, and we keep the CI runner outside of CapRover so it stays infrastructure rather than becoming another managed app.
That runner placement matters. If we had put it inside CapRover, we would have reintroduced the same coupling we were trying to escape. The runner is a build tool. It does not belong in the application platform.
## Beyond speed
The bigger win is that failures are legible now. If the image does not build, GitHub Actions tells us. If the image does not deploy, CapRover tells us. Those used to be tangled together into one opaque result. The image is versioned and immutable in GHCR, which makes rollback straightforward, and production no longer compiles anything during a deploy.
## Tradeoffs
This is still an early-stage setup. The build runner and production share the same VPS, which is fine at our current scale but means builds can contend with production for CPU and memory. Docker build cache grows aggressively on a machine that is also running application containers.
We keep runner concurrency at 1, prune the Docker build cache periodically, and keep the build and deploy jobs as separate explicit steps. If we outgrow a single VPS, the runner moves to its own machine and nothing else changes.
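The cache-pruning piece is just a scheduled `docker builder prune`. A sketch of the kind of crontab entry we mean; the schedule and the seven-day retention window are illustrative, not a recommendation:

```shell
# Illustrative crontab entry: every Sunday at 04:00, drop build cache
# older than 7 days (168h). --force skips the confirmation prompt,
# which is what makes it safe to run unattended.
0 4 * * 0  docker builder prune --force --filter "until=168h"
```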
## The final architecture
Today, GitHub Actions builds a native arm64 image on the self-hosted runner. GHCR stores the image tagged by commit SHA. CapRover deploys that image. The API and worker roll forward independently.
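The deploy step itself reduces to telling CapRover which image to run. A hedged sketch using the CapRover CLI's image-deploy mode; the app names, token variables, and image path here are illustrative placeholders, not our real configuration:

```shell
# Hypothetical deploy commands: CapRover pulls the SHA-tagged image
# from GHCR instead of building anything itself. The API and worker
# are separate apps, so they roll forward independently.
caprover deploy --caproverUrl "$CAPROVER_URL" --appToken "$API_TOKEN" \
  --appName atlas-api --imageName "ghcr.io/OWNER/atlas-backend:${GITHUB_SHA}"
caprover deploy --caproverUrl "$CAPROVER_URL" --appToken "$WORKER_TOKEN" \
  --appName atlas-worker --imageName "ghcr.io/OWNER/atlas-backend:${GITHUB_SHA}"
```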
It is not fancy infrastructure. It is just the first version that respects the boundary between building, storing, and running. That was enough to take us from 32m56s to 1m42s.