I really liked this thread by Dino Dai Zovi, so I wanted to save it for easy future reference here.
All ideas and insight shared below are Dino’s of course.
I’ll put some things that I consider good ideas in this thread here. I don’t know your environment, so I don’t have any good advice on what’ll work best for you.
Build systems that hash their inputs to derive the name of the resulting output and cache results in content-addressable storage (CAS) take a little effort to understand, but it’s such a powerful security idea.
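To make the idea concrete, here is a minimal sketch of an input-addressed build cache. It is not how any particular build system does it; the names `input_digest`, `InMemoryCAS`, and `cached_build` are hypothetical, the store is just a dict standing in for real storage, and the set of hashed inputs (command, toolchain label, source contents) is an assumption about what influences the build.

```python
from __future__ import annotations

import hashlib
import json


def input_digest(command: list[str], toolchain: str, sources: dict[str, bytes]) -> str:
    """Hash everything assumed to influence the build into one hex digest."""
    h = hashlib.sha256()
    # Serialize the command and toolchain label deterministically.
    h.update(json.dumps({"command": command, "toolchain": toolchain}, sort_keys=True).encode())
    # Walk sources in sorted path order so dict ordering does not matter.
    for path in sorted(sources):
        h.update(path.encode())
        h.update(b"\0")
        h.update(hashlib.sha256(sources[path]).digest())
    return h.hexdigest()


class InMemoryCAS:
    """Hypothetical stand-in for an external store keyed by digest."""

    def __init__(self) -> None:
        self._blobs = {}

    def get(self, digest: str) -> bytes | None:
        return self._blobs.get(digest)

    def put(self, digest: str, output: bytes) -> None:
        self._blobs[digest] = output


def cached_build(cache, command, toolchain, sources, build_fn):
    """Return (digest, output), running build_fn only on a cache miss."""
    digest = input_digest(command, toolchain, sources)
    hit = cache.get(digest)
    if hit is not None:
        return digest, hit
    output = build_fn(sources)
    cache.put(digest, output)
    return digest, output
```

Because the key is derived only from the inputs, a build node cannot overwrite someone else's entry without also changing the key, and changing any source byte yields a different name.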
Multi-tenant CI systems are the devil. We have better orchestration systems now, so spinning up fresh VMs, micro-VMs, cloud instances, even containers is so much easier than it used to be. Embrace this and externalize anything cached between builds (ideally in a CAS if you can).
Start with understanding that CI is RCE-as-a-service by design. That principle will help you understand its security and lack thereof. The closer you get to a 1:1 mapping between a hash of all inputs (maybe a git SHA) and resulting artifact hash, the better security will be.
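One way to watch for drift from that 1:1 mapping is to record which artifact hash each input hash produced and flag any disagreement. This is only a sketch under assumptions: `ReproducibilityLedger` is a hypothetical name, it keeps state in memory, and it assumes your builds are deterministic enough that a mismatch is worth investigating.

```python
class ReproducibilityLedger:
    """Remembers the first artifact hash seen for each inputs hash."""

    def __init__(self):
        self._seen = {}

    def record(self, inputs_hash, artifact_hash):
        """Return True if this pair agrees with history, False on a mismatch."""
        # setdefault stores the pair on first sight and returns the stored value.
        previous = self._seen.setdefault(inputs_hash, artifact_hash)
        return previous == artifact_hash
```

A `False` result means the same inputs produced two different artifacts, which is either non-determinism in the build or something tampering with it.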
Here is an interesting tidbit from CrowdStrike’s blog post on the SUNSPOT malware that impacted SolarWinds:
“Persistence using scheduled tasks, triggered at boot time”
Why can build nodes reboot without being destroyed? Or did the attackers pwn the OS image?
So absolutely before any of the fancy CAS stuff I mentioned, start with enforcing a maximum time-to-live for build nodes, and have new nodes created from a known-good immutable image (e.g. the latest AMI) that installs the latest OS updates at boot (the default for Amazon Linux AMIs).
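The TTL rule itself is simple enough to sketch. The snippet below assumes you can list your nodes with their launch times from whatever orchestrator you use; `BuildNode` and `expired_nodes` are hypothetical names, and actually terminating the returned nodes is left to your tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class BuildNode:
    node_id: str
    launched_at: datetime


def expired_nodes(nodes: list[BuildNode], max_ttl: timedelta, now: datetime) -> list[str]:
    """Return the IDs of nodes that have lived at least max_ttl."""
    return [n.node_id for n in nodes if now - n.launched_at >= max_ttl]
```

Passing `now` in explicitly keeps the function easy to test; a scheduled job would call it with the current UTC time and destroy whatever comes back.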
Here’s the absolutely easy and lazy way: configure an EC2 Auto-Scaling Group for your build nodes that scales down to zero. There ya go, you don’t even need to do anything to kill your nodes now.
In case anyone was wondering, yes I did set all of this up years ago. It took me a few hours on a Saturday afternoon (so not as much need for CI) and I even had time to look at spot price history to tweak the spot price and machine type to save us mad $$$.
Oh yeah, you also need to define a lambda that is called by your build orchestrator to trigger that EC2 ASG to scale up from zero.
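A rough sketch of such a Lambda, assuming boto3 and an Auto Scaling group that already exists. The environment variables `BUILD_ASG_NAME` and `BUILD_ASG_MAX`, and the `queued_jobs` field in the event, are made up for illustration; your orchestrator will send something different. The only real AWS call used here is the Auto Scaling client's `set_desired_capacity`.

```python
import os


def scale_up(autoscaling, group_name, queued_jobs, max_nodes):
    """Ask the ASG for one node per queued job, between 1 and max_nodes."""
    desired = min(max(int(queued_jobs), 1), max_nodes)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=False,  # do not wait out a cooldown when jobs are queued
    )
    return desired


def handler(event, context):
    """Lambda entry point; the event shape is an assumption, not a real API."""
    import boto3  # imported here so scale_up can be tested without AWS

    client = boto3.client("autoscaling")
    desired = scale_up(
        client,
        group_name=os.environ["BUILD_ASG_NAME"],
        queued_jobs=event.get("queued_jobs", 1),
        max_nodes=int(os.environ.get("BUILD_ASG_MAX", "4")),
    )
    return {"desired_capacity": desired}
```

Capping at `max_nodes` matters because a desired capacity above the group's maximum size is rejected. Scaling back down to zero is left to the group's own scaling policy.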
This Buildkite AWS CloudFormation stack does it very well.
My simplest and most generic advice is to have an isolated builder pool and artifact cache per project. For example, have separate staging and prod cloud accounts per service, with their CI pipelines run separately in those cloud accounts. Then only share caches within the same environment.