This page takes the information collected and applies it to some common risk scenarios to help define the overall risk of your cluster.

Determining the Risk

Measuring risk is its own topic, and many organizations have already chosen how they assess it.

I like to cite Magoo Simple Risk Measurement as a great first step into analyzing risk.

I’m not going to go deeply into this model, but with this information, you can start working towards performing risk analysis and managing your organization’s overall risk when it comes to its Kubernetes clusters.

Risk Scenario: A Pod Is Compromised

Evidence

If bugs didn’t exist in applications, we’d all be out of jobs. I’ve witnessed a multitude of examples where a Pod compromise damaged the organization in some way. When this occurs, Kubernetes’ security comes into play.

My example has always been this: Imagine you have to run an outdated version of ImageMagick and grant unauthenticated users permission to upload files for it to process. How will your Kubernetes cluster hold up?

Outcomes

An application has a remote code execution vulnerability and a Pod is compromised. From there, an attacker is able to steal the service account tokens inside the Pod. You need to ask:

  • What can they do with the Kubernetes API?
  • Are there any Namespace boundaries that would prevent an attacker from accessing other objects?
  • What network restrictions would prevent them from lateral movement within the cluster?
  • What secrets or credentials are exposed to the attacker?
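
One way to start answering the first question is to ask the API server what a stolen service account token would be allowed to do. The sketch below only builds `kubectl auth can-i` command lines that use impersonation; it does not run them. The namespace and service account names are hypothetical, and running the commands assumes your own user is allowed to impersonate service accounts.

```python
import shlex


def can_i_commands(namespace: str, service_account: str) -> list[list[str]]:
    """Build kubectl commands that show what a service account may do.

    Nothing is executed here; the commands are returned so they can be
    reviewed first and then run by someone with impersonation rights.
    """
    identity = f"system:serviceaccount:{namespace}:{service_account}"
    base = ["kubectl", "auth", "can-i", "--as", identity]
    return [
        # Everything the identity may do inside its own namespace.
        base + ["--list", "--namespace", namespace],
        # Spot checks for permissions that make pivoting easy.
        base + ["get", "secrets", "--all-namespaces"],
        base + ["create", "pods", "--namespace", namespace],
    ]


# Hypothetical workload identity, used for illustration only.
for command in can_i_commands("uploads", "image-processor"):
    print(shlex.join(command))
```

If the spot checks come back with "yes", the Pod’s token is worth a lot more to an attacker than the application itself.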

Decision

There are a lot of controls for preventing an attacker from pivoting. That could include:

  • An admission controller like OPA Gatekeeper that restricts the types of Pods that can be created
  • A CNI used with Network Policies to block a service from accessing other adjacent subnets or namespaces.
  • A multi-tenancy architecture that uses namespaces
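
As a rough illustration of the Network Policy control, here is a minimal sketch that generates a default-deny policy for one namespace. The namespace name is hypothetical, the manifest is emitted as JSON (which `kubectl apply -f` also accepts), and it only has an effect if your CNI actually enforces Network Policies.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Return a NetworkPolicy that blocks all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every Pod in the namespace.
            "podSelector": {},
            # Listing both types with no allow rules denies both directions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


# Hypothetical namespace, used for illustration only.
print(json.dumps(default_deny_policy("uploads"), indent=2))
```

A policy like this also blocks DNS lookups, so in practice you would add narrow allow rules for exactly the traffic the workload needs.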

Tales from the Industry

One of those large organizations “with a startup mentality” needed to execute a very simple microservice that would take a PCAP from a customer and count the number of packets. This was done by using Wireshark’s packet processor. Wireshark is a great tool but is known to have some CVEs. If you’re trying to reduce the risk and you can’t reduce the exploitability, all you can do is reduce the impact. You can do this by:

  1. Restricting the workload from accessing network services. Set up cloud network policies for clusters, disable the metadata service, and use Kubernetes Network Policies to prevent pod-to-pod communication outside of what the workload needs.
  2. Securing the runtime, which in most cases means putting very strict constraints on the Pod itself. If it really was ImageMagick, we could likely drop all capabilities, have strict storage constraints, and define clear resource restrictions. If you wanted to get crazy, you could even build a custom seccomp profile that defines which syscalls it is allowed to make. Something that’s becoming more common, now that we can start trusting sandboxed runtimes, is to use an alternative runtime such as gVisor, which is designed to sandbox the process much more than the default runc runtime that containerd launches.
  3. Isolating the workload within the cluster. By using Node Pools with Node Authorization enabled and using NodeSelectors, you can run your high risk workloads in a separate area. If/when a compromise happens, the blast will hopefully only affect the workloads in that NodePool or Node.
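
To make the second and third points more concrete, here is a minimal sketch of a locked-down Pod manifest, generated as JSON. The Pod name, image, node label, and the `gvisor` RuntimeClass name are all assumptions for illustration; the limits are placeholders, not recommendations.

```python
import json


def hardened_pod(name: str, image: str) -> dict:
    """Return a Pod manifest with a tightly constrained, sandboxed container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Assumption: a RuntimeClass named "gvisor" exists in the cluster.
            "runtimeClassName": "gvisor",
            # Assumption: nodes in the isolated pool carry this label.
            "nodeSelector": {"workload-risk": "high"},
            # A file parser has no reason to talk to the Kubernetes API.
            "automountServiceAccountToken": False,
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "securityContext": {
                        "allowPrivilegeEscalation": False,
                        "readOnlyRootFilesystem": True,
                        "runAsNonRoot": True,
                        "capabilities": {"drop": ["ALL"]},
                    },
                    "resources": {
                        # Placeholder values; size them for the real workload.
                        "limits": {
                            "cpu": "500m",
                            "memory": "256Mi",
                            "ephemeral-storage": "1Gi",
                        }
                    },
                }
            ],
        },
    }


# Hypothetical name and image, used for illustration only.
pod = hardened_pod("packet-counter", "registry.example.internal/packet-counter:1.0")
print(json.dumps(pod, indent=2))
```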

In this case, their solution was to limit the syscalls that the container was allowed to make using a custom seccomp profile applied to the runtime. That of course isn’t the perfect solution, but it’s a good step toward limiting what an exploit can do.
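
I don’t have their actual profile to share, so here is a minimal sketch of what a deny-by-default seccomp profile looks like and how a Pod can reference it. The syscall list is illustrative and almost certainly incomplete for any real program, and the profile path is hypothetical (it is resolved relative to the kubelet’s seccomp directory on the node).

```python
import json


def seccomp_profile(allowed_syscalls: list[str]) -> dict:
    """Build a deny-by-default seccomp profile in the JSON format runtimes load."""
    return {
        # Any syscall not listed below fails with an error instead of running.
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(allowed_syscalls)), "action": "SCMP_ACT_ALLOW"}
        ],
    }


def seccomp_security_context(profile_path: str) -> dict:
    """Pod-level securityContext that points at a profile file on the node."""
    return {"seccompProfile": {"type": "Localhost", "localhostProfile": profile_path}}


# Illustrative and incomplete: a real allowlist comes from tracing the workload.
EXAMPLE_SYSCALLS = ["read", "write", "openat", "close", "fstat", "mmap", "brk", "exit_group"]

print(json.dumps(seccomp_profile(EXAMPLE_SYSCALLS), indent=2))
print(json.dumps(seccomp_security_context("profiles/packet-counter.json"), indent=2))
```

Building the allowlist is the hard part: you typically trace the workload under realistic input and expect a few rounds of breakage before the list is right.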

Risk Scenario: A Developer Compromises a Cluster

Evidence

If you don’t know this, I’ll let you in on a secret: most of the time, organizations that shoot themselves in the foot don’t need to publicly disclose that it happened. Shocker!

My point is that if you haven’t worked with Kubernetes yet, you might not appreciate the risk caused by its complexity. I can’t cite any news articles about how a developer took down a public web service because the YAML passed to kubectl apply had a typo, but I promise you this happens and that you should think it through.

Outcomes

Assuming it’s non-malicious, the developer will have created a denial-of-service condition. The D in STRIDE. Maybe you don’t have data loss, but you’re not going to look good, especially if it’s exposed externally.

Decision

Namespace isolation and defining multi-tenancy expectations help here. Can each developer or group of developers operate in their own confined namespace? Or does every developer have access to any namespace and therefore any other developer’s workload? Kubernetes supports RBAC controls that can define exactly what a developer can access, if they need to access the Kubernetes API at all.
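
As a rough sketch of what namespace-scoped RBAC can look like, the snippet below generates a Role and a RoleBinding that confine a group of developers to one namespace. The namespace, group name, and the exact resources and verbs are assumptions for illustration; note that secrets are deliberately left out of the rules.

```python
import json

RBAC_GROUP = "rbac.authorization.k8s.io"


def namespace_developer_rbac(namespace: str, group: str) -> list[dict]:
    """Return a Role and RoleBinding that scope a group to a single namespace."""
    role = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "Role",
        "metadata": {"name": "developer", "namespace": namespace},
        "rules": [
            {
                # "" is the core API group (pods, services); "apps" holds deployments.
                "apiGroups": ["", "apps"],
                "resources": ["pods", "pods/log", "services", "deployments"],
                "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
            }
        ],
    }
    binding = {
        "apiVersion": f"{RBAC_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "developer-binding", "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group, "apiGroup": RBAC_GROUP}],
        "roleRef": {"apiGroup": RBAC_GROUP, "kind": "Role", "name": "developer"},
    }
    return [role, binding]


# Hypothetical namespace and group, used for illustration only.
for manifest in namespace_developer_rbac("team-payments", "payments-developers"):
    print(json.dumps(manifest, indent=2))
```

Because both objects are namespaced, nothing here grants access to any other team’s namespace.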

To prevent a single user from bricking a cluster, the name of the game is resource isolation. Kubernetes supports defining resource restrictions for workloads, which help minimize the risk of resource-based denial of service like fork bombs. In the same way, you can place restrictions on how a service will autoscale. Sometimes you’d rather have a DoS condition than the bill for how the service scaled up.
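
Here is a minimal sketch of both ideas: a LimitRange that gives every container in a namespace default and maximum CPU and memory, and a HorizontalPodAutoscaler with a hard cap on replicas. All names and numbers are placeholders I picked for illustration.

```python
import json


def container_limit_range(namespace: str) -> dict:
    """LimitRange applying default and maximum CPU/memory to each container."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-limits", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Used when a container declares no requests or limits.
                    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
                    "default": {"cpu": "500m", "memory": "256Mi"},
                    # Upper bound a container may ask for.
                    "max": {"cpu": "2", "memory": "1Gi"},
                }
            ]
        },
    }


def capped_autoscaler(namespace: str, deployment: str, max_replicas: int) -> dict:
    """HorizontalPodAutoscaler that scales on CPU but never past max_replicas."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": deployment},
            "minReplicas": 1,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Resource",
                    "resource": {
                        "name": "cpu",
                        "target": {"type": "Utilization", "averageUtilization": 70},
                    },
                }
            ],
        },
    }


# Hypothetical namespace and deployment, used for illustration only.
print(json.dumps(container_limit_range("team-payments"), indent=2))
print(json.dumps(capped_autoscaler("team-payments", "web", 10), indent=2))
```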

Improving secrets management will add another layer of security an attacker will need to bypass. HashiCorp’s Vault is a popular cloud-agnostic tool for managing secrets, and there are a series of guides on how to run Vault with Kubernetes.

Lastly, be aware of single points of failure. Image registries, for one, are important because if the cluster isn’t able to pull an image, the service will simply fail. Similarly, if you’re using Operators or Dynamic Admission Controllers, their downtime can bring the cluster to a screeching halt.

Tales from the Industry

A mid-size organization hired an intern to help with development. The intern seemed to be more “hip” to modern things like DevOps and Kubernetes, so they unleashed them into the environment. The intern modified a Dockerfile and pushed the changes to the image registry that deployed into the dev Kubernetes cluster. This consequently took down some production services. I’m sure you’ve heard a version of this story from your coworkers at some point.

But let me add some extra plot twists to make it more interesting:

  • The intern wasn’t given prod access so the compromise happened because the image registries were shared between dev and prod. When an image was borked in dev, it also was deployed to prod.
  • The compromise didn’t trigger any alerts and the monitoring couldn’t determine what exactly had happened. CPU and memory spiked but there wasn’t enough information to determine where.
  • The root cause was only identified because the intern said “uh… I think I did something.” There wasn’t a sufficient audit log that showed what was happening in dev and in the image registry to point back to the source easily.

There are a lot of lessons in this scenario about access controls, auditing, monitoring, and resource restrictions, but what do you think they fixed first? The image registry. Their policy was that Prod and Dev shall not share resources and shall not communicate with each other, so it was important to make sure this was a security boundary.
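
A boundary like that can also be checked mechanically. The sketch below is my own illustration, not what this organization built: it checks whether every image in a Pod manifest comes from the registry assigned to its environment. The registry hostnames are hypothetical, and a check like this would normally live in an admission controller or a CI step.

```python
# Hypothetical registries; each environment may only pull from its own.
ALLOWED_REGISTRIES = {
    "dev": ("registry.dev.example.internal/",),
    "prod": ("registry.prod.example.internal/",),
}


def image_allowed(image: str, environment: str) -> bool:
    """True if the image reference starts with a registry allowed for the environment."""
    prefixes = ALLOWED_REGISTRIES.get(environment, ())
    # str.startswith accepts a tuple; an empty tuple never matches.
    return image.startswith(prefixes)


def disallowed_images(pod: dict, environment: str) -> list[str]:
    """Return every container image in the Pod that breaks the registry boundary."""
    spec = pod.get("spec", {})
    containers = (spec.get("initContainers") or []) + (spec.get("containers") or [])
    return [c["image"] for c in containers if not image_allowed(c["image"], environment)]


example_pod = {
    "spec": {
        "containers": [
            {"name": "web", "image": "registry.prod.example.internal/web:2.3"},
            {"name": "sidecar", "image": "registry.dev.example.internal/debug:latest"},
        ]
    }
}
print(disallowed_images(example_pod, "prod"))
```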

The best part of this story is there never was an intern. It was me. I took down the cluster during a penetration test when I found image registry credentials in a compromised pod. 😅

Risk Scenario: A Developer is Compromised

Evidence

Developers can get compromised by malware, ransomware, spear phishing, backdoored dependencies, and more. There’s only so much you can do to prevent this from happening. It’s going to happen, and if the attacker is able to gain access to their kubectl token or credentials to push code into production, you’re going to need Kubernetes to be configured appropriately.

Outcomes

An attacker steals the SSH keys, credentials, or kubectl token so that they can push code into production and compromise the host. Some questions to explore when thinking through an attacker with dev credentials:

  • Can they pivot to other services or are access controls in place?
  • Is the metadata service exposed or has this been locked down?
  • Can they steal other secrets within the cluster or are they confined just to those the developer had access to?
  • Can they attack the CI/CD system to compromise other workloads?
  • Can they push images into the production image registry?

Decision

The decisions here come down to the overall security beliefs of the organization. Does the organization want to try and restrict every developer from doing anything in the cluster without approval? Probably not in a DevOps world. So you’ll have to balance granting enough access to do what they need to do, while limiting the blast radius of a compromise.

Some organizations have chosen to abstract access to the cluster away so that no users have direct kubectl access. Instead, applications are deployed via the CI/CD process of the org. Tools like Pulumi can handle much of the heavy lifting, but there are also simpler solutions like Cluster.Dev’s GitHub Action.

Tales from the Industry

I can tell you some stories about organizations that have all but solved this problem. My favorite way is by blocking all access to the Kubernetes cluster itself. Done. Security problem solved! Kidding. I mean they blocked console access for all developers and abstracted away the need for the Kubernetes API. This way their developers don’t need to kubectl apply their YAML directly into production. There have been a variety of ways to do this:

Some organizations I’ve worked with leverage GitOps and a CI/CD flow so that you can push code to production, but it has to be reviewed extensively by someone else on the team. Since we don’t want to impact velocity, many of the teams still have credentials to access the cluster via kubectl, but using them would trigger a break-glass scenario within the org. This is a pretty low-cost solution that is decent at risk mitigation.
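
A break-glass policy only works if you can tell when the glass was broken. As a sketch under some assumptions (API server audit logging is enabled and writes one JSON event per line), the snippet below flags changes made directly with kubectl by non-system users. The sample events are fabricated for the example and trimmed to the fields the function reads.

```python
import json
from typing import Iterable, Iterator

MUTATING_VERBS = {"create", "update", "patch", "delete", "deletecollection"}


def direct_kubectl_changes(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a summary of each audit event where a human changed state via kubectl."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        username = event.get("user", {}).get("username", "")
        if (
            event.get("verb") in MUTATING_VERBS
            and event.get("userAgent", "").startswith("kubectl/")
            # Skip service accounts and control-plane identities.
            and not username.startswith("system:")
        ):
            ref = event.get("objectRef", {})
            yield {
                "user": username,
                "verb": event["verb"],
                "resource": ref.get("resource"),
                "namespace": ref.get("namespace"),
                "name": ref.get("name"),
            }


# Fabricated sample events, for illustration only.
SAMPLE_LOG = [
    '{"verb": "patch", "user": {"username": "alice@example.com"}, "userAgent": "kubectl/v1.27.0 (linux/amd64)", "objectRef": {"resource": "deployments", "namespace": "prod", "name": "web"}}',
    '{"verb": "get", "user": {"username": "alice@example.com"}, "userAgent": "kubectl/v1.27.0 (linux/amd64)", "objectRef": {"resource": "pods", "namespace": "prod"}}',
    '{"verb": "update", "user": {"username": "system:serviceaccount:ci:deployer"}, "userAgent": "deployer/1.0", "objectRef": {"resource": "deployments", "namespace": "prod", "name": "web"}}',
]

for change in direct_kubectl_changes(SAMPLE_LOG):
    print(change)
```

Anything this reports should line up with a break-glass request; if it doesn’t, that is worth an alert.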

Another organization that comes to mind had clusters of clusters and used clusters to build other clusters. That may sound crazy, but the multi-cluster model is going to become much more popular in the future. They leveraged tools like Spinnaker and Terraform to build clusters through a CI/CD process. When a developer wanted a new cluster, they would visit a web application like https://givemeacluster.kubernetes.internal, so any developer could build a Kubernetes cluster at any moment. If you wanted access to a cluster via kubectl, you had to go through their custom dashboard to request it. I think this truly addressed the risk of developer compromise affecting production because each developer was isolated to their own cluster and no access to the Kubernetes API was granted without a request. This also shifted the risk to the new application that they built, so your mileage may vary.

I like this anecdote because it demonstrates the lengths to which some organizations go to limit the risk of a developer compromising a cluster. It’s a costly solution, but if the scale of your organization has created a legitimate risk to production, then it’s likely worthwhile to invest effort to build the infrastructure.

Risk Scenarios Next Steps

This of course is not comprehensive. I’ve given you some examples of scenarios to plan for but you’ll need to put in the work to build more specific ones that are impactful to your org. Some suggestions to think about:

  • Compromised build environments (see the recent SolarWinds hack, which leveraged this)
  • A public image from Docker Hub that includes ransomware
  • A Pod in production running in privileged mode
  • Attackers compromise the CI/CD pipeline (see Google’s Project Zero research on GitHub Actions: https://www.theregister.com/2020/11/03/google_project_zero_github_flaw_deadline/)
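
Some of these scenarios are easy to check for before they become incidents. As one example, here is a minimal sketch that finds privileged containers in a Pod manifest. The sample Pod is made up, and in a real cluster you would feed it the output of something like `kubectl get pods -o json`.

```python
def privileged_containers(pod: dict) -> list[str]:
    """Return the names of containers in a Pod manifest that run privileged."""
    spec = pod.get("spec", {})
    found = []
    for field in ("initContainers", "containers", "ephemeralContainers"):
        for container in spec.get(field) or []:
            security = container.get("securityContext") or {}
            if security.get("privileged") is True:
                found.append(container.get("name", "<unnamed>"))
    return found


# Made-up Pod, for illustration only.
example_pod = {
    "metadata": {"name": "log-shipper"},
    "spec": {
        "containers": [
            {"name": "shipper", "securityContext": {"privileged": True}},
            {"name": "metrics", "securityContext": {"privileged": False}},
            {"name": "web"},
        ]
    },
}
print(privileged_containers(example_pod))
```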

You know your organization better than anyone so you’ll be able to come up with scenarios that are relevant. You can take a look at the Magoo Simple Risk Measurement for support.