tl;dr sec
Netflix’s Layered Approach to Reducing Risk of Credential Compromise

Clint Gibler
June 13, 2023

Will Bengston, Senior Security Engineer, Netflix
Travis McPeak, Senior Security Engineer, Netflix

An overview of efforts Netflix has undertaken to scale their cloud security, including segmenting their environment, removing static keys, auto-least privilege of AWS permissions, extensive tooling for dev UX (e.g. using AWS credentials), anomaly detection, preventing AWS creds from being used off-instance, and some future plans.

Segment Environment Into Accounts

Why? If the account gets compromised, the damage is contained.

The Netflix security teams have built a nice Paved Road for developers, a suite of useful development tools and infrastructure. When you’re using the Paved Road, everything works nicely and you have lots of tools available to make you more efficient.

But there are some power users who need to go outside the Paved Road to accomplish what they need to do.

At Netflix, the security team generally can’t block developers: they need to avoid saying “no” whenever possible.

Useful for separation of duties: Instead of blocking these power users, the security team puts them in their own AWS account so they can’t affect the rest of the ecosystem.

Useful for sensitive applications and data: Only a limited set of users can access these apps and data.

Reduce friction by investing in tooling to C.R.U.D. AWS accounts. If you want to do account-level segmentation, you need to invest in tooling that makes it easy to spin up, delete, and modify metadata for accounts. The Netflix cloud security team has invested heavily in these areas.

Remove Static Keys

Why? Static keys never expire and have led to many compromises, for example, when AWS keys in git repos are leaked to GitHub.

Instead, they want short-lived keys, delivered securely, that are rotated automatically.

Netflix does this by giving every application a role, and then the role is provided with short-lived credentials by the EC2 metadata service.
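To make the "short-lived" part concrete, here is a minimal Python sketch that checks how long a set of role credentials has left. The dictionary mirrors the JSON document shape the EC2 metadata service returns for a role (`AccessKeyId`, `SecretAccessKey`, `Token`, `Expiration`); the sample values, the function names, and the 5-minute refresh margin are assumptions for illustration, not anything described in the talk.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(creds: dict, now: datetime) -> float:
    """Return how many seconds remain before the credentials expire."""
    # The metadata service reports Expiration as a UTC timestamp ending in "Z".
    expiration = datetime.strptime(creds["Expiration"], "%Y-%m-%dT%H:%M:%SZ")
    expiration = expiration.replace(tzinfo=timezone.utc)
    return (expiration - now).total_seconds()


def needs_refresh(creds: dict, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the credentials expire within the (assumed) safety margin."""
    return seconds_until_expiry(creds, now) <= margin.total_seconds()


# Hypothetical sample document; the values are placeholders, not real keys.
sample_creds = {
    "AccessKeyId": "ASIAEXAMPLEKEYID",
    "SecretAccessKey": "example-secret",
    "Token": "example-session-token",
    "Expiration": "2023-06-13T18:00:00Z",
}
```

In practice the AWS SDKs handle this refresh for you when running on an instance with a role attached; the sketch only shows why a leaked copy of these credentials has a bounded lifetime.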

Permission Right Sizing

For many companies, it can be difficult to keep up with all of the services you’re running, and it’s easy for a service to get spun up and then forgotten, for example if its developer leaves the company or gets moved onto a different team. This represents recurring risk to your company, as these apps may have been given sensitive AWS permissions.

Netflix reduces this risk via RepoKid (source code, Enigma 2018 talk video). New apps at Netflix are granted a base set of AWS permissions. RepoKid gathers data about app behavior and automatically removes AWS permissions, rolling back if failure is detected.

When you build a cool tool, you gotta get a cool logo

This causes apps to converge to least privilege without security team interaction, and unused apps converge to zero permissions! 🎆

RepoKid uses Access Advisor and CloudTrail as data sources. Access Advisor allows it to determine, for a given service, whether it has been used within a threshold amount of time. CloudTrail provides: what actions have been called, when, and by whom?
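The core idea can be sketched in a few lines of Python. This is not RepoKid's actual code: the function names, the input shapes, and the 90-day threshold are assumptions chosen for illustration. It takes the actions a role is granted plus a per-service "last used" timestamp (the kind of data Access Advisor provides) and keeps only actions for services used recently.

```python
from datetime import datetime, timedelta


def unused_services(granted_actions, last_used_by_service, now,
                    threshold=timedelta(days=90)):
    """Return the services that were granted but not used within the threshold."""
    granted_services = {action.split(":", 1)[0] for action in granted_actions}
    stale = set()
    for service in granted_services:
        last_used = last_used_by_service.get(service)
        # A service with no recorded use at all is treated as unused.
        if last_used is None or now - last_used > threshold:
            stale.add(service)
    return stale


def right_size(granted_actions, last_used_by_service, now,
               threshold=timedelta(days=90)):
    """Return the granted actions that remain after dropping unused services."""
    stale = unused_services(granted_actions, last_used_by_service, now, threshold)
    return sorted(a for a in granted_actions if a.split(":", 1)[0] not in stale)
```

An app that never makes any calls ends up with an empty list, which matches the "unused apps converge to zero permissions" behavior described above. A real system would also need the rollback step mentioned earlier, which this sketch omits.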

Paved Road for Credentials

They wanted to have a centralized place where they could have full visibility into Netflix’s use of AWS credentials, so they built a suite of tools where they could provision credentials by accounts, roles, and apps as needed. If they could ensure that everyone used these tools, they’d know, for every AWS credential, who requested them and how they’re being used.

Before they built this tooling, developers would SSH onto boxes and access creds there, or curl an endpoint and do a SAML flow, but there wasn’t one solidified process to access creds, which made it difficult to monitor.

So the Netflix cloud security team built a service, ConsoleMe, that can handle creating, modifying, and deleting AWS creds.

Users can request credentials via a web interface using SSO or through a CLI

Another advantage of this approach is that when ConsoleMe is creating creds, it automatically injects a policy that IP restricts the creds to the VPN the requester is connected to, so even if the creds accidentally get leaked, they won’t work.
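As a rough illustration of what such an injected policy could look like, the Python sketch below builds an IAM policy document that denies every action unless the request comes from an allowed network, using the standard `aws:SourceIp` condition key. The function names and the example CIDR ranges are hypothetical, and `is_denied` is a deliberately simplified local check, not how AWS evaluates policies; the talk does not show ConsoleMe's actual policy.

```python
import ipaddress
import json


def ip_restricted_policy(allowed_cidrs):
    """Build a policy that denies all actions from outside the allowed CIDRs."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideAllowedNetworks",
            "Effect": "Deny",
            "Action": "*",
            "Resource": "*",
            "Condition": {"NotIpAddress": {"aws:SourceIp": sorted(allowed_cidrs)}},
        }],
    }


def is_denied(policy, source_ip):
    """Simplified local check: is source_ip outside every allowed CIDR?"""
    addr = ipaddress.ip_address(source_ip)
    for statement in policy["Statement"]:
        cidrs = statement["Condition"]["NotIpAddress"]["aws:SourceIp"]
        outside_all = not any(addr in ipaddress.ip_network(c) for c in cidrs)
        if statement["Effect"] == "Deny" and outside_all:
            return True
    return False


if __name__ == "__main__":
    # 203.0.113.0/24 is a documentation range standing in for a VPN egress range.
    print(json.dumps(ip_restricted_policy(["203.0.113.0/24"]), indent=2))
```

An explicit deny is used because it overrides any allow attached to the role, so leaked credentials fail from any other network regardless of what else the role permits.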

Because the cloud security team worked hard to make using ConsoleMe seamless for devs, they no longer see devs SSHing in to an EC2 instance and grabbing creds that are valid for 6 hours. Devs instead use the creds they receive from ConsoleMe, which are only valid for 1 hour, reducing potential exposure time.

Benefits:

ConsoleMe provides a central place to audit and log all access to creds.
Anomaly detection: If someone is trying to request creds to a service they don’t own, or something is behaving strangely, they can detect those anomalies and investigate.

Their biggest win has been locking credentials down to the Netflix environment, so if the creds get leaked in some way there’s no damage.

Delivery Lockdown

Netflix uses Spinnaker for continuous delivery. Several hardening improvements were made. Users can now only deploy with a role if they own the application in question, since otherwise a user might escalate their privileges by choosing a role with more permissions than they currently have. Application roles are also tagged to specific owners.
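The ownership rule can be expressed as a small authorization check. The Python sketch below is purely illustrative: the data structures (`app_owners`, `role_to_app`) and the function name are hypothetical and are not Spinnaker's API or Netflix's implementation.

```python
def can_deploy_with_role(user, app, role, app_owners, role_to_app):
    """Allow a deploy only if the user owns the app and the role is tagged to it."""
    user_owns_app = user in app_owners.get(app, set())
    role_belongs_to_app = role_to_app.get(role) == app
    return user_owns_app and role_belongs_to_app


# Hypothetical ownership data: which users own which apps,
# and which app each role has been tagged to.
app_owners = {"billing": {"alice"}, "search": {"bob"}}
role_to_app = {"billingInstanceProfile": "billing", "searchInstanceProfile": "search"}
```

With this data, `alice` can deploy `billing` with `billingInstanceProfile`, but cannot pick `searchInstanceProfile` to borrow another application's permissions.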

Prevent Instance Credentials from Being Used Off-instance

Goal: If an attacker tries to steal creds (e.g. through SSRF or XXE), the creds won’t work.

See Will’s other talk, Detecting Credential Compromise in AWS, for details.

They block AWS creds from being used outside of Netflix’s environment, and attempts to do so are used as a valuable signal of a potential ongoing attack or a developer having trouble, who they can proactively reach out to and help.

The more signals we can get about things going wrong in our environment, the better we can react.
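One way to turn off-instance use into a signal, sketched in Python under stated assumptions: for each temporary access key, remember the first source IP it was seen from and flag any later call from a different IP. The event fields (`userIdentity.accessKeyId`, `sourceIPAddress`) are real CloudTrail record fields, but the function and the "first IP wins" rule are a simplification for illustration and should not be read as the exact method from the talk.

```python
def find_off_instance_use(events):
    """Return events whose source IP differs from the first IP seen for that key.

    Assumes events are CloudTrail-style dicts sorted by time, oldest first.
    """
    first_ip_by_key = {}
    suspicious = []
    for event in events:
        key_id = event["userIdentity"]["accessKeyId"]
        ip = event["sourceIPAddress"]
        # setdefault records the first IP for a key and returns it afterwards.
        expected_ip = first_ip_by_key.setdefault(key_id, ip)
        if ip != expected_ip:
            suspicious.append(event)
    return suspicious
```

A flagged event is not proof of compromise. As noted above, it might equally be a developer who copied credentials to a laptop and needs help finding the supported workflow.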

Improving Security and Developer UX
One thing Travis and Will mentioned a few times, which I think is really insightful, is that the logging and monitoring they've set up can both detect potential attacks as well as let them know when a developer may be struggling, either because they don't know how systems work or if they need permissions or access they don't currently have.

Oftentimes the security team plays the role of locking things down. Things become more secure, but also harder to use. This friction either slows down development or causes people to go around your barriers to get their jobs done.

What's so powerful about this idea is the point that the systems you build to secure your environment can also be used to detect when these systems are giving people trouble, so the security team can proactively reach out and help.

Imagine you were starting to use a new open source tool. You're having trouble getting it to work, and then the creator sends you a DM, "Hey, I see you're trying to do X. That won't work because of Y, but if you do Z you'll be able to accomplish what you're trying to do. Is that right, or is there something else I can help you with?" Holy cow, that would be awesome 😍

One thing I've heard again and again from security teams at a number of companies, for example, in our panel Lessons Learned from the DevSecOps Trenches, is that to really get widespread adoption of security initiatives more broadly in your org, the tooling and workflow needs to not just be easy and frictionless, it ideally also needs to provide additional value / make people's lives better than what they were previously doing.

Keep this in mind next time your security team is embarking on a new initiative. After all, a technically brilliant tool or process isn't that useful if no one uses it.

Detect Anomalous Behavior in Your Environment

Netflix tracks baseline behavior for accounts: they know what apps and users are doing, and they know what’s normal. This lets you do neat things once you realize:

Some regions, resources, & services shouldn’t be used 🛑

Netflix only uses certain AWS regions, resources and services - some they don’t use at all. Thus when activity occurs in an unused region, or an AWS service that is not used elsewhere generates some activity, it’s an immediate red flag that should be investigated.

Unused Services

A common attack pattern is that once an attacker gets hold of some AWS credentials or has shell access to an instance, they run an AWS enumeration script that determines the permissions they have by iteratively making a number of API calls. When unused services are called, the Netflix cloud security team is automatically alerted so they can investigate.

This approach has been used to stop bug bounty researchers quickly and effectively.
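A minimal Python sketch of this kind of check is shown below. It uses two real CloudTrail record fields, `awsRegion` and `eventSource`, but the allowlists and the function name are hypothetical; the talk does not say which regions or services Netflix actually uses.

```python
# Hypothetical allowlists; a real deployment would derive these from observed usage.
ALLOWED_REGIONS = {"us-east-1", "us-west-2", "eu-west-1"}
ALLOWED_SERVICES = {"ec2.amazonaws.com", "s3.amazonaws.com", "sts.amazonaws.com"}


def flag_event(event, allowed_regions=ALLOWED_REGIONS,
               allowed_services=ALLOWED_SERVICES):
    """Return a list of reasons this CloudTrail-style event looks out of place."""
    reasons = []
    if event["awsRegion"] not in allowed_regions:
        reasons.append(f"unused region: {event['awsRegion']}")
    if event["eventSource"] not in allowed_services:
        reasons.append(f"unused service: {event['eventSource']}")
    return reasons
```

An enumeration script that walks through every AWS service trips the second check quickly, since most of the services it probes will be ones the organization never touches.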

Anomalous Role Behavior

This is the same idea as for services, but at the application / role level. Applications tend to have relatively consistent behavior, which can be determined by watching CloudTrail.

The cloud security team watches for applications that start behaving very differently as well as common attacker first steps once they gain access (e.g. s3:ListBuckets, iam:ListAccessKeys, sts:GetCallerIdentity, which is basically whoami on Linux). These API calls are useful for attackers, but not something an application would ever need to do.
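The same idea at the role level can be sketched in Python as follows. The baseline is simply the set of API calls each role has made historically (the kind of thing you could build from CloudTrail `eventName` values), and the reconnaissance list comes from the examples above. The function names and the three-way "recon / new / normal" labels are assumptions for illustration.

```python
from collections import defaultdict

# API calls named above as typical attacker first steps.
RECON_CALLS = {"ListBuckets", "ListAccessKeys", "GetCallerIdentity"}


def build_baseline(history):
    """Map each role to the set of API calls it has made in the past.

    `history` is an iterable of (role, event_name) pairs.
    """
    baseline = defaultdict(set)
    for role, event_name in history:
        baseline[role].add(event_name)
    return baseline


def classify_call(baseline, role, event_name):
    """Label a call as 'recon', 'new' (never seen for this role), or 'normal'."""
    if event_name in RECON_CALLS:
        return "recon"
    if event_name not in baseline.get(role, set()):
        return "new"
    return "normal"
```

Here "recon" calls are flagged regardless of the baseline, reflecting the point that an application would never need them, while "new" calls are only interesting relative to what that specific role normally does.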

Future

Travis and Will shared a few items on the Netflix cloud security team’s future road map.

One Role Per User

Traditionally Netflix has had one role that’s common to a class of users; that is, many apps that need roughly the same set of permissions are assigned the same AWS role.

However, there are likely at least slight differences between the permissions these apps need, which means some apps are over-provisioned. Further, grouping many apps under the same role makes it harder to investigate potential issues and do anomaly detection.

In the future, when every user/app has their own AWS role, they can guarantee least privilege as well as do fine-grained anomaly detection.

Remove Users from Accounts They Don’t Use

Will and Travis would like to automatically remove users from AWS accounts they don’t use. This reduces the risk of user workstation compromise by limiting the attacker’s ability to pivot to other, more interesting resources: an attacker who compromises a dev laptop only gains access to the services that developer actively uses.

Offboarding is hard. Devs may stop working on a project, move between teams, or leave the company. Having an automated process that detects when someone hasn’t used a given account within a threshold amount of time and removes the access would significantly help keep things locked down over time.
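Such a process might look roughly like the Python sketch below. Since this is described as future work, everything here is an assumption: the input shapes (a set of `(user, account)` grants and a map of last-use timestamps), the function name, and the 90-day threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_access(grants, last_used, now, threshold=timedelta(days=90)):
    """Return (user, account) grants not exercised within the threshold.

    `grants` is a set of (user, account) pairs; `last_used` maps those pairs
    to the datetime of the most recent activity, if any was ever recorded.
    """
    stale = []
    for user, account in sorted(grants):
        used = last_used.get((user, account))
        # Access that was never used is treated the same as long-unused access.
        if used is None or now - used > threshold:
            stale.append((user, account))
    return stale
```

The output is a list of candidates for removal; whether to revoke automatically or notify the user first is a policy decision the sketch leaves open.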

Whole > Sum of the Parts

All of these components are useful in isolation, but when you layer them together, you get something quite hard to overcome as an attacker, as there are many missteps that can get them detected: they need to know about the various signals Netflix is collecting, which services are locked down, etc. The goal is to frustrate attackers and cause them to go for easier targets.