
Detecting Credential Compromise in AWS

Will describes a process he developed at Netflix to detect compromised AWS instance credentials (STS credentials) used outside of the environment in which they were issued. And it doesn’t even use ML!

Will Bengston, Senior Security Engineer, Netflix twitter, linkedin
abstract slides paper video

If Will had said ‘machine learning’ and thrown in ‘blockchain’, he’d probably be relaxing on a beach sipping margaritas with those sweet VC dollaz, rather than giving this talk. But fortunately for the security community, he’s continuing to share more awesome security research.

The AWS Security Token Service (STS) is a web service that enables you to request temporary, limited-privilege credentials for AWS Identity and Access Management (IAM) users or for users that you authenticate (federated users).

When you create a credential in AWS, it works anywhere - in your environment or from anywhere on the Internet. Will and his colleagues wanted to lock this down, so that Netflix AWS credentials could only be used from instances owned by Netflix.

This is important because attackers can use vulnerabilities like XXE and SSRF to steal AWS instance credentials and use them to steal sensitive customer data or IP, spin up many servers to do cryptomining, cause a denial of service to Netflix’s customers, and more.
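To make the risk concrete, here’s a minimal sketch (mine, not from the talk) of how exfiltrated instance credentials can be used from an attacker’s own machine with boto3; the credential values are placeholders:

import boto3

# Hypothetical values exfiltrated from the instance metadata service at
# http://169.254.169.254/latest/meta-data/iam/security-credentials/<role-name>
stolen = {
    "AccessKeyId": "ASIA...",
    "SecretAccessKey": "example-secret",
    "Token": "example-session-token",
}

# By default these temporary STS credentials work from any network location,
# so the attacker can load them into a session anywhere on the Internet.
session = boto3.Session(
    aws_access_key_id=stolen["AccessKeyId"],
    aws_secret_access_key=stolen["SecretAccessKey"],
    aws_session_token=stolen["Token"],
)
print(session.client("sts").get_caller_identity()["Arn"])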

AWS GuardDuty can detect when instance credentials are used outside of AWS, but not when they’re used by attackers operating within AWS.

How do we detect when a credential is used outside of our environment?

This is challenging due to Netflix’s scale (they have hundreds of thousands of instances at any given point in time), and their environment is very dynamic: the IP addresses they control are constantly changing, so it’s not trivial to determine which IPs they owned at a given point in time.

Another aspect that makes this hard is AWS’s API rate limiting - using the AWS APIs to fully describe their environment across the three regions they’re in takes several hours.
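To give a feel for why this is slow, here’s a minimal sketch (assuming a hard-coded region list) of what “fully describing the environment” looks like with boto3; at Netflix’s scale this means a huge number of paginated DescribeInstances calls per account, all subject to rate limiting:

import boto3

REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]  # placeholder region list

def owned_ips():
    """Collect every private/public instance IP visible in one account."""
    ips = set()
    for region in REGIONS:
        ec2 = boto3.client("ec2", region_name=region)
        # Each page is a separate API call, and each call counts against
        # the DescribeInstances rate limit.
        for page in ec2.get_paginator("describe_instances").paginate():
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    ips.add(instance.get("PrivateIpAddress"))
                    ips.add(instance.get("PublicIpAddress"))
    ips.discard(None)
    return ips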

The solution Will ultimately found success with leverages CloudTrail.

AWS CloudTrail provides event history of your AWS account activity, including actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services.

These logs are accessible via the console, or they can be delivered to S3 or CloudWatch Logs. Note that delivered logs can be 15 to 20 minutes delayed, so any detection based on these logs will be a bit delayed as well.

Here’s an example CloudTrail log file from the docs:

{"Records": [{
    "eventVersion": "1.0",
    "userIdentity": {
        "type": "IAMUser",
        "principalId": "EX_PRINCIPAL_ID",
        "arn": "arn:aws:iam::123456789012:user/Alice",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "accountId": "123456789012",
        "userName": "Alice"
    },
    "eventTime": "2014-03-06T21:22:54Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "StartInstances",
    "awsRegion": "us-east-2",
    "sourceIPAddress": "205.251.233.176",
    "userAgent": "ec2-api-tools 1.6.12.2",
    "requestParameters": {"instancesSet": {"items": [{"instanceId": "i-ebeaf9e2"}]}},
    "responseElements": {"instancesSet": {"items": [{
        "instanceId": "i-ebeaf9e2",
        "currentState": {
            "code": 0,
            "name": "pending"
        },
        "previousState": {
            "code": 80,
            "name": "stopped"
        }
    }]}}
}]}
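As a rough sketch of how these delivered log files might be consumed (the bucket and key below are placeholders), each gzipped file is pulled from S3 and the fields relevant to the detection are extracted from every record:

import gzip
import json

import boto3

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="my-cloudtrail-bucket", Key="path/to/delivered/log.json.gz")
records = json.loads(gzip.decompress(obj["Body"].read()))["Records"]

for record in records:
    identity = record.get("userIdentity", {})
    print(
        identity.get("arn"),
        identity.get("accessKeyId"),
        record.get("sourceIPAddress"),
        record.get("eventTime"),
    )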

The first approach Will tried worked as follows (a rough sketch follows the list):

  • Learn all of the IPs in your environment (across accounts) for the past hour

  • Compare each IP found in CloudTrail to the list of IPs

    • If we had the IP at the time of the log - keep going 👍

    • If we DID NOT have the IP at the time of the log, ALERT
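A minimal sketch of that comparison (ips_owned_in_window stands in for the hourly IP inventory; alerting is reduced to a print):

def check_record(record, ips_owned_in_window):
    """Return True if the call came from an IP we owned during the window."""
    source_ip = record.get("sourceIPAddress", "")
    if source_ip in ips_owned_in_window:
        return True  # we had the IP at the time of the log
    print(f"ALERT: credential used from unknown IP {source_ip}")
    return False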

However, this approach did not end up working due to AWS API restrictions (pagination and rate limiting).

The solution that did work leveraged an understanding of how AWS works and made a strong but reasonable assumption. This approach lets you go from zero to full coverage in about 6 hours (the length of time the credentials remain valid), and it can be applied historically.

The strong assumption is that the first use of a given credential is legitimate, and we tie it to the source IP we observe for it.

A session table is maintained that tracks identifier, source_ip, arn, and ttl_value over time for each credential.

The pink path detects a potential compromise, that is, a CloudTrail event with an AssumedRole corresponding to a source IP that is not already present in the session table.
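Here’s a minimal in-memory sketch of that logic (my interpretation of the session table described above; a real deployment would presumably use a durable store and a TTL matching the actual session duration):

import datetime

SESSION_TTL = datetime.timedelta(hours=6)  # assumed credential lifetime
session_table = {}  # identifier -> {"source_ip", "arn", "expires"}

def check_assumed_role_event(record):
    identity = record.get("userIdentity", {})
    identifier = identity.get("accessKeyId")
    source_ip = record.get("sourceIPAddress", "")
    event_time = datetime.datetime.strptime(record["eventTime"], "%Y-%m-%dT%H:%M:%SZ")

    entry = session_table.get(identifier)
    if entry is None or event_time > entry["expires"]:
        # First observed use of this credential: assume it's legitimate
        # and pin it to the source IP we see.
        session_table[identifier] = {
            "source_ip": source_ip,
            "arn": identity.get("arn"),
            "expires": event_time + SESSION_TTL,
        }
        return True
    if entry["source_ip"] == source_ip:
        return True  # same IP as before - keep going
    # The "pink path": the same credential seen from a different IP.
    print(f"ALERT: {entry['arn']} used from {source_ip}, expected {entry['source_ip']}")
    return False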

At 28:22 Will shows a video of the process working in practice, including a Slack bot that messages a specific channel when good or bad credential usages are observed.

There are a few edge cases you need to account for to prevent false positives (a filtering sketch follows the list):

  • AWS will make calls on your behalf using your creds if certain API calls are made (sourceIPAddress: .amazonaws.com)

  • If you have AWS VPC endpoints for certain AWS services (sourceIPAddress: 192.168.0.22)

  • If you attach a new ENI or associate a new address to your instance (sourceIPAddress: something new if external subnet)
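A sketch of what pre-filtering for those edge cases could look like (the VPC endpoint CIDRs are placeholders for your own ranges; updating the session table on ENI attachment / address association events is left out):

import ipaddress

INTERNAL_VPC_CIDRS = [
    ipaddress.ip_network("192.168.0.0/16"),
    ipaddress.ip_network("10.0.0.0/8"),
]

def is_expected_source(source_ip: str) -> bool:
    # 1. AWS calling on your behalf shows a service hostname, not an IP.
    if source_ip.endswith(".amazonaws.com"):
        return True
    # 2. Calls through a VPC endpoint appear to come from a private address.
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return False
    return any(addr in net for net in INTERNAL_VPC_CIDRS)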

Preventing Credential Compromise

While this approach is effective in practice, it’s not perfect.

Using server-side request forgery (SSRF) or XXE, an attacker could steal an instance’s credentials and then use them via the same method.

Specifying a blacklist of all of the ways a URL could represent http://169.254.169.254 in a WAF is prohibitively difficult, so they tried a managed policy that protects resources that are supposed to be internal only. However, this doesn’t protect externally exposed instances, of which they have many.

They looked at what GCP was doing and observed that its metadata service requires an additional header (Metadata-Flavor: Google), which is great for protecting against these sorts of attacks, since with SSRF you typically can’t control the request’s HTTP headers.

Will went to the global AWS SDK team and asked if they’d be willing to add an additional header on every request, as it would allow them to build a proxy that protects the metadata service endpoint by blocking every request without that header.

The team said they couldn’t do that, as they didn’t want to send an additional header the IAM team wasn’t expecting.

So Will reviewed the various open source AWS SDK libraries and observed that the Ruby one wasn’t sending a user agent. He submitted a PR adding a user agent to its requests that indicates they’re coming from the AWS Ruby SDK, and when that PR was accepted, he took it to the Java SDK and Python (boto) SDK teams and got their buy-in as well.

After each SDK team had implemented a user agent clearly demarcating that requests were coming from an AWS SDK, Will went back to the global AWS SDK team and asked them to commit to keeping these user agents stable, so that AWS users could implement a proxy that deploys these protections.
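The exact user-agent strings aren’t given in the talk summary, so the prefixes below are placeholders, but the decision such a metadata proxy would make looks roughly like this:

ALLOWED_UA_PREFIXES = ("Boto3/", "Botocore/", "aws-sdk-java/", "aws-sdk-ruby3/", "aws-cli/")

def allow_metadata_request(headers: dict) -> bool:
    """Only let requests that look like they come from an AWS SDK reach 169.254.169.254."""
    user_agent = headers.get("User-Agent", "")
    return user_agent.startswith(ALLOWED_UA_PREFIXES)

Such a proxy would sit in front of the metadata endpoint (e.g., via an iptables redirect on the instance) and drop the header-less requests an SSRF primitive typically produces.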

A Masterclass in Organizational Influence
Though this story was just an aside in the overall presentation, it really stuck out to me as an impressive example of getting buy-in across diverse teams with different priorities. Nice work!

And as I called out in tl;dr sec #14, almost a year later, AWS released v2 of the Instance Metadata Service (IMDSv2), which implements several changes that make stealing instance credentials via SSRF or other web attacks significantly harder in practice.
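For reference, the IMDSv2 flow requires first obtaining a session token with a PUT request and then presenting it as a header on every metadata read, which typical SSRF primitives can’t do. A minimal sketch using the requests library (only runnable on an EC2 instance):

import requests

# Step 1: obtain a session token (PUT, with a TTL header).
token = requests.put(
    "http://169.254.169.254/latest/api/token",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "21600"},
).text

# Step 2: every metadata read must present the token.
role_names = requests.get(
    "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
    headers={"X-aws-ec2-metadata-token": token},
).text
print(role_names)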