How Uber Continuously Monitors the Security of its AWS Environment
Uber describes its continuous cloud monitoring service and the workflows and process design that have led engineering teams to adopt it successfully.
This post contains my notes and snippets from an excellent two-part blog post by Uber’s security engineering team.
I originally referenced these posts in tl;dr sec #40.
Uber has built a cloud-agnostic, continuous monitoring service called Cloud Monitoring (CMON), which this article describes.
How it Works
CMON uses Hammer, an open source project by Dow Jones that consists of a collection of Lambda functions written in Python. These Lambdas act as configuration violation checkers across AWS accounts and aggregate all the findings in an easy-to-manage set of DynamoDB tables in a centralized account.
Hammer is the single point of integration used to consume all AWS security findings into CMON. This enables Uber’s security team to get a correlatable view of all findings, enabling visibility across the larger organization, member accounts, and resources.
CMON is an on-premises Golang service that acts on Hammer updates, including newly discovered issues as well as existing issues that have been resolved.
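To make "acting on Hammer updates" concrete, here is a minimal sketch of how new and resolved issues could be derived by comparing two snapshots of findings. This is my own illustration, not Uber's code (CMON itself is written in Go); the snapshot shape, the finding IDs, and `diff_findings` are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical snapshot shape: finding ID -> short description.
Snapshot = Dict[str, str]


def diff_findings(previous: Snapshot, current: Snapshot) -> Tuple[Set[str], Set[str]]:
    """Return (newly_opened, resolved) finding IDs between two snapshots."""
    opened = set(current) - set(previous)    # present now, absent before
    resolved = set(previous) - set(current)  # present before, absent now
    return opened, resolved


# Made-up example data.
previous = {
    "s3-public:bucket-a": "Bucket allows public read",
    "sg-open:sg-123": "Security group open to the internet",
}
current = {
    "sg-open:sg-123": "Security group open to the internet",
    "iam-key-old:user-x": "Access key not rotated",
}

opened, resolved = diff_findings(previous, current)
print("opened:", sorted(opened))
print("resolved:", sorted(resolved))
```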
CMON calculates vulnerability ratings to help prioritize cloud security findings. CMON also integrates with other Uber security services, for example by generating customized, actionable Jira tickets for resource owners.
Note how CMON includes useful context in the Jira issue, such as risk level, solutions, and commentary. Crucially, it details what is wrong and how to fix it.
CMON automatically processes security issue lifecycle states such as opening, resolving, and whitelisting. It leverages the Engineering Security team’s vulnerability framework to calculate risk score and urgency.
Hammer has built-in capabilities to discover and verify cloud security fixes. CMON has several automated features such as periodic follow-ups/escalations, risk visibility, and incident management.
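The posts do not publish the rating formula, so the following is only a sketch of the general idea: score a finding, map the score to a risk level, and put that level next to "what is wrong" and "how to fix it" in the ticket. The weights, thresholds, field names, and ticket shape are assumptions of mine, and the ticket is a plain dictionary rather than a call to any real Jira API.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    issue_class: str        # hypothetical label, e.g. "s3_public_acl"
    resource: str
    severity: int           # assumed scale: 1 (low) to 5 (critical)
    internet_exposed: bool
    production: bool


def risk_score(f: Finding) -> int:
    """Assumed scoring: base severity plus bumps for exposure and environment."""
    score = f.severity * 2
    if f.internet_exposed:
        score += 3
    if f.production:
        score += 2
    return score


def risk_level(score: int) -> str:
    """Map a numeric score to a level using made-up thresholds."""
    if score >= 12:
        return "High"
    if score >= 7:
        return "Medium"
    return "Low"


def build_ticket(f: Finding, what_is_wrong: str, how_to_fix: str) -> dict:
    """Assemble the context a resource owner needs in one place."""
    level = risk_level(risk_score(f))
    return {
        "summary": f"[{level}] {f.issue_class} on {f.resource}",
        "risk_level": level,
        "what_is_wrong": what_is_wrong,
        "how_to_fix": how_to_fix,
    }


finding = Finding("s3_public_acl", "bucket-a", severity=4,
                  internet_exposed=True, production=True)
ticket = build_ticket(
    finding,
    what_is_wrong="The bucket ACL grants read access to everyone.",
    how_to_fix="Remove the public grant or enable Block Public Access.",
)
print(ticket["summary"])
```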
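One way to picture the lifecycle handling is as a small state machine. The states below mirror the ones named above (open, resolved, whitelisted), but the event names and transition rules are assumptions for illustration only.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    RESOLVED = "resolved"
    WHITELISTED = "whitelisted"


# Hypothetical events and transitions; unknown combinations leave the state unchanged.
TRANSITIONS = {
    (State.OPEN, "fix_verified"): State.RESOLVED,
    (State.OPEN, "whitelist"): State.WHITELISTED,
    (State.RESOLVED, "detected_again"): State.OPEN,
    (State.WHITELISTED, "whitelist_removed"): State.OPEN,
}


def next_state(state: State, event: str) -> State:
    """Return the state after applying an event, or the same state if it does not apply."""
    return TRANSITIONS.get((state, event), state)


state = State.OPEN
for event in ["fix_verified", "detected_again", "whitelist"]:
    state = next_state(state, event)
    print(event, "->", state.value)
```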
This was just a small section in the article, but it seems hugely valuable to me. I bet removing (or at least minimizing) the manual follow-up burden saves hundreds to thousands of person-hours.
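As a rough sketch of what automated follow-up could look like, the function below decides whether a ticket should get a reminder or an escalation based on its age and risk level. The intervals and the "escalate at twice the reminder interval" rule are invented for this example; the posts do not describe Uber's actual schedule.

```python
from datetime import datetime, timedelta

# Assumed reminder intervals per risk level.
FOLLOW_UP_AFTER = {
    "High": timedelta(days=3),
    "Medium": timedelta(days=14),
    "Low": timedelta(days=30),
}


def follow_up_action(risk_level: str, opened_at: datetime, now: datetime) -> str:
    """Return "wait", "remind", or "escalate" for an open ticket."""
    age = now - opened_at
    interval = FOLLOW_UP_AFTER[risk_level]
    if age >= 2 * interval:
        return "escalate"
    if age >= interval:
        return "remind"
    return "wait"


opened_at = datetime(2024, 1, 1)
print(follow_up_action("High", opened_at, datetime(2024, 1, 5)))  # 4 days old
print(follow_up_action("High", opened_at, datetime(2024, 1, 8)))  # 7 days old
print(follow_up_action("Low", opened_at, datetime(2024, 1, 5)))   # 4 days old
```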
Engineers are empowered to triage cloud security findings based on risk, mitigate security issues, and mark findings as false positive.
All findings have a link to an issue-specific runbook that provides multiple remediation options, along with practical tips to help engineers choose the option that suits their requirements.
Key takeaway: I see this again and again in successful security programs. Security issues given to developers need actionable, detailed steps for fixing them. Engineers are busy, so make it easy for them.
Scalable, cost effective, extensible
You get these mostly for free by using Lambdas as the work unit.
How to Build Security Processes that Work
1. Shut down the Risk Factory
It’s demoralizing to work on a high risk ticket only to see new, identical tickets appearing.
Where possible, implement measures to systemically address recurring security issues so that teams feel like they’re working towards an endgame.
CMON associates risk ratings with all findings, which enables resource owners to go after the most risky findings first.
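A minimal sketch of that prioritization, assuming each finding carries an `owner` and a numeric `risk_score` (both field names are hypothetical): group findings per owner and sort each owner's queue from most to least risky.

```python
from collections import defaultdict
from typing import Dict, List


def work_queues(findings: List[dict]) -> Dict[str, List[dict]]:
    """Group findings by owner, riskiest first within each owner's queue."""
    queues = defaultdict(list)
    for f in findings:
        queues[f["owner"]].append(f)
    for owner in queues:
        queues[owner].sort(key=lambda f: f["risk_score"], reverse=True)
    return dict(queues)


# Made-up example data.
findings = [
    {"id": "F1", "owner": "payments", "risk_score": 5},
    {"id": "F2", "owner": "payments", "risk_score": 13},
    {"id": "F3", "owner": "maps", "risk_score": 8},
]

queues = work_queues(findings)
print([f["id"] for f in queues["payments"]])
```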
2. One to Zero
One to zero is a powerful motivator. Focusing on specific issues or vulnerability classes is more effective than spreading attention across all issues.
Reducing open issues from 728 to 492 may not feel like a win, but permanently closing out all tickets for a specific issue class is a significant victory.
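To find good "one to zero" candidates, you could count open findings per issue class and look at the classes closest to zero first. The sketch below assumes each finding has an `issue_class` field, which is a hypothetical name, and the data is made up.

```python
from collections import Counter
from typing import List, Tuple


def classes_closest_to_zero(open_findings: List[dict]) -> List[Tuple[str, int]]:
    """Return (issue_class, open_count) pairs, smallest count first."""
    counts = Counter(f["issue_class"] for f in open_findings)
    # Break ties alphabetically so the ordering is deterministic.
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


open_findings = [
    {"id": "F1", "issue_class": "s3_public_acl"},
    {"id": "F2", "issue_class": "s3_public_acl"},
    {"id": "F3", "issue_class": "s3_public_acl"},
    {"id": "F4", "issue_class": "sg_open_ingress"},
]

ranking = classes_closest_to_zero(open_findings)
print(ranking)
```

Here the class with a single open finding sorts first, so closing that one ticket eliminates the whole class.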
3. Organic > Mechanical Outreach
CMON sends errors and alerts directly to customers (engineers). Remember that customers respond much better to organic, personal outreach.