Fail, Learn, Fix

Bryan Payne, Director of Engineering, Product & Application Security, Netflix

History: Lessons from Electrical Work

Bryan did some electrical work over Christmas break and was impressed that you could buy parts made by different companies, put them together, and end up with something that was safe and straightforward overall.

How did they get to that place?

It wasn’t always this way. In the 1880s, it was a Wild West and people were getting electrocuted. But then a standards body was created that wrote a document, the “National Electrical Code,” containing those standards as well as best practices. After that, deaths from electrocution trended down over time.

This is a general practice in engineering - examine failures, which may result from technical issues or people, and then spread that knowledge, so that the industry as a whole can learn and get better.

Fail, Learn, Fix in Computing

Bryan gives the example of the Therac-25, a software-controlled radiation therapy machine.

The Therac-25 had a software bug that ended up killing several people. Unlike prior machines, it didn’t have hardware protections that would blow a fuse and fail safely if parameters went outside of expected bounds.

The Therac-25 had a number of major flaws:

  • Lack of documentation

  • Insufficient testing

  • Cryptic / frequent error messages

  • Complicated software programmed in assembly

  • Custom real time OS

  • No fault tolerance / redundancy

  • Systemic failures - no hardware safeguards for software faults

Learnings included:

  • Test properly and thoroughly

  • Software quality and assurance must be a design priority from the beginning

  • Safety and quality are system attributes, not code attributes

  • Interface usability is important

  • Safety critical systems should be created by qualified personnel

A few years ago, someone wrote a retrospective on what had happened (The Therac-25: 30 Years Later) that covered questions like: how has the industry evolved? Are we doing better?

The long and short of it is - not really.

But because of the Therac-25, a law was passed that allowed the FDA to do a mandatory recall of devices, and stats had to be aggregated for device failures.

Fail, Learn, Fix in Security

Software security started in government with the Rainbow Books, a set of guidelines for evaluating the security of a system.

…security is expensive to set up and a nuisance to run… While we await a catastrophe, simpler setup is the most important step toward better security.

From Butler Lampson’s 2005 USENIX Security Keynote Address, Computer Security in the Real World.

This is still true today.

When Bryan rates how the security industry is doing currently, he gives it an A+ for Failing, C for Learning, and F for Fixing.

Paths to Success

Companies do retrospectives sometimes, but it’s all internal. We need to have detailed broader retrospectives on security issues with our peers. We’re not learning as much as we could be.

To Bryan, the biggest thing we need to do is not come up with some fancy new technical solution, but rather to talk, digging into the security problems we’re seeing together. What are the common themes? How do we start to move all of this forward together? As we do this, we can start to identify patterns.

Currently security patterns are sort of like the sewing industry - you can buy one from one company and it’s totally different from what you’d get somewhere else.

We need to think about how we can advance these patterns in a way that helps the whole industry. If there are problems in the patterns, then we fix them and the whole industry gets better, not just the one company.

Munawar Hafiz has a Security Pattern Catalog page that lists many patterns, and Microsoft has some nice ones on their Azure site as well.

If we want to make an actual difference, we need to find a way to package these patterns and make them easy for devs to adopt.

At Netflix, devs can simply set the following in a config file, and the application will automatically use a strong, modern TLS configuration when communicating with other services.

ENDPOINT_SECURITY=enabled
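
To make the idea concrete, here is a minimal Python sketch of how a single flag like this could be turned into a hardened TLS client configuration. This is not Netflix's implementation, which the talk does not describe. The function name build_client_tls_context, the choice to read the flag from an environment-style mapping, and the TLS 1.2 minimum are all assumptions made for illustration.

```python
import os
import ssl


def build_client_tls_context(env=None):
    """Return a hardened client TLS context if the flag is enabled, else None.

    Hypothetical helper: the flag name comes from the talk, but how it is
    read and what it maps to are assumptions for illustration only.
    """
    env = os.environ if env is None else env
    if env.get("ENDPOINT_SECURITY", "").strip().lower() != "enabled":
        # Flag not set: the caller decides what to do (ideally refuse to connect).
        return None

    # create_default_context() already enables certificate verification
    # and hostname checking for client-side connections.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)

    # Assumed policy: reject anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


ctx = build_client_tls_context({"ENDPOINT_SECURITY": "enabled"})
print(ctx.minimum_version.name, ctx.check_hostname)
```

The point of the pattern is that the developer only chooses "enabled"; the protocol versions and verification settings live in one shared place that the security team can update for everyone.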

Here’s how we can learn better:

  • Align on how we talk about our systems and our failures

  • Share lessons across the industry

  • Identify trends

  • Connect trends to risk/impact

Here’s how we can fix better:

  • Have security experts agree on proper patterns for fixing problems

  • Create real world implementations of the patterns

  • Ensure that it is trivial to use the implementations correctly

  • Integrate security into the computing ecosystem

Security Success Stories

Bryan calls out several security successes that are headed in the right direction.

Service-to-service auth has historically been different at every company, but SPIFFE offers a common way to handle service identity.
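
As a rough illustration of what a common identity format buys you, SPIFFE names each workload with a URI of the form spiffe://trust-domain/path. The Python sketch below splits such an ID into its trust domain and path. The function name parse_spiffe_id and the example ID are hypothetical, and the checks are deliberately simplified; this is not full validation against the SPIFFE specification.

```python
from urllib.parse import urlsplit


def parse_spiffe_id(spiffe_id):
    """Split a SPIFFE ID into (trust_domain, path).

    Simplified, illustrative checks only; a real implementation should
    follow the SPIFFE ID specification or use an official library.
    """
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe":
        raise ValueError("SPIFFE IDs must use the spiffe:// scheme")
    if not parts.netloc:
        raise ValueError("SPIFFE IDs must include a trust domain")
    if parts.query or parts.fragment:
        raise ValueError("SPIFFE IDs must not carry a query or fragment")
    return parts.netloc, parts.path


# Hypothetical workload identity used only for this example.
trust_domain, path = parse_spiffe_id("spiffe://example.org/billing/payments")
print(trust_domain, path)
```

Because every service is named the same way, authorization rules can be written against trust domains and paths instead of company-specific credentials.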

Software auto-updating, in browsers for example, has been a huge win. We should do this everywhere.

The percentage of Alexa 1M sites using HTTPS passed 50% in 2018, which is fantastic.

Let’s work on securing the whole system, not just a little slice of it. And let’s secure it in a way that works for everyone.

The way that we get there is by figuring out how we fail together, how we learn together, and ultimately how we fix the problems together.

Questions

Can you give a specific example of a failure Netflix had?

One AppSec engineer spent 6 to 9 months working on operationalizing an automated vulnerability scanner that never ended up finding a vulnerability.

The AppSec team learned that they didn’t have great scoping when they went into the project: their goals weren’t well aligned with reality.

Now the AppSec team focuses their efforts not on vulnerability identification but other areas in AppSec that are easier to solve.

They also learned that they need a set of objectives and success criteria much earlier in the process, so they can tell whether they’re spending their time well.

This would have allowed them to pull the plug 3 weeks in when they realized it wouldn’t work, rather than after 6 months.

Should we standardize on certain configurations (e.g. TLS)? This might make it easier to update and deploy those updates everywhere across orgs, bringing everyone to the same security bar.

Bryan avoided saying “standards” in the presentation because people bristle at it. Standards are hard, but probably the direction we need. Standards are why networking works, but we’ve moved away from standards as we’ve gone up the stack, e.g. cloud services.