Over the past year I’ve seen a huge uptick in interest for concrete advice on handling security incidents inside the cloud, with cloud native techniques. As organizations move their production workloads to the cloud, it doesn’t take long for the security professionals to realize that the fundamentals, while conceptually similar, are quite different in practice. One of those core concepts is that of the kill chain, a term first coined by Lockheed Martin to describe the attacker’s process. Break any link and you break the attack, so this maps well to combining defense in depth with the active components of incident response.
In cloud deployments we have four major categories of attack, each with different kill chains:
- Attacks on the cloud platform itself. Ignoring a fundamental compromise of the cloud provider (outside of the cloud customer’s hands) these attacks typically focus on misconfigurations of cloud services. If you leave an S3 bucket public, fail to put an authorizer on an API Gateway, or expose your credentials for AWS on GitHub, it falls in this category.
- Attacks on customer-deployed resources and applications in the cloud. These traditional attacks are no different than those run against your data center. Common examples include SQL injection in a web application, and vulnerable servers with the wrong ports open to the Internet. These tend to be a bit more constrained than they would be against a data center, assuming you use accounts/subscriptions/projects and VPCs or virtual networks to limit blast radius.
- Attacks against your cloud administrators and developers. Next time you run a penetration test make sure you let the attackers try and phish your developers and admins. This is one of the best ways to pivot into the cloud, since it’s often a lot easier for an attacker to gain access to a developer system than to break the cloud application itself. We’ll cover this in the future, but let’s just start with saying “MFA is my friend”.
- Blended attacks. This is the category we are going to focus on today. In these attacks the threat actor breaks into something deployed in the cloud and then uses that to pivot into the cloud management plane. (Some would consider attacks against developers to be blended, but I like to break them out separately).
As a rule of thumb I always start by assuming any successful attack at any level can escalate or pivot into a blended attack, at which point your management plane security and incident response become your best defenses.
Today I’m going to focus on one of the most common blended attack processes and outline a mix of detective and preventative controls to help break the kill chain. Before I get into the details, do not take this post as an over-simplification of a complex problem. Managing what I’m about to discuss at scale is incredibly difficult even when you know what you are doing.
Over the next few weeks we will be rolling out our first Ops specifically designed for these issues to our early access customers and we should have them in production relatively soon after that.
The Blended AWS Attack: Extracting IAM Role Credentials
In a blended attack the threat actor breaks into something more traditional and then uses that to pivot into the cloud management plane. There are three primary ways this occurs. In each case the attacker extracts either static stored credentials or ephemeral IAM role credentials, which we will explain in a minute.
- Direct compromise of an instance or a container. For example, if you leave port 22 open and the attacker can hack in or otherwise gain shell access.
- Server Side Request Forgery (SSRF). The attacker takes advantage of (typically) a web server/service vulnerability and can execute commands without gaining shell access.
- Compromise of a Lambda function. Although you can’t gain a shell on a Lambda, functions are still subject to code execution vulnerabilities, and even arbitrary code execution if the application code is flawed. The exact implications can resemble SSRF.
In each case the attacker’s goal is to gain credentials to the AWS management plane and then leverage existing privileges or escalate privileges. We will discuss privilege escalation in a future post, and for now will focus on what those credentials are and how you can prevent their abuse.
Most people understand static credentials, which in AWS are an Access Key and a Secret Key. They are like a username and password, but are used for AWS API calls. The current version of AWS request signing uses a cryptographic process known as Signature Version 4 to sign each HTTP request when you make those API calls. You can and should treat these keys just like a username and password — and you should never store them within cloud resources such as instances and Lambda functions.
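To make the signing process a bit more concrete, here is a minimal pure-Python sketch of the Signature Version 4 signing-key derivation, following the chained-HMAC steps AWS documents. The secret key, date, region, and service values are placeholders for illustration:

```python
import hashlib
import hmac


def sigv4_signing_key(secret_key: str, date: str, region: str, service: str) -> bytes:
    """Derive the Signature Version 4 signing key from a secret key.

    The secret key itself never goes over the wire; only signatures
    derived from this key do, scoped to one date, region, and service.
    """
    def sign(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = sign(("AWS4" + secret_key).encode("utf-8"), date)  # date as YYYYMMDD
    k_region = sign(k_date, region)                             # e.g. "us-east-1"
    k_service = sign(k_region, service)                         # e.g. "s3"
    return sign(k_service, "aws4_request")


# Placeholder secret key — in a real call this is the Secret Key paired
# with your Access Key:
key = sigv4_signing_key("wJalrEXAMPLESECRETKEY", "20230901", "us-east-1", "s3")
```

The derived key is then used to sign a canonical representation of the request; the point for our purposes is simply that whoever holds the Access Key and Secret Key can produce valid signatures.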
IAM roles are trickier when you first get started in AWS; they are both awesome and scary. An IAM role in AWS is effectively a container for permissions that you use for a session. IAM roles are great because they aren’t credentials per se: when you assume the role, AWS provides a set of credentials for a time-limited session. Roles are an “inside AWS only” thing. You can assign them to resources within AWS (like an instance or a Lambda function), and that resource can then make API calls without static, stored credentials! We use roles for federated identity connections, instances, Lambda functions, and every other service within AWS. Access keys are really only used when you create a user in an AWS account; we use roles for everything else.
Roles have four permission types associated with them:
- What the role can do within AWS. These are the straight up permission policies you attach to the role.
- Who or what can use the role (the trust policy). Creating a role doesn’t mean anything or anyone can use it, this policy restricts access to, for example, AWS instances or a specific Lambda function.
- A permission boundary to limit the scope of the role. This is a bit more complex and not relevant to our discussion today so we will cover it later.
- When you assume a role for a session, you can also specify a subset of your existing permissions to use for that session. This is a cool feature for least privilege, but also not totally relevant for our discussion today.
It’s probably easier to explain how this works by walking through it. Suppose I have an application that needs to access an S3 bucket or a DynamoDB table. I create an IAM role for the instance and set the trust policy so the EC2 service can use the role. Then I launch an instance and assign the role. AWS runs the instance and has the instance assume the role. Assuming the role opens a session and issues an access key, a secret key, and a session token. AWS then rotates those credentials every 1-6 hours, and the instance can now make the API calls authorized by the permission policies.
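As a hedged sketch of what the trust policy in that walkthrough might look like, here is the policy document as a Python dict (the kind of JSON you would pass to `aws iam create-role --assume-role-policy-document`). It allows only the EC2 service to assume the role on behalf of instances:

```python
import json

# Illustrative trust policy: only the EC2 service can assume this role,
# so only instances launched with it get the session credentials.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

print(json.dumps(trust_policy, indent=2))
```

What the role can *do* is attached separately as permission policies; this document only controls who can use it.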
While the credentials aren’t stored in the instance, they are still accessible to it. Any code running inside needs the credentials to make the actual API calls to access S3 and DynamoDB, so something known as the metadata service provides them on demand. The metadata service is a special feature in AWS for instances and containers that holds all the information about how the instance is configured. It’s pretty darn important for a server to be able to get its IP address, for example.
This is where the attack comes in.
The metadata service is simply a URL you can access that returns the requested information:
curl http://169.254.169.254/latest/meta-data/
provides all the basic information, and the path
curl http://169.254.169.254/latest/meta-data/iam/security-credentials/
lists the name of the attached role; appending that role name to the path returns the access key, secret key, and session token. (In the case of a Lambda-based attack this all looks different and you use SDK code instead of curl, but the same principles apply).
The attacker can then copy those credentials and use them somewhere else, embedding them in their own tools instead of having to load and run code on the compromised server. And because the service is URL-based, it is exposed to a wider range of SSRF attacks, since the attacker doesn’t need full arbitrary code execution. The credentials will expire at some point, but depending on the attack the attacker might just come back and grab a new set when the current ones stop working.
Smart attackers these days will use the credentials in an AWS account they control, since Amazon has some tooling to detect credentials extracted and used outside its known address ranges.
Breaking the IAM Role Extraction Kill Chain
Let’s map out the kill chain. The attacker needs to do the following:
- Discover and exploit a vulnerability in an instance, container, or Lambda that allows them to access the role credentials. This is pretty much always a mistake on the customer side… such as failing to patch, opening up the wrong ports, or deploying vulnerable code.
- Extract the current role credentials.
- Successfully run allowed API calls in an environment under their control.
- Do something bad within the allowed IAM role’s permission policy scope. I mean, probably bad, it isn’t like most attackers patch your code for you.
The following techniques can break different links in the chain and include a mix of detective and preventative controls. Don’t feel bad if this looks overwhelming… very, VERY few of the organizations I work with implement these comprehensively, especially at scale.
6 techniques to help break different links in the attack chain
- Vulnerability Management
- Least Privilege IAM Permissions Policies with Resource Restrictions
- Use IP, VPC, or other Request Origin Conditional Restrictions in the Permissions Policies
- Use Service Endpoints with Policies + Resource Policies
- Add Metadata Proxies with HTTP User Agent Filters (Metadata Service Protection)
- Duplicate Role Usage Protection
Vulnerability Management
- Complexity: Moderate
- Effectiveness: Low
- Scalability: Hard
- Type: Detective and Preventative
No surprise: your starting point should be wiping out all the initial vulnerabilities and misconfigurations the attacker can use to pivot and grab credentials. I only rated the complexity as moderate since there is nothing new or cloud-specific about this. But I also rate the effectiveness as low, since comprehensive vulnerability management hasn’t exactly prevented the legions of breaches over the past decades. Simple in concept, incredibly complex at scale.
Least Privilege IAM Permissions Policies with Resource Restrictions
- Complexity: Moderate
- Effectiveness: High
- Scalability: Moderate to Hard
- Type: Preventative
IAM policies in AWS are default deny and include explicit allow and deny statements. For example, you can write a policy that only allows access to read an S3 bucket. They also include resource restrictions: where the allow statement authorizes the role to call the read API, the resource restriction limits the role to reading specific buckets or objects. You should always ALWAYS start your defenses here. When I perform assessments I find, on nearly every single project, IAM policies allowing too many privileges (the API calls) and too few resource restrictions. Yes, that service might need access to DynamoDB, but does it need access to every table? This particular control is not too bad to implement at small scale, but the bigger you are, and the more humans are making these policy decisions, the harder it is to be consistent at scale. It’s also important to add explicit deny statements in case someone adds a new policy with new permissions to the role. Permissions are cumulative, but any deny statement overrides allow statements.
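A hedged sketch of what this looks like in practice, with all ARNs and table names made up for illustration: an allow scoped to one DynamoDB table, plus an explicit deny covering everything else in DynamoDB so a future, overly broad allow on the same role can’t widen access.

```python
import json

# Illustrative least-privilege policy. The role can read one table and
# nothing else; the explicit Deny overrides any Allow someone later
# attaches to the role for other tables.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadOrdersTableOnly",
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        },
        {
            "Sid": "DenyAllOtherDynamoAccess",
            "Effect": "Deny",
            "Action": "dynamodb:*",
            "NotResource": "arn:aws:dynamodb:us-east-1:123456789012:table/orders",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

Even if an attacker extracts this role’s credentials, the blast radius is one table’s worth of read calls.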
Use IP, VPC, or other Request Origin Conditional Restrictions in the Permissions Policies
- Complexity: High
- Effectiveness: Moderate to High
- Scalability: Hard
- Type: Preventative
IAM policies support conditional statements with a range of options, including the source IP address or source VPC. If you know a particular role should only ever make API calls from a specific resource in your application stack, you can lock the authorization to that exact IP address or subnet. If the attacker steals the credentials and tries to run them someplace else, the API calls will fail. This is a precision-guided sledgehammer: easy in concept and hard in execution, since you might find other complexities interfere with proper implementation. For example, API calls to AWS services either hit the Internet directly, run through a NAT Gateway, or route internally through a Service Endpoint (which we will discuss in a moment). The IP address detected will depend on the route the API call takes to the service. These are all manageable and detectable (and automatable), but you’ll want to read up first to make sure you understand the permutations.
Check out the first part of this post from Netflix for some examples.
Unless you run your Lambda function on a VPC this won’t be an option to protect a compromised function.
Use Service Endpoints with Policies + Resource Policies
- Complexity: Moderate
- Effectiveness: Moderate to High
- Scalability: Moderate
- Type: Preventative
In AWS a service endpoint is like a tap on the network: it takes traffic that would normally go over the Internet to an AWS service and re-routes it internally. Endpoints were originally created to allow fully private subnets in AWS, those without any way of reaching the Internet, to still access certain AWS services. Endpoints support policies you can use to restrict access and actions in ways very similar to IAM policies. In this case you add restrictions to the endpoint policy to only allow access to specific resources behind that endpoint (S3 is the most common example). You specify which buckets are allowed, and nothing in that subnet can reach any other bucket through that endpoint. Think of this as a backstop to the IAM policy: with a service endpoint using a restrictive policy, even if someone accidentally (or deliberately) allows the role broader access than it should have, it still can’t access anything not allowed in the service endpoint policy. This means we now have three layers of policies, all of which need to allow access to the resource:
- The IAM permission policy that allows the role access to the resource.
- The service endpoint policy that allows access to the resources when requests come through the endpoint, regardless of the permissions of the role used.
- The bucket or resource policy (this depends on the kind of resource) which can restrict access from only the approved IP addresses.
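A hedged sketch of the middle layer, the endpoint policy, with the bucket name made up for illustration. Any role in the subnet, no matter how broad its own IAM permissions, can only reach this one bucket through the endpoint:

```python
import json

# Illustrative S3 endpoint policy. "Principal": "*" is normal here: the
# endpoint doesn't care who is calling, it only constrains what they can
# reach through it. IAM and bucket policies still apply on top.
endpoint_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/*",
        }
    ],
}

print(json.dumps(endpoint_policy, indent=2))
```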
Unless you run your Lambda function on a VPC this, again, won’t be an option to protect a compromised function.
Add Metadata Proxies with HTTP User Agent Filters (Metadata Service Protection)
- Complexity: High
- Effectiveness: Moderate
- Scalability: Hard
- Type: Preventative
All of these controls assume an attacker can steal the role credentials, but what if we could reduce their ability to get those credentials even after they compromise the authorized instance or container? (This technique won’t work for Lambda functions). An emerging option is to restrict access to the metadata service in the first place. There have been some attempts to do this with iptables, but those can also break required functionality for the code running on the instance. In November of 2018 AWS and Netflix worked together and started adding user agent data to the HTTP headers of API calls made from the AWS SDKs. This is a defense against SSRF: most SSRF attacks rely on tricking an application into making HTTP requests on behalf of the attacker, but those requests usually come from a command line tool like curl or another process, and will lack the user agent header that comes from the AWS SDKs. To make this work you need to insert a proxy for those requests. There are some open source options for instances and containers, including proxies that run on the instance instead of requiring you to route traffic to a virtual appliance or squid proxy.
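The core check such a proxy applies can be sketched in a few lines. This is a minimal illustration, not one of the open source proxies mentioned above; the prefix list is illustrative, since real SDK user agent strings vary by SDK and version and a production proxy needs a maintained allowlist:

```python
from typing import Optional

# Illustrative allowlist of user agent prefixes an AWS SDK might send.
ALLOWED_UA_PREFIXES = ("Boto3/", "Botocore/", "aws-sdk-", "aws-cli/")


def allow_metadata_request(user_agent: Optional[str]) -> bool:
    """Permit a metadata request only if it looks like it came from an AWS SDK.

    An SSRF-driven request typically arrives via curl, wget, or an HTTP
    client library, none of which send an SDK user agent.
    """
    if not user_agent:
        return False
    return user_agent.startswith(ALLOWED_UA_PREFIXES)


print(allow_metadata_request("curl/7.68.0"))      # bare curl is refused
print(allow_metadata_request("Botocore/1.29.0"))  # SDK-shaped request passes
```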
This technique will not work if the attacker compromises the host instance and gains a shell, since they can disable the proxy or hijack the approved process.
You can read all about it in the details of this Netflix post.
Duplicate Role Usage Protection
- Complexity: High
- Effectiveness: High
- Scalability: High
- Type: Detective
This is another one that comes from the team at Netflix. They released a how-to for an excellent technique to detect when an IAM role is being used in an unauthorized location, even within AWS. I highly recommend you read the linked post, but the short version is they combine CloudTrail logs with some other tooling to keep a table of which instances are using which roles from which IP addresses. They then monitor other API calls to find when a role is being re-used from a new IP address at the same time it is in use from an approved one. Using this method you don’t need to know all your IP addresses in use throughout the organization, you dynamically build a table of what’s in use and detect when that role is being used someplace else at the same time. This is extremely scalable since you can run the logic centrally if you centralize CloudTrail, which is a common best practice anyway.
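The detection logic described above can be sketched as follows. This is a toy version under stated assumptions: real implementations parse CloudTrail records, and these simplified events carry only the fields the logic needs, with all values synthetic:

```python
from collections import defaultdict


def find_duplicate_role_use(events):
    """Flag (role, ips) pairs where one set of session credentials is
    seen from more than one source IP: the signature of extraction."""
    seen = defaultdict(set)  # (role, access_key_id) -> source IPs observed
    alerts = []
    for e in events:
        key = (e["role"], e["access_key_id"])
        seen[key].add(e["source_ip"])
        if len(seen[key]) > 1:
            alerts.append((e["role"], sorted(seen[key])))
    return alerts


events = [
    {"role": "app-role", "access_key_id": "ASIA-EXAMPLE-1", "source_ip": "10.0.1.5"},
    {"role": "app-role", "access_key_id": "ASIA-EXAMPLE-1", "source_ip": "10.0.1.5"},
    # The same temporary credentials suddenly appear from a new address
    # while still in use from the approved one:
    {"role": "app-role", "access_key_id": "ASIA-EXAMPLE-1", "source_ip": "198.51.100.7"},
]

print(find_duplicate_role_use(events))
```

The real version adds time windows, credential rotation handling, and the dynamically built table of approved instance/role/IP mappings, but the shape of the logic is the same.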
Summary
This is yet another beast of a post, and we don’t expect everyone can implement every one of these options in all deployments. To simplify, let’s walk through the IAM Role abuse kill chain:
- Discover and exploit a vulnerability in an instance, container, or Lambda that allows them to access the role credentials. This is pretty much always a mistake on the customer side… such as failing to patch, opening up the wrong ports, or deploying vulnerable code.
  - Vulnerability management (including tools like SAST and DAST for applications) and assessing your cloud configuration (using tools like DisruptOps or open source tools like Prowler and CloudMapper) are your first defense.
- Extract the current role credentials.
  - Metadata Service Protection and vulnerability management.
- Successfully run allowed API calls in an environment under their control.
  - Duplicate Role Usage Detection; IP, VPC, or other request origin conditional restrictions in the permissions policies; and service endpoints with policies plus resource policies.
- Do something bad within the allowed IAM role’s permission policy scope. I mean, probably bad, it isn’t like most attackers patch your code for you.
  - Least Privilege IAM Permissions Policies with Resource Restrictions.
Hopefully this gives you a better picture of how to reduce the success of these kinds of attacks.