Rich Mogull – FireMon.com

Improving the Grand Unified Theory of Cloud Governance
https://www.firemon.com/improving-the-grand-unified-theory-of-cloud-governance/
Tue, 24 Oct 2023 16:31:56 +0000

A smidge over a year ago I wrote the Grand Unified Theory of Cloud Governance. It’s a concept I’ve been playing with for about 5 or 6 years to try and encapsulate the root cause of the difficulties companies have adapting to cloud. Sure, the title is a bit egotistical, but I used to be a Gartner analyst so, shrug?

Like any (hopefully) good theory, I keep evolving it over time as I work with more companies and talk to more people. I’ve been using it a ton over the past couple of years in my speaking and training, especially as I’ve been pulled into more governance scenarios. Time and time again, the major problems I run into aren’t so much technical as organizational. Yes, there are many MANY technical complexities to cloud security, and they can and do result in breaches, but in my experience the governance issues far outweigh the technical ones.

Good governance can’t patch a zero day, but bad governance means the attacker never needs one.

The core of the theory hasn’t really changed; I just keep working on better ways to explain it. I also decided to trim it down slightly. Here’s how I currently plop it onto my slides:

  • Cloud decentralizes operations and infrastructure
  • But cloud unifies all administrative interfaces
  • And puts all admin portals and resources on the Internet, protected with a username and password

Going back to the previous version, the changes are small but also big:

  • Cloud has no chokepoints, and thus no gatekeepers.
  • All administrative and management functions are unified into a single user interface that is on the Internet.
    • Protected with a username, password, and, maybe, MFA.
  • Technology evolves faster than governance.

I still use “no chokepoints and gatekeepers” in my talk track, but I’ve found it’s a longer way of saying “decentralized”. The fundamental issue is independent control of the full stack outside of centralized infrastructure: a dev or app team can build and manage all of their own infrastructure in their own environment with just a credit card. There are still some dependencies and controls, especially in the data plane or when you need to tie back into networks, but that doesn’t change the primary point.

Trying to completely recentralize is rarely going to work.

Next, I didn’t really change the unification of administrative interfaces. To expand, we decentralized all the infrastructure and control at the deployment level, but everyone, in the world, uses the same web console and API endpoints. 

Attackers have one gateway to infinite targets.

Then I took the sub-bullet from version one and made it bullet 3. These admin portals are all on the Internet, and by default use little more than a username and a password. All resources are also one setting away from being on the Internet; just ask all those S3 buckets and Elasticsearch clusters.
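To make “one setting away” concrete, here’s a minimal sketch (boto3, with a hypothetical bucket name) of the guardrail side of that setting: S3 Block Public Access, which keeps a stray ACL or bucket policy from flipping the data public.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-prod-data"  # hypothetical bucket name

# Enable all four Block Public Access settings so a single ACL or
# bucket-policy change can't expose the contents to the Internet.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```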

It really is that simple. Teams manage their own stuff independently. They all, around the world, use the same web portals and API endpoints. And any idiot with the right credentials can poke and prod at the back end of your “datacenter”.

Now the crazy part: this was all laid out in 2011 in NIST SP 800-145, the two-page NIST Definition of Cloud Computing, which defined the five essential characteristics of cloud computing as:

  • On-demand Self Service
  • Broad Network Access
  • Resource Pooling
  • Rapid Elasticity
  • Measured Service

Take the first three points and we have:

  • Teams manage their own stuff
  • It’s all on the Internet
  • And all based on collective resource pools

Alright, so what does this all mean, and what do we do?

Accept it.

That’s the first step. Understand the problem and use it as a lens to devise our solutions. As I wrote recently in my Strong Authorization post:

Because they aren’t used to everything (potentially) being on the Internet. The entire management plane is on the Internet, so if an attacker gets credentials, you can’t stop them with a firewall or by shutting down access to a server.

Start there. Accept the base reality. What can we do to reduce that risk? To reduce those attacks? I think the most impactful choice is to focus on IAM, and the intersection of governance and IAM. Who do you have managing permissions? Access? What security controls can prevent, detect, and correct IAM-related attacks? What are your processes around IAM? Do your incident responders know the deep ins and outs of IAM for your given cloud provider(s)? Do you use JIT/Strong Authorization? How do you manage IAM for contractors and external services?

Start with your IAM governance and processes. Then choose and use technologies to support them. This is the single most impactful way to improve your cloud security. I really hope I’m not the first person to tell you that.

On Least Privilege, JIT, and Strong Authorization
https://www.firemon.com/on-least-privilege-jit-and-strong-authorization/
Wed, 18 Oct 2023 18:50:50 +0000

I’ve been employed as a security professional for over 20 years. I cannot possibly count the number of times I have uttered the words “least privilege”. It’s like a little mantra, sitting on the same bench as “defense in depth” and “insider threat”.

But telling someone to enforce least privilege and walking out of the room is the equivalent of the doctor telling you to “eat healthier” while failing you on your insurance physical and walking out of the room before overcharging you.

Least privilege is real. It matters. Unlike changing passwords every 90 days, it can have a material impact on improving your security. 

Least privilege is also really hard. Especially at scale. And it doesn’t work for your most important users. 

Why? Because least privilege isn’t the least privilege you need at that moment; it’s the least privilege you might ever need to do your job… ever. And when someone needs to do something out of scope from when those privileges were first mapped, it kicks off a slow change process that has to cross different teams and managers.

Or sometimes you just have to talk Bob into giving you access. And Bob is kind of a defensive jerk since he doesn’t trust anyone and doesn’t want to be blamed when you screw up.

Even with least privilege, if an attacker gets those credentials (the primary source of cloud native breaches) they can still likely do mean things. Because although least privilege isn’t always too horrible to implement for the average user or employee, it’s really hard to enforce on developers and administrators who, by design, need more privileges.

Just as we have MFA for strong authentication, we need something for strong authorization.

This is where Just in Time (JIT) comes into play. Instead of trying to figure out all the privileges someone needs ahead of time, they can request time-limited permissions at any point in time. I now believe that JIT should be the standard for administrative and sensitive access. 

My recommendation: least privilege is a great concept for general user access, but JIT is better for any level of admin/dev/sensitive access in cloud.

Just in Time

JIT is a flavor of PIM/PAM. Privileged Access Management and Privileged Identity Management are systems designed to escalate a user’s privileges. Users operate at a lower level until they need to escalate, and these systems use multiple techniques to provide expanded access, usually for a time-limited session. Today isn’t the day to get into the nuance, but the advantage is that they allow for flexibility while still maintaining security. Someone must request additional privileges when they need them, so even if their credentials are compromised, the attacker is still limited.

“JIT” (Just in Time) is one technique for PAM/PIM (or, really, any access). A user has base credentials that might not have access to anything at all, and then their privileges are escalated on request. We use JIT ourselves (and it’s available in Cloud Defense), and Netflix released an open source tool called ConsoleMe based on their internal tool. Azure has a built-in (but additional-fee) service called Entra ID Privileged Identity Management. (Entra ID is what we used to call Azure AD before someone decided it was a good idea to confuse millions of customers for branding purposes.) There are more options; these are just examples.

To enhance security, JIT needs to use an out-of-band approval flow and provide time-limited access. Those are the basics. The request and approval should flow through a different path from the normal authentication, like a form of MFA. The difference is that MFA is an out-of-band factor for authentication (proving you are who you say you are) and JIT is a form of authorization (you request and receive permission to do something).
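As a minimal sketch of those basics (not how any of the products above implement it; the role ARN and the approval hook are placeholders): a broker waits for an out-of-band approval, then hands back short-lived credentials by assuming a role with a small session duration.

```python
import boto3

def out_of_band_approval(user: str, role_arn: str, reason: str) -> bool:
    # Hypothetical hook: post to ChatOps / a PIM service and wait for an approver
    # (or an auto-approval rule). Denies by default until wired up.
    return False

def request_jit_access(user: str, role_arn: str, reason: str) -> dict:
    """Sketch of a JIT broker: approve out of band, then issue time-limited creds."""
    if not out_of_band_approval(user, role_arn, reason):
        raise PermissionError("JIT request denied or timed out")

    sts = boto3.client("sts")
    resp = sts.assume_role(
        RoleArn=role_arn,               # the privileged role being requested
        RoleSessionName=f"jit-{user}",  # shows up in CloudTrail for attribution
        DurationSeconds=900,            # 15 minutes; access expires on its own
    )
    # AccessKeyId, SecretAccessKey, SessionToken, Expiration
    return resp["Credentials"]
```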

Managing Friction

Both least privilege and JIT introduce friction. I mean, everything we do in security introduces some kind of friction, especially Bob. With least privilege the main friction is the overhead to define and deploy privileges, and what breaks when someone doesn’t have privileges they need. With JIT the friction is the process of submitting and receiving an approval.

Having used and researched both least privilege and JIT for a long time, I’ve learned techniques to reduce the friction. In some cases you end up with faster and better processes than how we’ve historically done things:

  • The request and approval flow needs to be real time. This means approvals via ChatOps, text messages, or the 5G chip implanted with your COVID vaccine.
  • For lower-privileged access, like read access to some logs, you can and should support self-approval. How does this help? Because it still uses the out-of-band process and reduces the ability of an attacker to leverage lost/stolen/exposed credentials.
  • You can also support auto-approvals where you don’t even need to click over to self-approve. How does this help? You can auto-approve but also use your out-of-band channel to notify that privileges were escalated. You’ve probably seen this if you’ve ever added a Netflix or Hulu device to your account. Awareness alone can be incredibly effective.
  • If this is for developers, you need to support the command line and other tools they use. Go to them. Make it super-easy to use. If you force them to log into a security tool the project will fail.
  • If approvers aren’t responsive, like instantly, you will fail. Don’t make Bob the only approver.

Bring the capability to your devs/admins in the tools they already use. Make it fast and frictionless. Ideally, make it easier and faster than opening up a password manager or clicking around an SSO portal stuffed with 374 cloud accounts to pick from. Buy Bob some cookies. Chocolate chip. (Oh wait, that’s me).
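For illustration, a hedged sketch of the self/auto-approval-plus-notification pattern from the list above; the webhook URL, role names, and thresholds are all placeholders, not any particular product’s behavior.

```python
import json
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder webhook
LOW_RISK_ROLES = {"log-reader", "metrics-viewer"}  # hypothetical roles safe to auto-approve

def notify(text: str) -> None:
    # Out-of-band channel: even an auto-approved escalation is visible to the team.
    req = urllib.request.Request(
        SLACK_WEBHOOK,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

def handle_request(user: str, role: str, reason: str) -> bool:
    if role in LOW_RISK_ROLES:
        # Auto-approve low-risk access, but still announce it out of band so a
        # stolen credential can't escalate silently.
        notify(f"auto-approved '{role}' for {user}: {reason}")
        return True
    # Everything else waits for a human approver in ChatOps (not shown here).
    notify(f"'{role}' requested by {user}: {reason} - needs approval")
    return False
```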

You can also use automation to reduce friction for least-privilege access. The Duckbill Group implemented their own version of automated least privilege using different tech, with the help of Chris Farris. Tools like AWS IAM Access Advisor help you monitor used permissions and scope them down. Automation helps you implement least privilege at scale, and can also be an adjunct to JIT.
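As one hedged example of that automation (the principal ARN is a placeholder), Access Advisor data can be pulled with boto3 to flag services a role is allowed to call but has never actually used:

```python
import time
import boto3

iam = boto3.client("iam")
arn = "arn:aws:iam::123456789012:role/example-app-role"  # placeholder principal

# Kick off an Access Advisor report for this principal...
job = iam.generate_service_last_accessed_details(Arn=arn)

# ...poll until the report is ready...
while True:
    details = iam.get_service_last_accessed_details(JobId=job["JobId"])
    if details["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(2)

# ...then list services the principal is permitted to use but never has.
for svc in details.get("ServicesLastAccessed", []):
    if svc.get("TotalAuthenticatedEntities", 0) == 0:
        print("never used, candidate to scope out:", svc["ServiceNamespace"])
```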

When to Use Which

Least privilege isn’t a dead concept by any means. It’s still the gold standard for everyday users/employees that need a pretty consistent level of access. JIT is best for more-privileged access, especially to production environments, and especially in cloud where credential exposures are THE biggest source of breaches. Here’s where we use it ourselves:

  • Developer read access to production.
  • Developer change access to production (outside CI/CD). Far more restricted with more approvers required.
  • Admin access to prod accounts.
  • Incident response access.
  • Some dev account access, since it can be faster than going back to the SSO portal, especially when working on the command line.

I no longer think least privilege alone is a valid concept for any significant level of privileged access in cloud (IaaS/PaaS), even when we use strong MFA. It’s too hard to properly scope permissions at scale, over time. JIT is a far better option in these use cases. Least privilege is still very viable when consistent permissions over time are needed, especially combined with good access logging and MFA. JIT is the companion to MFA. It’s the strong authorization to pair with your strong authentication. As we continue to move more critical operations into management planes that are exposed to the Internet, JIT is the way.

A Paramedic’s Top 2 Tips for Cloud Incident Response
https://www.firemon.com/a-paramedics-top-2-tips-for-cloud-incident-response/
Wed, 11 Oct 2023 19:28:10 +0000

One of the advantages of having a lot of unique hobbies is that they wire your brain a little differently. You will find yourself approaching problems from a different angle as you mentally cross-contaminate different domains. As a semi-active Paramedic, I find tons of parallels between responding to meat-bag emergencies and managing bits-and-bytes emergencies.

I’ve been teaching a lot of cloud incident response over the past few years and started using two phrases from Paramedicland that seem to resonate well with budding incident responders. These memory aids do a good job of refining focus and optimizing the process. While they apply to any incident response, I find they play a larger role on the cloud side due to the inherent differences caused predominantly by the existence of the management plane.

Sick or Not Sick

Paramedics can do a lot compared to someone off the street, but we are pretty limited in the realm of medicine. We are exquisitely trained to rapidly recognize threats to life and limb, be they medical or trauma, and to stabilize and transport patients to definitive care. One key phrase that gets hammered into us is “sick or not sick.” It’s a memory aid to help us remember to focus on the big picture and figure out if the patient is in deep trouble.

I love using this one to help infosec professionals gauge how bad an incident is. For cloud, we teach them to identify significant findings that require them to home in on a problem right then and there before moving on. In EMS, it’s called a “life threat.” Since cloud incident response leverages existing IR skills with a new underlying technology, that phrase is just a reminder to consider the consequences of a finding that may not normally trigger a responder’s instincts. Here are some simple examples:

  • Data made public in object storage (S3) that shouldn’t be.
  • A potentially compromised IAM entity with admin or other high privileges.
  • Multiple successful API calls using different IAM users from the same unknown IP address.
  • Cross-account sharing of an image or snapshot with an unknown account.
  • A potentially compromised instance/VM that has IAM privileges.
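The third one, for example, is straightforward to hunt for. A hedged sketch, assuming CloudTrail is queryable through Athena via the AWS-documented cloudtrail_logs table (database name and results bucket are placeholders):

```python
import boto3

athena = boto3.client("athena")

# Flag source IPs that successfully used more than one IAM user.
query = """
SELECT sourceipaddress,
       COUNT(DISTINCT useridentity.username) AS distinct_users
FROM cloudtrail_logs
WHERE errorcode IS NULL
  AND useridentity.username IS NOT NULL
GROUP BY sourceipaddress
HAVING COUNT(DISTINCT useridentity.username) > 1
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # assumption: table lives in 'default'
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```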

When I write them out, most responders go, “duh, that’s obvious,” but in my experience, traditional responders need a little time to recognize these issues and realize they are far more critical than the average compromised virtual machine.

“Sick or not sick” in the cloud almost always translates to “is it public or did they move into the management plane (IAM).”

Sick or not sick. Every time you find a new piece of evidence, a new piece of the puzzle, run this through your head to figure out if your patient is about to crash, or if they just have the sniffles.

Stop the Bleed

Many of you have probably taken a CPR and First Aid class. You likely learned the “ABCs”: Airway, Breathing, and Circulation.

Yeah, it turns out we really screwed that one up.

Research started to show that, in an emergency, people would focus on the ABCs to the exclusion of the bigger picture. Even paramedics would get caught performing CPR on someone who was bleeding out from a wound to their leg. Sometimes it was perfect CPR. You could tell by how quickly the patient ran out of blood. These days we add “treat life threat” to the beginning, and “stop the bleed” is the top priority.

See where I’m headed?

In every class I’ve taught, I find highly experienced responders focusing on their analysis and investigation while the cloud is bleeding out in front of them. Why?

Because they aren’t used to everything (potentially) being on the Internet. The entire management plane is on the Internet, so if an attacker gets credentials, you can’t stop them with a firewall or by shutting down access to a server. If something is compromised and exposed, it’s compromised and exposed to… well, potentially everyone, everywhere, all at once.

Stop the bleed goes hand in hand with sick or not sick. If you find something sick, do you need to contain it right then and there before you move on? It’s a delicate balance because if you make the wrong call, you might be wasting precious time as the attacker continues to progress. Stop the bleed equals “this is so bad I need to fix it now.” But once you do stop the bleed, you need to jump right back in where you were and continue with your analysis and response process since there may still be a lot of badness going on.

My shortlist?

  • Any IAM entity with high privileges that appears compromised.
  • Sensitive data that is somehow public.
  • Cross-account/subscription/project sharing or access to an unknown destination.

There’s more, but that’s the shortlist. Every one of these indicates active data loss or compromise and you need to contain them right away.

Seeing it in Action

Here’s an example. The screenshots are a mix of Slack, the AWS Console, and FireMon Cloud Defense. That’s my toolchain, and this will work with whatever you have. In the training classes, we also use Athena queries to simulate a SIEM, but I want to keep this post short(ish).

Let’s start with a medium-severity alert in Slack from our combined CSPM/CDR platform:

Sick or Not Sick? We don’t know yet. This could be totally legitimate. Okay, time to investigate. I’ll show this both in the platform and in the AWS console. My first step is to see what is shared where. Since the alert has the AMI ID, we can jump right to it:

Okay, I can see this is shared with another account. Is that an account I own? That I know? My tool flags it as untrusted since it isn’t an account registered with the system, but in real life, I would want to check my organization’s master account list just to double-check.

Okay, sick or not sick? In my head, it’s still a maybe. I have an image shared to a potentially untrusted account. But I don’t know what is shared yet. I need to trace that back to the source instance. I’m not bothering with full forensics; I’m going to rely on contextual information since I need to figure this out pretty quickly. In this case, we lucked out:

It has “Prod” in the name, so… I’m calling this “probably sick.” Stop the bleed? In real life, I’d try to contact whoever owned that AWS account first, but for today, I do think I have enough information to quarantine the AMI. Here’s how in the console and Cloud Defense:
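A rough boto3 equivalent of the same quarantine, with the AMI ID and the unknown account ID as placeholders:

```python
import boto3

ec2 = boto3.client("ec2")
ami_id = "ami-0123456789abcdef0"   # placeholder: the image from the alert
unknown_account = "111122223333"   # placeholder: the untrusted account it was shared with

# Surgical option: remove just the suspicious account's launch permission.
ec2.modify_image_attribute(
    ImageId=ami_id,
    LaunchPermission={"Remove": [{"UserId": unknown_account}]},
)

# Blunt option: reset launchPermission entirely so the AMI is private again.
ec2.reset_image_attribute(ImageId=ami_id, Attribute="launchPermission")
```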

Okay, did we Stop the Bleed? We stopped… part of the bleed. We locked down the AMI, but we still don’t know how it ended up there. We also don’t know who owns that AWS account. Can we find out? Nope. If it isn’t ours, all we can do is report it to AWS and let them handle the rest.

Let’s hunt for the API calls to find out who shared it and what else they did. I’m going to do these next bits in the platform, but you would run queries in your SIEM or Athena to find the same information. I’ll do future posts on all the queries, but this post is focused on the sick/bleed concepts.
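Still, as a hedged sketch of what that hunt might look like in Athena (again assuming the AWS-documented cloudtrail_logs table and a placeholder results bucket):

```python
import boto3

athena = boto3.client("athena")

# Who created or shared the AMI, from which IP, and when?
query = """
SELECT eventtime, useridentity.arn AS principal, sourceipaddress, eventname
FROM cloudtrail_logs
WHERE eventname IN ('CreateImage', 'ModifyImageAttribute')
ORDER BY eventtime DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "default"},  # assumption: table lives in 'default'
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # placeholder bucket
)
```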

Okay, I see an IAM entity named ImageBuilder is responsible. Again, because this post is already running long, I checked a few things and here is what I learned:

  • ImageBuilder is an IAM user with privileges to create images and modify their attributes, but nothing more. However, the policy has no resource constraints, so it can create an image of any instance, and no condition constraints, so it can share with any account. This is a moderate to low blast radius: it’s over-privileged, but not horribly. I call it sorta-sick.
  • The API call came from an unknown IP address. This is suspicious, but still only sorta-sick.
  • This is the first time I’ve seen that IP address used by this IAM user, and the user shows previous activity aligned with a batch process. Okay, now I’m leaning towards sick. Usually we don’t see rotating IP addresses for jobs like this; it smells of a lost credential.
  • That IAM user can continue to perform these actions. Unless someone tells me they meant to do this, I’m calling it Sick and I’m going to Stop the Bleed and put an IAM restriction on that user account (probably a Deny All policy unless this is a critical process, and then I’d use an IP restriction).
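For illustration, a hedged sketch of that last containment step: an inline deny on the ImageBuilder user, with an optional IP condition if the batch job has to keep running (user name, policy name, and CIDR are placeholders).

```python
import json
import boto3

iam = boto3.client("iam")

# Deny everything unless the call comes from the batch job's known IP range.
# Drop the Condition block entirely for a flat deny-all containment.
deny_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {
            "NotIpAddress": {"aws:SourceIp": "203.0.113.0/24"}  # placeholder CIDR
        },
    }],
}

iam.put_user_policy(
    UserName="ImageBuilder",
    PolicyName="ir-containment-deny",  # placeholder policy name
    PolicyDocument=json.dumps(deny_policy),
)
```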

In summary:

  • I found an AMI shared to an unknown account: Sick
  • That AMI was for a production asset: Sick and Stop the Bleed
  • The action was from an IAM user with wide privileges to make AMIs and share them, but nothing else: Maybe sick, still investigating.
  • The IAM user made that AMI from an unknown, new IP address: Sick, Stop the (rest of the) Bleed.
  • There is no other detected activity from the IP address: Likely contained and no longer Sick.
  • I still don’t know how those credentials were leaked: Sick, and time to call in our traditional IR friends to see if it was a network or host compromise.

I went through this quickly to highlight how I think of these issues. With just a few differences, this same finding would have been totally normal. Imagine we realized it was shared with a new account we control that just wasn’t registered yet. Or that the AMI was for a development instance that doesn’t have anything sensitive in it. Or that the API calls came from our network, at the expected time, or from an admin’s system and they meant to share it. This example is not egregious, but it is a known form of data exfiltration used by active threat actors. As I find each piece of information, I evaluate whether it’s Sick or Not Sick and whether I need to Stop the Bleed.

How is this different in cloud? Because the stakes are higher when everything potentially touches the Internet. We need to think and act faster, and I find these memory aids helpful for keeping us on track.


]]>