Using a Metadata Proxy to Limit AWS/IAM Access with GitLab CI

2020-10-05

The gitlab-runner agent is very flexible, with multiple executors to handle most situations. Similarly, AWS IAM allows one to use “instance profiles” with EC2 instances, obviating the need for static, long-lived credentials. In the situation where one is running gitlab-runner on an EC2 instance, this presents us with a couple of interesting challenges – and opportunities.

How does one prevent CI jobs from being able to obtain credentials against the instance’s profile role?
How does one allow certain CI jobs to assume credentials through the metadata service without allowing all CI jobs to assume those credentials?

Criteria

Relevant Configuration

This isn’t going to be a comprehensive, step-by-step guide that can be followed without any external knowledge or resources. Rather, we’re going to focus on what one needs to know in order to implement this solution, however you’re currently provisioning CI agents.

For our purposes, we want:

The gitlab-runner agent to run on an EC2 instance, with one or more runners configured.¹
All configured runners should be using the Docker executor.
Jobs to run, by default, without access to the EC2 instance’s profile credentials.
Certain jobs to assume a specific role transparently through the EC2 metadata service by virtue of what runner picks them up.
Reasonable security:
- Jobs can’t just specify an arbitrary role to assume
- No hardcoded, static, or long-lived credentials
- Only short-term, transient credentials

It’s worth emphasizing this: no hardcoded, static, or long-lived credentials. Sure, it’s easy to generate an IAM user and plunk its keys in (hopefully) protected environment variables, but then you have to worry about key rotation, audits, etc., in a way one doesn’t with transient credentials.

Executor implies methodology

For our purposes, we’re going to solution this using the agent’s docker executor. Other executors will have different solutions (e.g. kubernetes has tools like kiam).

However, for fun let’s cheat a bit and do a quick-and-fuzzy run-through of a couple of the other executors.

docker+machine executor

This is largely like the plain docker executor, except that as EC2 instances will be spun up to handle jobs you can take a detour around anything complex by simply telling the agent to associate specific instance profiles with those new instances, e.g.:

[[runners]]
  [runners.machine]
    MachineOptions = [
        "amazonec2-iam-instance-profile=everything-except-the-thing",
        # ... (other machine options)
      ]

The instance running the gitlab-runner agent does not need to be associated with the same profile – but the agent does need to be allowed ec2:AssociateIamInstanceProfile and iam:PassRole on the relevant resources.

The downside is that you’ll have to have multiple runners configured if you want to be able to allow different jobs to assume different roles.

kubernetes executor

The kubernetes executor is going to be a bit trickier, and, as ever, TMTOWTDI (there’s more than one way to do it). Depending on what you’re doing, any of the following might work for you:

Launch nodes with the different profiles and use constraints to pick and choose which job pods end up running on them.
Use a solution like kiam.
…

Brute force

Ever a popular option, you can just brute-force block container (job) access to the EC2 metadata service by firewalling it off, e.g.:

# REJECT is only valid in the filter table, so match the forwarded container traffic there
iptables -I FORWARD \
    --destination 169.254.169.254 --protocol tcp --dport 80 \
    -i docker+ -j REJECT

If you just want to block all access from jobs, this is a good way to do it.

This approach is contraindicated if you want to be able to allow some containers to access the metadata service, or to allow them to retrieve credentials of some (semi) arbitrary role.

EC2 metadata proxy

A more flexible solution can be found by using a metadata proxy. This sort of service should be a benevolent man-in-the-middle: able to access the actual EC2 metadata service for its own credentials, able to inspect containers making requests to determine what role (if any) they should be assuming, and able to assume those roles and pass tokens back to jobs without those jobs being any the wiser about it.

For our purposes, we will use go-metadataproxy², which will handle:

EC2 metadata requests made by processes in containers (e.g. CI jobs);
Sourcing its own credentials from the actual EC2 metadata service;
Inspecting containers for the IAM role that should be assumed (via the IAM_ROLE environment variable);
Blocking direct access to the EC2 metadata service; and
Assuming the correct role and providing STS tokens transparently to the contained process.

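The authentication flow will look something like this: a job asks the metadata address for credentials, the request is redirected to the proxy, the proxy inspects the requesting container for IAM_ROLE, assumes that role using its own instance profile credentials, and hands the resulting STS tokens back to the job.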

This also means that the instance profile role must be able to assume the individual roles we want to allow jobs to assume, and the trust policy of the individual roles must allow the instance profile role to assume them.

In short:

The instance profile’s IAM role policy should only permit certain roles to be assumed, either by ARN or some sensible condition (tagged in a certain way, etc).
Roles in the account, in general, should not blindly trust any principal in the account to assume them.³

Configuring the CI agent correctly

Take care when registering the runner

We’re not going to cover it here, but take care when registering the runner. Under this approach, judiciously restricting access to the runner is a critical part of controlling what jobs may run with elevated IAM authority.

Keep a couple things in mind:

Registering runners is cheap; better to have more runners for more granular security than allow projects / pipelines with no need for access to use them.
Runners can be registered at the project, group, or (unless you’re on gitlab.com) the instance level; register them as precisely as your requirements allow.
Runner access can be further restricted and combined with project/group access by allowing them to run against protected refs only, and then restricting who can push/merge to protected branches (including protected tags) to trusted individuals.

Always set IAM_ROLE in the runner configuration

Anything that allows a pipeline author to control what role the proxy assumes is a security… concern. In this context, IAM_ROLE can be set on the container in one of several ways (in order of precedence):

Through the runner configuration;
By the pipeline author; or
By the creator of the image.

Unless you intend to allow the pipeline author to specify the role to assume, it is recommended that IAM_ROLE always be set in the runner configuration file, config.toml. If you don’t want any role to be assumed, great, set the variable to a blank value.

go-metadataproxy discovers the role to assume by interrogating the docker daemon, inspecting the container of the process seeking credentials from the EC2 metadata service. It does this by looking for the value of the IAM_ROLE environment variable set on the container.
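To make that lookup concrete, here is a minimal Python sketch of pulling IAM_ROLE out of a container's environment as `docker inspect` reports it (a list of KEY=VALUE strings under Config.Env). This is an illustration only, not go-metadataproxy's actual code; the function name, the sample values, and the choice to treat a blank value as "no role" are assumptions.

```python
from typing import Iterable, Optional


def iam_role_from_env(env: Iterable[str]) -> Optional[str]:
    """Return the IAM_ROLE value from KEY=VALUE strings, or None if unset or blank."""
    values = {}
    for entry in env:
        key, sep, value = entry.partition("=")
        if sep:
            # Assumption of this sketch: if a key repeats, the later entry wins.
            values[key] = value
    # A blank IAM_ROLE is treated the same as no role at all.
    return values.get("IAM_ROLE") or None


# Hypothetical environment, shaped like Config.Env in `docker inspect` output.
container_env = ["PATH=/usr/local/bin:/usr/bin", "IAM_ROLE=ci-deploy", "CI=true"]
print(iam_role_from_env(container_env))
```

The point of the sketch is that the role is read from the container's own environment, which is why whoever controls that environment controls the role.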

IAM_ROLE must be set on the container itself. While whitelisting the list of allowed images isn’t a terrible idea, the safest and most reliable way of controlling this as the administrator of the runner is to simply set the environment variable as part of the runner configuration.

[[runners]]
  environment = [
    "IAM_ROLE=some-role-name-or-arn",
    # ... (other environment variables)
  ]

This also means that we’re going to want a runner configuration per IAM role. (Not terribly surprising, I would hope.)
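As a rough illustration of "one runner configuration per role", the following Python sketch renders a partial `[[runners]]` stanza for each runner. The runner names and role names are hypothetical, and the stanzas are deliberately incomplete: a registered runner also carries settings such as its URL and token, which are omitted here.

```python
import json


def runner_stanza(name: str, iam_role: str) -> str:
    """Render a partial [[runners]] stanza that pins IAM_ROLE for one runner."""
    # json.dumps gives us a double-quoted string, which suits these simple values.
    env_entry = json.dumps(f"IAM_ROLE={iam_role}")
    return "\n".join([
        "[[runners]]",
        f"  name = {json.dumps(name)}",
        '  executor = "docker"',
        f"  environment = [{env_entry}]",
    ])


# Hypothetical runners: one that may assume a role, one pinned to a blank role.
ROLES = {"deploy-runner": "ci-deploy", "default-runner": ""}

for runner_name, role in ROLES.items():
    print(runner_stanza(runner_name, role))
    print()
```

Note that the runner with no elevated access still sets IAM_ROLE, just to a blank value, in line with the advice above.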

Running the metadata proxy

This is reasonably straightforward, in two parts. There are a number of ways to run it, but as we’re doing this in a docker environment anyways, why not let it handle all the messy bits for us?

$ git clone https://github.com/jippi/go-metadataproxy.git
$ cd go-metadataproxy
$ docker build -t local/go-metadataproxy:latest .
$ docker run \
    --detach \
    --restart=always \
    --net=host \
    --name=metadataproxy \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e AWS_REGION=us-west-2 \
    -e ENABLE_PROMETHEUS=1 \
    local/go-metadataproxy:latest

Using the metadata proxy

To use the proxy, the containers must be able to reach it in the same way they would reach the actual EC2 metadata endpoint. We need to prevent requests to the metadata endpoint from reaching the actual endpoint, and instead have them transparently redirected to the proxy. (That is, we’re going to play Faythe⁴ here.)

To “hijack” container requests to the EC2 metadata service, a little iptables magic is in order. This is well described in the project’s README. I’m including it here as well for completeness’ sake, and with one small change: instead of redirecting connections coming in on docker0, we redirect any coming in on docker+ (that is, any interface whose name begins with docker). (If you’re using the runner’s network per build functionality, you may need to tweak this.)

As we’re exposing the metadataproxy on port 8000, you’ll want to make sure that port is firewalled off from the outside; either via iptables or a security group.

# this makes an excellent addition to /etc/rc.local
LOCAL_IPV4=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

# Redirect container requests for the metadata service to the proxy on port 8000
/sbin/iptables \
  --append PREROUTING \
  --destination 169.254.169.254 \
  --protocol tcp \
  --dport 80 \
  --in-interface docker+ \
  --jump DNAT \
  --table nat \
  --to-destination "$LOCAL_IPV4:8000" \
  --wait

# Drop TCP port 80 traffic arriving on any interface other than docker0
/sbin/iptables \
  --wait \
  --insert INPUT 1 \
  --protocol tcp \
  --dport 80 \
  \! \
  --in-interface docker0 \
  --jump DROP

IAM role requirements

EC2 Instance Profile

The role belonging to the instance profile associated with the instance our agent lives on should be able to assume the roles we want to allow CI jobs to assume. Specifically, its permissions policy must permit iam:GetRole and sts:AssumeRole on these roles.

If you’re using S3 for shared runner caches, you may wish to permit this access through the instance profile role as well. (Implemented properly, the proxy will not permit CI jobs to use this role directly.)
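A sketch of what the instance profile role's permissions policy might look like, built in Python so the placeholders are obvious. The account ID and role name are hypothetical; the two actions are the ones named above, scoped to an explicit list of role ARNs rather than a wildcard.

```python
import json


def instance_role_policy(assumable_role_arns):
    """Permissions policy for the instance profile's role (the proxy's own identity)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["iam:GetRole", "sts:AssumeRole"],
                # Only the listed roles may be looked up and assumed.
                "Resource": sorted(assumable_role_arns),
            }
        ],
    }


# Hypothetical account ID and role name.
policy = instance_role_policy(["arn:aws:iam::123456789012:role/ci-deploy"])
print(json.dumps(policy, indent=2))
```

Listing ARNs explicitly is the simplest form; a condition on tags, as mentioned earlier, would be an alternative way of scoping the same statement.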

Container / Job IAM roles for assumption

As before, only containers with IAM_ROLE set at the container level will have tokens returned to them by the metadata proxy⁵, and then only if the proxy can successfully assume the role and convince STS to issue tokens for it. For this to happen, the container/job role’s trust policy must allow the role of the instance profile associated with the EC2 instance to assume it. Specifically, the trust policy must permit sts:AssumeRole for that principal; iam:GetRole is granted on the instance profile role’s side, as trust policies only accept STS assume-role actions.
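For the other half of the handshake, here is a sketch of a trust policy for a job role that names the instance profile's role as the only principal allowed to assume it. The ARN is a hypothetical placeholder. The statement carries only sts:AssumeRole, since that is the action a trust policy governs.

```python
import json


def job_role_trust_policy(instance_role_arn):
    """Trust policy letting only the instance profile's role assume the job role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Trust one specific role, not every principal in the account.
                "Principal": {"AWS": instance_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical ARN of the role behind the runner instance's profile.
trust = job_role_trust_policy("arn:aws:iam::123456789012:role/gitlab-runner-instance")
print(json.dumps(trust, indent=2))
```

Naming the instance role directly, rather than the account root, is what keeps other principals in the account from assuming the job role.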

Profit!

Alright! You should now have a good idea as to how to create and run CI jobs that:

CANNOT request tokens directly from the EC2 metadata service
CANNOT implicitly assume the EC2 instance profile’s role
CANNOT leak static or long-lived credentials
CAN transparently assume certain specific roles
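One way to spot-check the list above from inside a job is to ask the metadata address which role it is serving credentials for. The sketch below uses the standard EC2 metadata path for that; it assumes the proxy mirrors the real service by returning the role name at that path, and the helper names and the injectable `fetch` parameter are conveniences of this sketch.

```python
from urllib.request import urlopen

# Standard EC2 metadata path listing the role whose credentials are served.
CREDENTIALS_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def http_get(url: str, timeout: float = 2.0) -> str:
    """Fetch a URL and return its body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def served_role(fetch=http_get) -> str:
    """Return the role name advertised at the metadata address."""
    return fetch(CREDENTIALS_URL).strip()


def check_role(expected: str, fetch=http_get) -> bool:
    """True if the advertised role is exactly the one we expect."""
    return served_role(fetch) == expected
```

In a job picked up by a runner configured with a role, calling `check_role("ci-deploy")` (a hypothetical role name) should return True if everything is wired up; in a job on a runner with a blank IAM_ROLE, the request would be expected to fail or return nothing useful.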

Enjoy :)

¹ The nomenclature gets a bit tricky here.
gitlab-runner: The agent responsible for running one or more runner configurations.
A “runner”: A single runner configuration being handled by the gitlab-runner agent. Also, an entity that can run CI jobs, from the perspective of the CI server (e.g. gitlab.com proper).

² Lyft also has an excellent tool at https://github.com/lyft/metadataproxy. I’ve used it with success, but go-metadataproxy provides at least rudimentary metrics for scraping.

³ Not that anyone would ever create a trust policy like that, or that it would be one of the defaults offered by the AWS web console. Nope. That would never happen.

⁴ https://en.wikipedia.org/wiki/Alice_and_Bob

⁵ Unless, of course, the metadata proxy is configured with a default role – but we’re not going to do that here.