# Using a Metadata Proxy to Limit AWS/IAM Access with GitLab CI


The `gitlab-runner` agent is very flexible, with [multiple executors](https://docs.gitlab.com/runner/executors/README.html)
to handle most situations.  Similarly, AWS IAM allows one to use "instance
profiles" with EC2 instances, obviating the need for static, long-lived
credentials.  In the situation where one is running `gitlab-runner` on an EC2
instance, this presents us with a couple of interesting challenges -- and
opportunities.

<!--more-->

1. How does one prevent CI jobs from being able to obtain credentials against
   the instance's profile role?
2. How does one allow certain CI jobs to assume credentials through the metadata
   service without allowing _all_ CI jobs to assume those credentials?

## Criteria

{{< admonition info "Relevant Configuration" >}}

This isn't going to be a comprehensive, step-by-step guide that can be
followed without any external knowledge or resources.  Rather, we're going to
focus on what one needs to know in order to implement this solution, however
you're currently provisioning CI agents.

{{< /admonition >}}

For our purposes, we want:

1. The `gitlab-runner` agent to run on an EC2 instance, with one or more
   runners configured.[^runners]
2. All configured runners to use the [Docker executor](https://docs.gitlab.com/runner/executors/docker.html).
3. Jobs to run, by default, without access to the EC2 instance's profile
   credentials.
4. Certain jobs to assume a specific role transparently through the EC2
   metadata service by virtue of what runner picks them up.
5. Reasonable security:
    * Jobs can't just specify an arbitrary role to assume
    * No hardcoded, static, or long-lived credentials

### Only short-term, transient credentials

It's worth emphasizing this:  no hardcoded, static, or long-lived credentials.
Sure, it's easy to generate an IAM user and plunk its keys in (hopefully)
[protected environment variables](https://docs.gitlab.com/ee/ci/variables/README.html#protect-a-custom-variable), but then you have to worry about key
rotation, audits, etc., in a way one doesn't with transient credentials.

## Executor implies methodology

For our purposes, we're going to solve this using the agent's [docker executor](https://docs.gitlab.com/runner/executors/docker.html).
Other executors will have different solutions (e.g. Kubernetes
has tools like [kiam](https://github.com/uswitch/kiam)).

However, for fun let's cheat a bit and do a quick-and-fuzzy run-through of a
couple of the other executors.

{{< admonition note "docker+machine executor" false >}}

This is largely like the plain `docker` executor, except that as EC2 instances
will be spun up to handle jobs you can take a detour around anything complex
by simply telling the agent to associate specific instance profiles with those
new instances, e.g.:

```toml
[[runners]]
  [runners.machine]
    MachineOptions = [
        "amazonec2-iam-instance-profile=everything-except-the-thing",
        ...,
      ]
```

The instance running the `gitlab-runner` agent does not need to be associated
with the same profile, but the agent does need to be allowed
`ec2:AssociateIamInstanceProfile` and `iam:PassRole` on the relevant resources.

The downside is that you'll have to have multiple runners configured
if you want to be able to allow different jobs to assume different roles.

{{< /admonition >}}

{{< admonition note "kubernetes executor" false >}}

The `kubernetes` executor is going to be a bit trickier, and, as ever,
TMTOWTDI[^tmtowtdi].  Depending on what you're doing, any of the following
might work for you:

1. Launch nodes with the different profiles and use constraints to pick and
   choose which job pods end up running on them.
2. Use a solution like [kiam](https://github.com/uswitch/kiam).
3. ...

{{< /admonition >}}

## Brute force

Ever a popular option, you can just brute-force block container (job) access
to the EC2 metadata service by firewalling it off, e.g.:

```sh
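# REJECT is only valid in the filter table, so match the forwarded container
# traffic in FORWARD rather than in nat/PREROUTING
iptables -I FORWARD \
    --destination 169.254.169.254 --protocol tcp --dport 80 \
    -i docker+ -j REJECT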
```

If you just want to block all access from jobs, this is a good way to do it.

This approach is contraindicated if you want to be able to allow _some_
containers to access the metadata service, or to allow them to retrieve
credentials of some (semi) arbitrary role.

## EC2 metadata proxy

A more flexible solution can be found by using a metadata proxy.  This sort of
service should be a benevolent man-in-the-middle: able to access the actual
EC2 metadata service for its own credentials, able to inspect containers
making requests to determine what role (if any) they should be assuming, and
able to assume those roles and pass tokens back to jobs without those jobs
being any the wiser about it.

For our purposes, we will use
[go-metadataproxy](https://github.com/jippi/go-metadataproxy)[^lyftmdp], which
will handle:

1. EC2 metadata requests made by processes in containers (e.g. CI
   jobs);
2. Sourcing its own credentials from the actual EC2 metadata service;
3. Inspecting containers for the IAM role that should be assumed (via the
   `IAM_ROLE` environment variable);
4. Blocking direct access to the EC2 metadata service; and
5. Assuming the correct role and providing STS tokens transparently to the
   contained process.

The authentication flow will look something like this:

{{< mermaid >}}
sequenceDiagram
    autonumber
    participant mdp as metadataproxy
    participant docker
    participant job as CI job
    job->>mdp: client attempts to request credentials from EC2
    mdp-->>docker: inspect job container
    docker-->>mdp: "IAM_ROLE" is "foobar"
    mdp-->>mdp: STS tokens for role "foobar"
    mdp->>job: STS tokens for assumed role "foobar" returned
{{< /mermaid >}}

This also means that the instance profile role must be able to assume the
individual roles we want to allow jobs to assume, and the trust policy of the
individual roles must allow the instance profile role to assume them.

In short:

* The instance profile's IAM role policy should only permit certain roles to
    be assumed, either by ARN or some sensible condition (tagged in a certain
    way, etc).
* Roles in the account, in general, should not blindly trust any principal in
    the account to assume them.[^1]

## Configuring the CI agent correctly

{{< admonition tip "Take care when registering the runner" >}}

We're not going to cover it here, but take care when [registering the
runner](https://docs.gitlab.com/runner/register/index.html).  Under this
approach, **judiciously restricting access to the runner is a critical part of
controlling what jobs may run with elevated IAM authority**.

Keep a couple things in mind:

* Registering runners is cheap; better to have more runners for more granular
    security than allow projects / pipelines with no need for access to use
    them.
* Runners can be registered at the project, group, or (unless you're on
    gitlab.com) the instance level; register them as precisely as your
    requirements allow.
* Runner access can be further restricted and combined with project/group
    access by allowing them to [run against protected refs only](https://docs.gitlab.com/ee/ci/runners/#prevent-runners-from-revealing-sensitive-information),
    and then [restricting who can push/merge to protected branches](https://docs.gitlab.com/ee/user/project/protected_branches.html) (including
    [protected tags](https://docs.gitlab.com/ee/user/project/protected_tags.html)) to trusted individuals.

{{< /admonition >}}

{{< admonition warning "Always set IAM_ROLE in the runner configuration" >}}

Anything that allows a pipeline author to control what role the proxy assumes
is a security... concern.  In this context, `IAM_ROLE` can be set on the
container in one of several ways (in order of precedence):

1. Through the runner configuration;
2. By the pipeline author; or
3. By the creator of the image.

**Unless you intend to allow the pipeline author to specify the role to
assume, it is recommended that `IAM_ROLE` always be set in the runner
configuration file, `config.toml`.**  If you don't want any role to be
assumed, great, set the variable to a blank value.

{{< /admonition >}}

`go-metadataproxy` discovers the role to assume by interrogating the docker
daemon, inspecting the container of the process seeking credentials from the
EC2 metadata service.  It does this by looking for the value of the `IAM_ROLE`
environment variable set on the container.

`IAM_ROLE` must be set on the container itself.  While
[whitelisting](https://docs.gitlab.com/runner/configuration/advanced-configuration.html#restrict-allowed_images-to-private-registry)
the list of allowed images isn't a terrible idea, the safest and most reliable
way of controlling this as the administrator of the runner is to simply set
the environment variable as part of the runner configuration.

```toml
[[runners]]
  environment = [
    "IAM_ROLE=some-role-name-or-arn",
    ...,
  ]
```

This also means that we're going to want a _runner configuration per IAM
role_.  (Not terribly surprising, I would hope.)
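
Since a runner without `IAM_ROLE` leaves the choice to the pipeline author or
the image, it can be worth auditing `config.toml` for it.  What follows is a
rough sketch rather than a real TOML parser: the function name
`runners_missing_iam_role` is made up for this post, and it assumes each
`[[runners]]` table starts on its own line with `name` and `IAM_ROLE` written
in the usual one-entry-per-line style.

```sh
# Hypothetical helper: list [[runners]] entries in a config.toml that never
# mention IAM_ROLE.  Line-based and naive; it does not truly parse TOML.
runners_missing_iam_role() {
    awk '
        # print the runner we just finished scanning if it lacked IAM_ROLE
        function report() {
            if (in_runner && !has_role) { print name; missing = 1 }
        }
        /^[ \t]*\[\[runners\]\]/ {
            report()
            in_runner = 1; has_role = 0; name = "(unnamed)"
            next
        }
        in_runner && /^[ \t]*name[ \t]*=/ {
            name = $0
            sub(/^[^"]*"/, "", name)   # strip up to the opening quote
            sub(/".*$/, "", name)      # strip from the closing quote on
        }
        # a blank value ("IAM_ROLE=") still counts as explicitly set
        in_runner && /"IAM_ROLE=/ { has_role = 1 }
        END { report(); exit (missing ? 1 : 0) }
    ' "$1"
}

# usage: exits non-zero and names the offenders if any runner lacks IAM_ROLE
# runners_missing_iam_role /etc/gitlab-runner/config.toml
```

Something like this could run as part of whatever provisions the agent, so a
runner added in a hurry doesn't quietly fall through to pipeline-controlled
roles.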

## Running the metadata proxy

This is reasonably straightforward, in two parts.  There are a number of ways
to run it, but as we're doing this in a docker environment anyways, why not
let it handle all the messy bits for us?

```sh
$ git clone https://github.com/jippi/go-metadataproxy.git
$ cd go-metadataproxy
$ docker build -t local/go-metadataproxy:latest .
$ docker run \
    --detach \
    --restart=always \
    --net=host \
    --name=metadataproxy \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e AWS_REGION=us-west-2 \
    -e ENABLE_PROMETHEUS=1 \
    local/go-metadataproxy:latest
```

## Using the metadata proxy

To use the proxy, the containers must be able to reach it in the same way they
would reach the actual EC2 metadata endpoint.  We need to prevent requests to
the metadata endpoint from reaching the actual endpoint, and instead be
transparently redirected to the proxy.  (That is, we're going to play
Faythe[^alicebob] here.)

To "hijack" container requests to the EC2 metadata service, a little iptables
magic is in order.  This is well described in [the project's
README](https://github.com/jippi/go-metadataproxy#routing-container-traffic-to-go-metadataproxy).
I'm including it here as well for completeness' sake, and with one small
change: instead of redirecting connections arriving on `docker0`, we redirect
any arriving on `docker+`.  (If you're using the runner's [network per
build](https://gitlab.com/gitlab-org/gitlab-runner/-/issues/1042)
functionality, you may need to tweak this.)

As we're exposing the metadataproxy on port 8000, you'll want to make sure
that port is firewalled off from the outside; either via `iptables` or a
security group.

```sh
# this makes an excellent addition to /etc/rc.local
LOCAL_IPV4=$(curl --silent http://169.254.169.254/latest/meta-data/local-ipv4)

# redirect container requests for the metadata service to the proxy
/sbin/iptables \
  --append PREROUTING \
  --destination 169.254.169.254 \
  --protocol tcp \
  --dport 80 \
  --in-interface docker+ \
  --jump DNAT \
  --table nat \
  --to-destination "$LOCAL_IPV4:8000" \
  --wait

# drop connections to the proxy's port (8000) that do not arrive on a docker
# bridge, matching the docker+ pattern used above
/sbin/iptables \
  --wait \
  --insert INPUT 1 \
  --protocol tcp \
  --dport 8000 \
  \! \
  --in-interface docker+ \
  --jump DROP
```
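
To confirm from inside a job that the redirect and the proxy are doing what
you expect, one option is to compare the identity STS reports against the
role the runner is supposed to hand out.  This is only a sketch: the helper
names `assumed_role_name` and `require_role` are invented here, and it assumes
the AWS CLI is available in the job image.

```sh
# Hypothetical helper: extract the role name from an STS assumed-role ARN,
# e.g. arn:aws:sts::123456789012:assumed-role/foobar/some-session
assumed_role_name() {
    case "$1" in
        arn:*:sts::*:assumed-role/*/*)
            rest=${1#*:assumed-role/}      # -> foobar/some-session
            printf '%s\n' "${rest%%/*}"    # -> foobar
            ;;
        *)
            return 1                       # not an assumed-role ARN
            ;;
    esac
}

# Hypothetical helper: succeed only if the caller ARN ($2) belongs to the
# expected role ($1)
require_role() {
    actual=$(assumed_role_name "$2") || {
        echo "not running under an assumed role: $2" >&2
        return 1
    }
    if [ "$actual" != "$1" ]; then
        echo "expected role $1, got $actual" >&2
        return 1
    fi
}

# usage, early in a job's script:
# require_role foobar "$(aws sts get-caller-identity --query Arn --output text)"
```

If the instance profile's own role ever leaks through to a job, a check like
this fails loudly instead of letting the job carry on with the wrong
authority.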

## IAM role requirements

### EC2 Instance Profile

The role belonging to the instance profile associated with the instance our
agent lives on should be able to assume the roles we want to allow CI jobs to
assume.  Specifically, its permissions policy (not its trust policy) must
permit `iam:GetRole` and `sts:AssumeRole` on these roles.
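
A permissions (identity) policy for the instance profile role along these
lines might look like the following sketch, which prints a policy document
allowing the two actions on an explicit list of role ARNs and nothing else.
The function name `instance_profile_policy`, the account ID, and the role and
policy names in the usage comment are placeholders invented for illustration.

```sh
# Hypothetical helper: print an identity policy that permits assuming (and
# looking up) only the role ARNs given as arguments.
instance_profile_policy() {
    [ "$#" -gt 0 ] || {
        echo "usage: instance_profile_policy <role-arn>..." >&2
        return 1
    }
    resources=
    for arn in "$@"; do
        # build a comma-separated list of quoted ARNs
        resources="${resources:+$resources, }\"$arn\""
    done
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["sts:AssumeRole", "iam:GetRole"],
      "Resource": [$resources]
    }
  ]
}
EOF
}

# usage (placeholder names):
# aws iam put-role-policy --role-name ci-agent --policy-name assume-ci-roles \
#     --policy-document "$(instance_profile_policy \
#         arn:aws:iam::123456789012:role/ci-deploy \
#         arn:aws:iam::123456789012:role/ci-docs)"
```

Listing ARNs explicitly is the blunt option; a condition on role tags would
also work, at the cost of having to trust whoever can tag roles.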

If you're using S3 for [shared runner caches](https://docs.gitlab.com/runner/configuration/autoscale.html#distributed-runners-caching), you may wish to
permit this access through the instance profile role as well.  (Implemented
properly, the proxy will not permit CI jobs to use this role directly.)

### Container / Job IAM roles for assumption

As before, only containers with `IAM_ROLE` set at the container level will
have tokens returned to them by the metadata proxy[^2], and then only if the
proxy can successfully assume the role and convince STS to issue tokens for
it.  For this to happen, the container/job role's trust policy must allow the
role of the instance profile associated with the EC2 instance to assume it.
Specifically, the trust policy must permit `sts:AssumeRole` for that
principal.  (`iam:GetRole` is not a trust-policy action; it belongs in the
instance profile role's permissions policy.)
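
As a sketch of the trust side, the snippet below prints a trust policy that
lets exactly one principal, the instance profile's role, call
`sts:AssumeRole`.  The function name `job_role_trust_policy`, the account ID,
and the role names are placeholders invented for illustration.

```sh
# Hypothetical helper: print a trust policy allowing only the given role ARN
# (the instance profile's role) to assume the job role.
job_role_trust_policy() {
    [ "$#" -eq 1 ] || {
        echo "usage: job_role_trust_policy <instance-role-arn>" >&2
        return 1
    }
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "$1" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF
}

# usage (placeholder names):
# aws iam update-assume-role-policy --role-name foobar \
#     --policy-document "$(job_role_trust_policy \
#         arn:aws:iam::123456789012:role/ci-agent)"
```

Naming the instance profile's role as the principal, rather than the account
root, is what keeps other principals in the account from assuming the job
role.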

## Profit!

Alright!  You should now have a good idea as to how to create and run CI jobs
that:

1. CANNOT request tokens directly from the EC2 metadata service
2. CANNOT implicitly assume the EC2 instance profile's role
3. CANNOT leak static or long-lived credentials
4. CAN transparently assume certain specific roles

Enjoy :)

[^alicebob]: https://en.wikipedia.org/wiki/Alice_and_Bob

[^tmtowtdi]: As every JAPH knows: There's More Than One Way To Do It.

[^lyftmdp]: Lyft also has an excellent tool at
  https://github.com/lyft/metadataproxy.  I've used it with success, but
  `go-metadataproxy` provides at least rudimentary metrics for scraping.

[^1]: Not that anyone would ever create a trust policy like that, or that it
  would be one of the defaults offered by the AWS web console.  Nope.
  That would never happen.

[^2]: Unless, of course, the metadata proxy is configured with a default role
  -- but we're not going to do that here.

[^runners]: The nomenclature gets a bit tricky here.

    `gitlab-runner`
    : The agent responsible for running one or more runner configurations.

    A "runner"
    : A single runner configuration being handled by the `gitlab-runner` agent.
    : An entity that can run CI jobs, from the perspective of the CI server (e.g.
    gitlab.com proper).