Making and deploying AMIs with Gitlab CI

Filed under aws on March 04, 2021

I’ve had three weeks off from work for a little break after a particularly tough job, so I decided to start sharpening some of my AWS knowledge and actually get that DevOps Pro certification sometime soon.

One of the practice questions I was looking at dealt with quickly spinning up EC2 instances in an autoscaling group, the answer to which was to bake an AMI using a script. I realised I’d never actually done that before, so thought I’d give it a go.

The Repo and Things I can probably improve on

Here’s the gitlab link. This is just a barebones thing to prove to myself I can do it and to have a reference I can pilfer code from later on. There’s a couple of things I think I can improve on here:

Better deployment of the code to the AMI build instance. Using scp and SSH seems a little clunky; I reckon there’s probably a better way using Ansible or the like.
Cleanup. I think the way I make sure the instance is shut down can be improved, probably by using tags to find the instance rather than keeping the ID in a file.
Python scripts would probably be a better way to do this. Bash works, but it can be a bit difficult to debug.
I could probably be a little smarter about bidding on spot instances and get a better rate.

What I’ve Built

Let’s have a look at the pipeline

Very simple, nothing special in these steps.

Build the Binaries

The first gitlab job is specified as such

Build Binary:
  stage: build-binary
  image: golang:alpine3.13
  variables:
    GOOS: linux
    GOARCH: amd64
    CGO_ENABLED: 0
  artifacts:
    paths:
      - build/
  script:
    - go build -o build/hello-world cmd/main.go

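We just put the compiled binary into the build folder and export it as an artifact.

One thing I noticed: to do it this way we need to turn CGO off (CGO_ENABLED: 0), otherwise we get linker issues further down the line. This is most likely because a cgo build on Alpine links dynamically against musl libc, which a glibc-based instance image doesn’t provide.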

Build the AMI

Here is the meat of the operation. The ami-baker.sh bash script here does a few different things:

Generates a temporary RSA key
Orders a spot instance
Uses scp to copy over the binary we built in the last step
Uses ssh to put the binary in /opt and configure systemd
Calls the create image API to generate the AMI
Shuts down the build instance

The gitlab job:

Build AMI:
  stage: build-ami
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  dependencies:
    - Build Binary
  artifacts:
    paths:
      - build/ami_id
  before_script:
    - apt-get update
    - apt-get install -y ssh jq
  script:
    - mkdir -p build
    # Sub in our variables
    - cat launch-spec.json | jq ".ImageId=env.BUILD_IMAGE" | jq ".NetworkInterfaces[0].Groups=[env.BUILD_GROUP]" | jq ".NetworkInterfaces[0].SubnetId=env.SUBNET_ID" > build/launch-spec.json
    # Make sure our script is executable
    - chmod +x ./ami-baker.sh
    # Make the AMI
    - ./ami-baker.sh
  after_script:
    # Make sure the build instance is terminated. Swallow the error code because we don't want a
    # successful run to choke it
    - ls -R
    - aws ec2 terminate-instances --instance-ids `cat build/build_instance_id` || exit 0

This is where I came to the conclusion I probably should have used python. I downloaded jq to manipulate the JSON, and the script itself is a bit of a mess. You can see it in the above gitlab repo, but I’ll break down the main parts.

Generating a launch specification

The first step is to spin up a machine with which to generate the image. For that we’ll need a launch specification.

The launch specification file looks like the following

{
  "ImageId": "ami-075a72b1992cb0687",
  "InstanceType": "t3.nano",
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvda",
      "Ebs": {
        "DeleteOnTermination": true,
        "VolumeSize": 8,
        "VolumeType": "gp3"
      }
    }
  ],
  "Monitoring": {
    "Enabled": false
  },
  "NetworkInterfaces": [
    {
      "AssociatePublicIpAddress": true,
      "DeleteOnTermination": true,
      "DeviceIndex": 0,
      "Groups": [ ],
      "SubnetId": ""
    }
  ]
}

And I’m using jq to set the configurable values, and output it to another file for later use.

cat launch-spec.json \
  | jq ".ImageId=env.BUILD_IMAGE" \
  | jq ".NetworkInterfaces[0].Groups=[env.BUILD_GROUP]" \
  | jq ".NetworkInterfaces[0].SubnetId=env.SUBNET_ID" > build/launch-spec.json

We use our launch-spec.json template file as a base, and fill in the AMI image, security group and subnet ID for the new instance.

I’ve done this in the .gitlab-ci.yml file. I’m not sure that’s 100% correct, so if I use this in a production pipeline I’ll have to think about it some more and decide on the right spot.
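The same substitution can be done in one jq call by passing the values in with --arg. This is a minimal sketch rather than what the repo does: the function name render_launch_spec and the up-front empty-value check are assumptions added for illustration. The check is there because env.NAME in jq evaluates to null for an unset variable, so a missing CI variable would otherwise end up in the launch spec unnoticed.

```shell
# Fill the launch spec template with an AMI, security group and subnet.
# Usage: render_launch_spec TEMPLATE IMAGE_ID GROUP_ID SUBNET_ID
# Example (hypothetical call site):
#   render_launch_spec launch-spec.json "$BUILD_IMAGE" "$BUILD_GROUP" "$SUBNET_ID" \
#     > build/launch-spec.json
render_launch_spec() {
  local template="$1" image_id="$2" group_id="$3" subnet_id="$4"

  # Refuse to render with a missing value instead of writing an empty field.
  if [ -z "$image_id" ] || [ -z "$group_id" ] || [ -z "$subnet_id" ]; then
    echo "render_launch_spec: image, group and subnet are all required" >&2
    return 1
  fi

  # One jq process; each assignment feeds the next through the pipe.
  jq --arg image "$image_id" --arg group "$group_id" --arg subnet "$subnet_id" \
    '.ImageId = $image
     | .NetworkInterfaces[0].Groups = [$group]
     | .NetworkInterfaces[0].SubnetId = $subnet' \
    "$template"
}
```

Keeping it as a function also makes it easy to move out of .gitlab-ci.yml and into the baker script later.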

Ordering a spot instance

EC2 spot instances are a good fit for ephemeral compute, and using them lowers the overall cost. Here I’m requesting a spot instance for a defined duration, so we need to tell AWS how long we’ll be using that resource and what we want to launch. We’ll use the minimum possible block of time, which is 60 minutes, and pass in the launch spec we’ve generated.

# Quote the query so the shell never treats the [0] as a glob pattern
export SPOT_REQUEST_ID=$(aws ec2 request-spot-instances \
  --block-duration-minutes 60 \
  --launch-specification file://build/launch-spec.json \
  --query 'SpotInstanceRequests[0].SpotInstanceRequestId' \
  --output text)

Once we’ve ordered the instance, the script just loops waiting for the request to be fulfilled. If it gets to 10 iterations, it will cancel the order and exit with an error.
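A minimal sketch of that wait loop, not lifted from the repo: the function name wait_for_spot_instance and the 15-second default delay are assumptions for illustration. It polls describe-spot-instance-requests until the request has an instance ID, and cancels the request if the attempts run out.

```shell
# Poll a spot request until it has an instance, or cancel it and fail.
# Usage: wait_for_spot_instance REQUEST_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
# Example (hypothetical call site):
#   INSTANCE_ID=$(wait_for_spot_instance "$SPOT_REQUEST_ID") || exit 1
wait_for_spot_instance() {
  local request_id="$1" max_attempts="${2:-10}" delay="${3:-15}"
  local attempt=1 instance_id

  while [ "$attempt" -le "$max_attempts" ]; do
    instance_id=$(aws ec2 describe-spot-instance-requests \
      --spot-instance-request-ids "$request_id" \
      --query 'SpotInstanceRequests[0].InstanceId' \
      --output text)

    # With text output the CLI prints "None" for a null field,
    # which here means no instance has been assigned yet.
    if [ -n "$instance_id" ] && [ "$instance_id" != "None" ]; then
      echo "$instance_id"
      return 0
    fi

    attempt=$((attempt + 1))
    sleep "$delay"
  done

  # Out of attempts: cancel the request so nothing launches later.
  aws ec2 cancel-spot-instance-requests \
    --spot-instance-request-ids "$request_id" >/dev/null
  return 1
}
```

The function prints only the instance ID on success, so the caller can capture it directly and treat a non-zero exit as "give up".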

Copying over resources

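The script has generated a key for itself at /tmp/sshkey, so we’ll use the EC2 Instance Connect API to push the public half to the instance. The pushed key is only valid for 60 seconds, so the connection has to follow promptly: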

aws ec2-instance-connect send-ssh-public-key \
  --instance-id $INSTANCE_ID \
  --instance-os-user ec2-user \
  --availability-zone $INSTANCE_AZ \
  --ssh-public-key file:///tmp/sshkey.pub

I’ve found this can be a bit hit or miss with eventual consistency, so you either have to loop your connection attempts or sleep for 10-20 seconds before trying to jump on.
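The looping option could look something like the sketch below. The retry helper is hypothetical (it is not in the repo) and simply re-runs whatever command it is given until it succeeds or the attempts run out.

```shell
# Run a command until it succeeds, up to MAX_ATTEMPTS times.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Example (hypothetical call site, probing SSH with a no-op command):
#   retry 6 5 ssh -o "StrictHostKeyChecking no" -o ConnectTimeout=5 \
#     -i /tmp/sshkey "ec2-user@$INSTANCE_IP" true
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2

  # "$@" is now the command to run; loop until it exits with 0.
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

Keep the whole retry window inside the lifetime of the pushed key, or wrap the key push and the connection attempt together so both are retried.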

The scp command is simple enough

scp -o "StrictHostKeyChecking no" \
  -i /tmp/sshkey ./build/hello-world ./hello-world.service ec2-user@$INSTANCE_IP:~/

We copy the hello-world binary and the hello-world.service systemd unit file.

Configure systemd

Forgive this block, this is another reason I should have used python

ssh -o "StrictHostKeyChecking no" \
    -i /tmp/sshkey ec2-user@$INSTANCE_IP \
    "sudo mkdir -p /opt && sudo mv /home/ec2-user/hello-world /opt/hello-world && sudo mv ~/hello-world.service /etc/systemd/system && sudo systemctl enable hello-world.service && sudo systemctl start hello-world.service"

We use SSH to connect and run a few commands to put the binary into /opt, the systemd service definition into /etc/systemd/system and run the enable and start commands.

Create the AMI

I think this might be better implemented with some sort of callback, but I’d have to trawl through some docs to see if that’s even possible. For now, we’ll make the call and just enter a wait loop while EC2 sorts itself out.

export AMI_ID=`aws ec2 create-image \
  --instance-id $INSTANCE_ID \
  --name hello-world-$CI_COMMIT_SHORT_SHA \
  --query ImageId \
  --output text`

We give it a name that includes the git short hash, which I like to do so that I know where this resource has come from.
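Short of a real callback, the AWS CLI ships a waiter, aws ec2 wait image-available, that can stand in for a hand-rolled wait loop. A minimal sketch, assuming a hypothetical bake_ami wrapper that is not part of the repo:

```shell
# Create an AMI from an instance and block until it is usable.
# Usage: bake_ami INSTANCE_ID AMI_NAME   (prints the AMI ID on success)
# Example (hypothetical call site):
#   AMI_ID=$(bake_ami "$INSTANCE_ID" "hello-world-$CI_COMMIT_SHORT_SHA") || exit 1
bake_ami() {
  local instance_id="$1" ami_name="$2" ami_id

  ami_id=$(aws ec2 create-image \
    --instance-id "$instance_id" \
    --name "$ami_name" \
    --query 'ImageId' \
    --output text) || return 1

  # Built-in CLI waiter: polls until the image state is "available"
  # and exits non-zero if it gives up first.
  aws ec2 wait image-available --image-ids "$ami_id" || return 1

  echo "$ami_id"
}
```

The waiter is still polling under the hood, but the retry and timeout logic lives in the CLI rather than in the script.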

As our final step here, we write the AMI ID into a text file and export it as an artifact to use in the deployment phase.

echo $AMI_ID > build/ami_id

As I keep saying, there’s gotta be a better way to do this. Just have to find it.

Deployment

This hinges on some pretty standard CloudFormation, which I’ve written about in this other blog entry. Check it out if you need a refresher on templates.

Our gitlab job looks like this, and really just runs the aws cloudformation deploy command


Deploy:
  stage: deploy
  image: registry.gitlab.com/gitlab-org/cloud-deploy/aws-base:latest
  when: manual
  dependencies:
    - Build AMI
  script:
    - aws cloudformation deploy --no-fail-on-empty-changeset --template-file ./ServerTemplate.yml --stack-name test-hello-world --parameter-overrides VpcID=$VPC_ID Subnet=$SUBNET_ID ImageID=`cat build/ami_id`

Let’s break down the deploy command a little bit.

aws cloudformation deploy \
  --no-fail-on-empty-changeset \
  --template-file ./ServerTemplate.yml \
  --stack-name test-hello-world \
  --parameter-overrides VpcID=$VPC_ID Subnet=$SUBNET_ID ImageID=`cat build/ami_id`

We use the VPC_ID and SUBNET_ID variables as configured in GitLab, and we cat out the AMI ID we wrote to the build/ami_id file in the previous step.

Final Thoughts

This definitely isn’t an ideal solution. There’s a number of things I’d like to do around cleanup, but I’ll probably just let it lie for now and fix it if I need it in the future.
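For the cleanup, finding the build instance by tag instead of by a saved ID could look roughly like this. It is only a sketch: the function name terminate_tagged_instances and the Purpose=ami-build tag in the example are made up for illustration, and the script would first have to tag the instance once it launches (for example with aws ec2 create-tags).

```shell
# Terminate every pending or running instance carrying a given tag.
# Usage: terminate_tagged_instances TAG_KEY TAG_VALUE
# Example (hypothetical tag):
#   terminate_tagged_instances Purpose ami-build
terminate_tagged_instances() {
  local tag_key="$1" tag_value="$2" instance_ids

  instance_ids=$(aws ec2 describe-instances \
    --filters "Name=tag:${tag_key},Values=${tag_value}" \
              "Name=instance-state-name,Values=pending,running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text) || return 1

  # Nothing matched: treat as success so a clean run does not fail.
  if [ -z "$instance_ids" ] || [ "$instance_ids" = "None" ]; then
    return 0
  fi

  # Intentionally unquoted: the IDs come back whitespace-separated
  # and each one must be passed as its own argument.
  aws ec2 terminate-instances --instance-ids $instance_ids >/dev/null
}
```

Because an empty result counts as success, this could sit in after_script without needing to swallow the exit code.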