Awesome AWS CodePipeline CI

After several talks at work about the feasibility of using AWS CodeBuild and AWS CodePipeline to verify the integrity of our codebase, I decided to give it a try.

We use pull requests and branching extensively, so one requirement is that we can dynamically pick up branches other than the master branch. AWS CodePipeline only works on a single branch out of the box, so I decided to use GitHub’s webhooks, AWS API Gateway and AWS Lambda to dynamically support multiple branches:

Architecture

First, you create a master AWS CodePipeline, which will serve as a template for all non-master branches.
Next, you set up an AWS API Gateway and an AWS Lambda function which can create and delete AWS CodePipelines based on the master pipeline.
Lastly, you wire GitHub webhooks to the AWS API Gateway, so that opening a pull request duplicates the master AWS CodePipeline, and closing the pull request deletes it again.

overview of the architecture

Details

AWS Lambda

For the AWS Lambda function I decided to use golang & eawsy, as the combination allows for extremely easy Lambda function deployments.
The implementation is straightforward and relies on the AWS Go SDK to interface with the AWS CodePipeline API.
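
To make this more concrete, here is a minimal sketch of that create/delete logic written against the AWS Go SDK. It is not the code from the repository: the payload struct, the function names and the master-pipeline name are assumptions for illustration, and the eawsy runtime wiring is left out.

// Minimal sketch of the create/delete logic using the AWS Go SDK.
// The payload struct, function names and pipeline names are illustrative,
// and the eawsy runtime wiring is left out.
package main

import (
  "encoding/json"
  "fmt"

  "github.com/aws/aws-sdk-go/aws"
  "github.com/aws/aws-sdk-go/aws/session"
  "github.com/aws/aws-sdk-go/service/codepipeline"
)

// the relevant subset of GitHub's pull_request webhook payload
type pullRequestEvent struct {
  Action      string `json:"action"` // e.g. "opened", "closed"
  PullRequest struct {
    Head struct {
      Ref string `json:"ref"` // the pull request branch
    } `json:"head"`
  } `json:"pull_request"`
}

const masterPipeline = "master-pipeline" // illustrative name of the template pipeline

// handle creates or deletes a per-branch pipeline depending on the webhook action.
func handle(payload []byte, svc *codepipeline.CodePipeline) error {
  var evt pullRequestEvent
  if err := json.Unmarshal(payload, &evt); err != nil {
    return err
  }

  branch := evt.PullRequest.Head.Ref
  // note: branch names containing characters that are invalid in pipeline names
  // (e.g. "/") would need to be sanitized here
  name := fmt.Sprintf("%s-%s", masterPipeline, branch)

  switch evt.Action {
  case "opened", "reopened":
    return clonePipeline(svc, masterPipeline, name, branch)
  case "closed":
    _, err := svc.DeletePipeline(&codepipeline.DeletePipelineInput{Name: aws.String(name)})
    return err
  }
  return nil
}

// clonePipeline copies the master pipeline declaration, renames it, and points
// the source action at the pull request branch before creating it.
func clonePipeline(svc *codepipeline.CodePipeline, master, name, branch string) error {
  out, err := svc.GetPipeline(&codepipeline.GetPipelineInput{Name: aws.String(master)})
  if err != nil {
    return err
  }

  decl := out.Pipeline
  decl.Name = aws.String(name)
  for _, stage := range decl.Stages {
    for _, action := range stage.Actions {
      if _, ok := action.Configuration["Branch"]; ok {
        action.Configuration["Branch"] = aws.String(branch)
      }
    }
  }

  _, err = svc.CreatePipeline(&codepipeline.CreatePipelineInput{Pipeline: decl})
  return err
}

func main() {
  svc := codepipeline.New(session.Must(session.NewSession()))
  // the payload would come from the API Gateway request body
  payload := []byte(`{"action":"opened","pull_request":{"head":{"ref":"my-feature"}}}`)
  if err := handle(payload, svc); err != nil {
    panic(err)
  }
}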

One catch here is that the AWS IAM permissions need to be set up so that the Lambda function is allowed to manage AWS CodePipelines:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCodePipelineMgmt",
            "Effect": "Allow",
            "Action": [
                "codepipeline:CreatePipeline",
                "codepipeline:DeletePipeline",
                "codepipeline:GetPipeline",
                "codepipeline:GetPipelineState",
                "codepipeline:ListPipelines",
                "codepipeline:UpdatePipeline",
                "iam:PassRole" 
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

AWS API Gateway

The API Gateway is managed via terraform and consists of a single API, where the root resource is wired up to handle webhooks. GitHub-specific headers are transformed so they are accessible in the backend. As GitHub will call this API Gateway we need to set appropriate Access-Control-Allow-Origin headers, otherwise requests will fail:

resource "aws_api_gateway_integration_response" "webhook" {
  rest_api_id = "${aws_api_gateway_rest_api.gh.id}"
  resource_id = "${aws_api_gateway_rest_api.gh.root_resource_id}"
  http_method = "${aws_api_gateway_integration.webhooks.http_method}"
  status_code = "200"

  response_templates = {
    "application/json" = "$input.path('$')"
  }

  response_parameters = {
    "method.response.header.Content-Type" = "integration.response.header.Content-Type"
    "method.response.header.Access-Control-Allow-Origin" = "'*'"
  }

  selection_pattern = ".*"
}

AWS CodePipeline

The AWS CodePipeline serving as the template is configured to run on master.
This way all merged pull requests trigger tests on this pipeline, while every open pull request runs on its own, separate AWS CodePipeline. This is great because every PR can be checked in parallel.

The current implementation forces all AWS CodePipelines to be identical. It would be interesting to adjust this approach, e.g. by fetching the CodePipeline template from the repository, so that pull requests can change it as needed.

AWS CodeBuild

In my example the AWS CodeBuild configuration is static. However, one could easily make it dynamic, e.g. by placing AWS CodeBuild configuration files inside the repository. This way PRs could actually test different build configurations.

Outcome

The approach outlined above works very well. It is reasonably fast and, since builds only run when they are triggered, utilization is effectively 100%. It also brings great extensibility options to the table: one could easily use this approach to spin up entire per-pull-request environments, and tear them down dynamically.
In the future I’m looking forward to working more with this approach, and maybe also abstracting it further for increased reusability.

The source is available on GitHub.


Running your own ZNC bouncer

Over the holidays I decided to get rid of the Slack Desktop application, a very RAM-hungry application I use constantly. Moving to the browser client is not an option, as I already have way too many tabs open in every browser you can imagine. Since I’m constantly using the terminal anyway, I decided to replace the Slack Desktop application with a command-line IRC client, irssi.

Now irssi works great with Slack via Slack’s IRC gateway, but I wanted to receive messages while I was offline, too. To the best of my knowledge this is only possible when running an IRC bouncer.

In this blog post I will give a top-level overview of how to run ZNC, an IRC bouncer, on Scaleway, exposing the bouncer via Amazon Route 53. All in all this should cost you 5€ per month.

Overview

We’ll make ZNC publicly accessible under a subdomain managed by AWS Route 53. The subdomain will point to a static Scaleway IP, which we’ll attach to a Scaleway server instance.

Using a static IP for this allows us to cycle the underlying instance any time, without requiring changes on the AWS Route 53 side.

The Scaleway server will be provisioned to run two containers: ZNC itself, and a sidecar which takes care of synchronizing data to and from AWS S3 for durable storage.

Note that the choice of AWS S3 for durable storage is an implementation detail that can easily be changed by adjusting the sidecar.

overview of architecture

ZNC container

ZNC does not yet ship an official container image, but there’s an unmerged PR which works just fine. I’ve published an image on Docker Hub which is based on that PR and can be rebuilt in a few steps:

$ git clone https://github.com/znc/znc.git /tmp/znc
$ curl -L -o /tmp/znc/Dockerfile https://raw.githubusercontent.com/torarnv/znc/ed457d889db012c645557bd3cb494139807486e8/Dockerfile
$ cd /tmp/znc 
$ git submodule update --init --recursive
$ docker build -t znc:$(git rev-parse HEAD) .

The image I’ve built is based on alpine:3.4, and the ZNC data directory is changed to /opt/znc instead.

sidecar container

The synchronization sidecar ensures that the ZNC configuration as well as all logs end up in AWS S3. If no local data exists, it also takes care of restoring it from S3. This ensures that we end up with a working setup when we cycle the server instance, since the data in S3 serves as the seed for the new instance.

The container itself contains only rclone and runs the synchronization periodically via crond, as well as on startup. The script executing the synchronization is just a couple of lines:

#!/bin/sh
# s3-sync.sh
export LOCAL_PATH=/mnt/data

echo "Running S3 Sync"

# seed an empty data directory from S3; otherwise push the local state to S3
if [ -z "$(ls -A "$LOCAL_PATH")" ]; then
  echo "overwriting local."
  rclone sync "remote:$S3_REMOTE_PATH" "$LOCAL_PATH"
else
  echo "overwriting remote."
  rclone sync "$LOCAL_PATH" "remote:$S3_REMOTE_PATH"
fi

echo "S3 Sync is done"

rclone requires a configuration file to be present. As we’re fetching the AWS credentials from the environment, the credential fields stay empty and only the region settings are filled in:

# .rclone.conf
[remote]
type = s3
env_auth = 1
access_key_id =
secret_access_key =
region = eu-central-1
endpoint =
location_constraint = EU

Automation

The entire setup is automated via Hashicorp’s terraform, and I’ve published it on GitHub as well. It consists of three small modules which take care of everything you need:

  • s3, which creates an S3 bucket as well as an IAM user with the permissions required to read and write data to the bucket.
  • znc, which creates a static IP and a server on Scaleway, and sets up the instance by creating systemd units to manage the previously described container images.
  • r53, which takes the static IP output of the znc module and sets up a new AWS Route 53 subdomain.

The root module wires these modules together:

provider "aws" {}

variable "bucket_name" {}

module "s3" {
  source = "./modules/s3"

  s3_bucket_name = "${var.bucket_name}"
}

provider "scaleway" {
  region = "ams1"
}

variable "znc_container_image" {
  default = "nicolai86/znc:b4b085dc2db69b58f2ad3bb4271ff3789e8301b5"
}
variable "sync_container_image" {
  default = "nicolai86/rclone-sync:v0.1.4"
}

module "znc" {
  source = "./modules/znc"

  aws_access_key_id     = "${module.s3.aws_access_key_id}"
  aws_secret_access_key = "${module.s3.aws_secret_access_key}"
  aws_s3_bucket_name    = "${module.s3.aws_s3_bucket_name}"

  znc_container_image  = "${var.znc_container_image}"
  sync_container_image = "${var.sync_container_image}"
}

variable "hosted_zone_id" {}
variable "hostname" {}

module "r53" {
  source = "./modules/r53"

  hosted_zone_id = "${var.hosted_zone_id}"
  service_ip     = "${module.znc.ip}"
  hostname       = "${var.hostname}"
}

closing thoughts

Using containers to deploy ZNC has the nice upside that native modules can easily be added during the build step. All you need to do is copy the source code of the module into the modules folder.

The Scaleway server instance has a deliberately minimal configuration, as I’d like to move the entire setup onto an orchestration system like Kubernetes someday. Using a sidecar container is preparation for this, as it allows me to move the entire setup over without big adjustments. Operationally speaking, a move to k8s would also address many blind spots of this setup: no self-healing, deployments with downtime, no monitoring, …

Since I’m using AWS S3 as the backup medium, it’s very easy to configure the entire setup locally, synchronize the data back to S3, and then create the production setup using terraform: all data is fetched from S3 and things continue running.

One thing I still need to address is a publicly valid SSL certificate, which requires some Let’s Encrypt integration. As this can very easily be added in a k8s cluster, it’s left out for now: the setup runs on a self-signed certificate. Another thing which needs adjustment is the synchronization interval: in the worst case, if the server crashes and nothing can be restored, I’ll lose 15 minutes’ worth of data.


Response time percentiles from opentracing

At work we have started using opentracing with a zipkin backend and elasticsearch as the storage layer. Zipkin creates per-day indices inside elasticsearch, meaning we can use the raw data to generate accurate response time percentiles for individual services, grouped by date.

After an initial draft which used an HdrHistogram on the client side, Zachary Tong, an elastic.co employee, suggested on Twitter to use the built-in HdrHistogram support in elasticsearch instead.

The resulting code is a little shorter, and it runs much faster because far less data needs to be transferred.

Let’s take the latest version apart:

Our elasticsearch backend is running on AWS, so we need to configure the elastic client to skip health checks and sniffing:

client, err := elastic.NewClient(
  elastic.SetURL("http://127.0.0.1:9200"),
  elastic.SetHealthcheck(false),
  elastic.SetSniff(false),
)
if client == nil || err != nil {
  panic(fmt.Errorf("Failed retrieving a client: %#v", err))
}

Also note that our elasticsearch cluster is configured to be accessible only via v4-signed requests, which is achieved by running an AWS signing proxy locally.

Now that we have a properly configured client, we need to create an elasticsearch query which lets us look only at traces for a single service. Opentracing by default adds a binaryAnnotation like this:

{
  "endpoint": {
    "serviceName": "service-a"
  }
}

Knowing that these binaryAnnotations are set on all traces, we can create a query which filters out all unwanted documents:

service := "service-a"

b := elastic.NewBoolQuery()
b.Must(elastic.NewMatchQuery("binaryAnnotations.endpoint.serviceName", service))

q := elastic.NewNestedQuery(
  "binaryAnnotations",
  b,
)

Now that we can filter documents by individual services, we need to instruct elasticsearch to execute a percentiles aggregation. The elastic client has support for percentiles aggregations, but these use TDigest by default. For HdrHistograms we need to add an hdr key.

To do this in Go, we need to fulfill the elastic.Aggregation interface ourselves:

type Aggregation interface {
  // Source returns a JSON-serializable aggregation that is a fragment
  // of the request sent to Elasticsearch.
  Source() (interface{}, error)
}

This can easily be achieved by struct embedding:

type hdrPercentilesAggregation struct {
  *elastic.PercentilesAggregation
}

func (c *hdrPercentilesAggregation) Source() (interface{}, error) {
  s, err := c.PercentilesAggregation.Source()
  if err != nil {
    return nil, err
  }

  m := s.(map[string]interface{})
  percentiles := m["percentiles"].(map[string]interface{})
  percentiles["hdr"] = map[string]interface{}{
    "number_of_significant_value_digits": 3,
  }

  return m, nil
}

With this change we can instruct elasticsearch to calculate response time percentiles using HdrHistograms, knowing that opentracing-go adds a duration key:

agg := elastic.NewPercentilesAggregation()
agg.Field("duration")

sr, err := client.Search().Index(index).Query(q).Aggregation("duration", &hdrPercentilesAggregation{agg}).Do()
if err != nil {
  return nil, err
}
res, found := sr.Aggregations.Percentiles("duration")
if !found {
  return nil, fmt.Errorf("Missing aggregation %q from result set", "duration")
} 

The requested percentiles are inside the res.Values map.
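
For completeness, here is a small sketch of reading those values out. The helper is illustrative and not part of the original code, and it assumes the elastic client’s AggregationPercentilesMetric result type; zipkin stores durations in microseconds, hence the division.

// printPercentiles dumps the percentile -> duration pairs returned by elasticsearch.
// Illustrative helper, not part of the original code.
func printPercentiles(res *elastic.AggregationPercentilesMetric) {
  for percentile, micros := range res.Values {
    // note: map iteration order is random; sort the keys first if you need stable output
    fmt.Printf("p%s: %.2fms\n", percentile, micros/1000.0)
  }
}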

It’s straightforward to generate images from this data via the really great github.com/gonum/plot package - but see for yourself.
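
As a rough sketch of what that can look like: the program below plots a percentile curve with gonum/plot. The data points and the output file name are made up for illustration; in practice the points would be filled from res.Values as shown above.

// sketch: plot percentile vs. duration for one service using gonum/plot.
// The data points and output file name are made up for illustration.
package main

import (
  "github.com/gonum/plot"
  "github.com/gonum/plot/plotter"
  "github.com/gonum/plot/vg"
)

func main() {
  p, err := plot.New()
  if err != nil {
    panic(err)
  }
  p.Title.Text = "service-a response times"
  p.X.Label.Text = "percentile"
  p.Y.Label.Text = "duration (ms)"

  // one point per percentile, e.g. filled from res.Values
  points := plotter.XYs{
    {X: 50, Y: 12.3},
    {X: 95, Y: 87.1},
    {X: 99, Y: 240.5},
  }

  line, err := plotter.NewLine(points)
  if err != nil {
    panic(err)
  }
  p.Add(line)

  if err := p.Save(6*vg.Inch, 4*vg.Inch, "service-a.png"); err != nil {
    panic(err)
  }
}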

The gist contains a binary that generates CSVs or PNGs for a specific service & time range:

$ go run plot.go main.go -csv=false -img=true -prefix zipkin- service-a 2016-12-01 2016-12-20

example image of response time percentiles


Scaleway on terraform: remote-exec provisioners

In this blog post I want to explore two options for using terraform and the remote-exec provisioner with the new Scaleway cloud provider.

using Scaleway

First, sign up for Scaleway. Once you have a Scaleway account, export the required credentials to your environment like this:

export SCALEWAY_ACCESS_KEY=<your-access-key> 
export SCALEWAY_ORGANIZATION=<your-organization-id>

You can find both values easily by using the scw CLI; it’ll write them to the ~/.scwrc file.

Now you can use the scaleway provider like this:

provider "scaleway" {}

resource "scaleway_server" "server" {
  name = "my-server"
  type = "C1"
  image = "eeb73cbf-78a9-4481-9e38-9aaadaf8e0c9" # ubuntu 16.06
}

You’re now ready to manage your scaleway infrastructure with terraform!

public hosts

By default, the scaleway_server resource will create internal servers only, meaning the servers won’t have a public IP. In order to use remote-exec, however, the server must be accessible.

The easiest way to achieve this is by exposing your server using the dynamic_ip_required attribute:

provider "scaleway" {}

resource "scaleway_server" "server" {
  name  = "my-server"
  type  = "C1"
  image = "eeb73cbf-78a9-4481-9e38-9aaadaf8e0c9" # ubuntu 16.06

  dynamic_ip_required = true

  provisioner "remote-exec" {
    inline = "echo hello world"
  }
}

Now your server will get a public IP assigned and remote-exec will work out of the box!

jump hosts

When you don’t want to expose your servers, you can set up a publicly accessible jump host, which can then be used to access your internal servers:

provider "scaleway" {}

resource "scaleway_server" "jump-host" {
  name  = "my-jump-host"
  type  = "C1"
  image = "eeb73cbf-78a9-4481-9e38-9aaadaf8e0c9" # ubuntu 16.06

  dynamic_ip_required = true
}

resource "scaleway_server" "server" {
  type  = "C1"
  image = "eeb73cbf-78a9-4481-9e38-9aaadaf8e0c9" # ubuntu 16.06

  connection {
    type         = "ssh"
    user         = "root"
    host         = "${self.private_ip}"
    bastion_host = "${scaleway_server.jump-host.public_ip}"
    bastion_user = "root"
    agent        = true
  }

  provisioner "remote-exec" {
    inline = "echo hello world"
  }
}

This way, only your jump host is publicly accessible and all other servers remain internal.

That’s it for now. Enjoy Scaleway on terraform :)