The Conversation launched over six years ago and our continuous integration infrastructure has evolved over that time.

For the first few years we happily ran Jenkins on a single physical server. Our feature set grew rapidly and by early 2013 the build time for our core CMS codebase peaked at an unhelpful 28 minutes. We explored options for speeding up our feedback loop and eventually ended up with a 6-minute parallel build running on three physical build servers, coordinated by Buildkite.

It feels like no development team is ever happy with their build time, but we found 6 minutes a pragmatic average that avoided most long waits for a result.

The challenge from there was avoiding the inevitable build time increase as new tests were added over time, and that required a way to monitor the long term trends.

We use librato for monitoring and it felt like a good fit for this situation too. Buildkite provide webhooks and we have a lita-powered slack bot that is well suited to hosting glue-code like this.

The solution we settled on provides a librato dashboard displaying the four-week build time trend for each of our codebases. Each chart includes the raw data that fluctuates a bit, plus a smoothed line based on the average of the last ten builds.

Our build time dashboard for May 2017. CC BY-NC-ND
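
The smoothed line is just a trailing average. In our setup librato does the smoothing for us, so the following is only a minimal sketch of the idea in plain Ruby; the method name and the sample build times are made up for illustration.

```ruby
# Trailing moving average: each point is the mean of the current value
# and up to (window - 1) values before it. Early points use whatever
# history is available so the output has the same length as the input.
def trailing_average(values, window: 10)
  values.each_index.map do |i|
    start = [0, i - window + 1].max
    slice = values[start..i]
    slice.sum.to_f / slice.size
  end
end

# Hypothetical build times in seconds, oldest first.
build_seconds = [360, 372, 355, 410, 365]
smoothed = trailing_average(build_seconds, window: 3)
```

A larger window gives a calmer line but takes longer to reflect a real change in build time.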

There were three puzzle pieces:

  1. Adding lita-buildkite to our lita bot and configuring buildkite to send it webhook events
  2. Adding lita-librato to our lita bot to simplify submitting metrics
  3. Writing a custom lita handler to pipe the timing data from buildkite webhooks across to librato

The glue code isn’t in a public repo, but it’s small enough to replicate here:

require "lita"
require "lita-buildkite"

module Lita
  module Handlers
    class BuildkiteJobStats < Handler
      on :buildkite_build_finished, :build_finished

      def build_finished(payload)
        event = payload[:event]
        record_build_stats(event) if event.build_branch == "master"
      end

      private

      def record_build_stats(event)
        runtime_seconds = (event.build_finished_at - event.build_started_at).to_i
        robot.trigger(:librato_submit, name: "ci.build-runtime.#{event.pipeline_slug}", type: :gauge, source: "buildkite", value: runtime_seconds)

        total_seconds = (event.build_finished_at - event.build_created_at).to_i
        robot.trigger(:librato_submit, name: "ci.build-totaltime.#{event.pipeline_slug}", type: :gauge, source: "buildkite", value: total_seconds)
      end
    end

    Lita.register_handler(BuildkiteJobStats)
  end
end

It’s only 27 lines, but it works.
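
The two metrics differ only in their starting point: the runtime is measured from when the build started, while the total time is measured from when it was created, so it also includes any time spent waiting for a free build server. A small sketch of that arithmetic, assuming the event timestamps behave like Ruby Time objects (the timestamps below are invented):

```ruby
require "time"

# Invented timestamps for a single build.
created_at  = Time.iso8601("2017-05-01T10:00:00Z") # build requested
started_at  = Time.iso8601("2017-05-01T10:01:30Z") # first job picked up
finished_at = Time.iso8601("2017-05-01T10:07:30Z") # last job done

# Subtracting two Time objects gives the difference in seconds as a Float.
runtime_seconds = (finished_at - started_at).to_i
total_seconds   = (finished_at - created_at).to_i

# The gap between the two metrics is the time spent queued.
queue_seconds = total_seconds - runtime_seconds
```

Charting both makes it easier to tell a slow test suite apart from a shortage of build servers.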

In May 2017 we upgraded our standard operating environment from Ubuntu 14.04 to Ubuntu 16.04. When the upgrade was applied to our build servers we inadvertently changed a PostgreSQL setting, which added over a minute to the build. Within a few days the increase became obvious on the trend line, and we identified and then resolved the issue.

Total build time for our core CMS over a four week period. The period of slower builds was due to build server misconfiguration after an upgrade to the host operating system. CC BY-NC-ND

As an extra safety net we have librato configured to alert us via slack if the average build time passes a threshold.

When the average build time for our project exceeds a threshold, we’re alerted via slack. CC BY-NC-ND
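
The alert itself is configured in librato rather than in code, but the rule it applies is simple. A rough sketch of the logic in Ruby, where the threshold value and the method name are hypothetical:

```ruby
# Hypothetical threshold for illustration: eight minutes, in seconds.
THRESHOLD_SECONDS = 8 * 60

# True when the mean of the recent build times is above the threshold.
# An empty list never alerts, since there is nothing to average.
def alert?(recent_build_seconds, threshold: THRESHOLD_SECONDS)
  return false if recent_build_seconds.empty?

  average = recent_build_seconds.sum.to_f / recent_build_seconds.size
  average > threshold
end
```

Alerting on the average rather than on a single build avoids noise from the occasional slow run.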

We’ve had this setup in place for about a year and we’ve had at least three occasions where configuration or spec changes have impacted build time significantly and been reversed once discovered.

With a bit of luck, the bad old days of 28-minute builds will remain a bad memory.

This article was originally published on The Conversation.