The Conversation launched over six years ago and our continuous integration infrastructure has evolved over that time.
For the first few years we happily ran Jenkins on a single physical server. Our feature set grew rapidly and by early 2013 the build time for our core CMS code base peaked at an unhelpful 28 minutes. We explored options for speeding up our feedback loop and eventually ended up with a 6-minute parallel build running on three physical build servers, coordinated by Buildkite.
It feels like no development team is ever happy with their build time, but we found 6 minutes a pragmatic average that avoided most long waits for a result.
The challenge from there was avoiding the inevitable build time increase as new tests were added, and that required a way to monitor the long-term trends.
We use Librato for monitoring and it felt like a good fit for this situation too. Buildkite provide webhooks and we have a Lita-powered Slack bot that is well suited to hosting glue code like this.
The solution we settled on provides a Librato dashboard displaying the four-week build time trend for each of our codebases. Each chart includes the raw data, which fluctuates a bit, plus a smoothed line based on the average of the last ten builds.
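As a rough illustration of the arithmetic behind the smoothed line (not the dashboard configuration itself, which isn't reproduced here), a trailing average over the last ten builds could be sketched as follows. The method name `smoothed_build_times` and the sample durations are made up for the example.

```ruby
# Trailing moving average: each point is the mean of that build and up to
# (window - 1) builds before it. Early points use however many builds exist.
# NOTE: hypothetical helper for illustration, not part of our bot.
def smoothed_build_times(build_seconds, window: 10)
  build_seconds.each_index.map do |i|
    start = [0, i - window + 1].max
    slice = build_seconds[start..i]
    slice.sum.to_f / slice.size
  end
end

# Made-up build durations in seconds, ending with one slow build.
raw = [350, 362, 341, 420, 355, 358, 349, 361, 357, 352, 430]
smoothed = smoothed_build_times(raw)

puts "latest raw build: #{raw.last}s"
puts "smoothed value:   #{smoothed.last.round(1)}s"
```

A single slow build moves the smoothed value far less than it moves the raw series, which is why the smoothed line is the easier one to read for trends.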
There were three puzzle pieces:
The glue code isn’t in a public repo, but it’s small enough to replicate here:
require "lita"
require "lita-buildkite"

module Lita
  module Handlers
    class BuildkiteJobStats < Handler
      on :buildkite_build_finished, :build_finished

      def build_finished(payload)
        event = payload[:event]
        record_build_stats(event) if event.build_branch == "master"
      end

      private

      def record_build_stats(event)
        runtime_seconds = (event.build_finished_at - event.build_started_at).to_i
        robot.trigger(:librato_submit, name: "ci.build-runtime.#{event.pipeline_slug}", type: :gauge, source: "buildkite", value: runtime_seconds)

        total_seconds = (event.build_finished_at - event.build_created_at).to_i
        robot.trigger(:librato_submit, name: "ci.build-totaltime.#{event.pipeline_slug}", type: :gauge, source: "buildkite", value: total_seconds)
      end
    end

    Lita.register_handler(BuildkiteJobStats)
  end
end
It’s only 27 lines, but it works.
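To make the difference between the two gauges concrete, here is a small standalone sketch of the same subtraction. It assumes the three timestamps behave like Ruby `Time` objects; `FakeBuildEvent` and the times are hypothetical stand-ins, not the real event class.

```ruby
# Hypothetical stand-in for the build event, holding only the three
# timestamps the handler reads. The real event object is not shown here.
FakeBuildEvent = Struct.new(:build_created_at, :build_started_at, :build_finished_at)

event = FakeBuildEvent.new(
  Time.utc(2017, 5, 1, 10, 0, 0),   # build created (queued)
  Time.utc(2017, 5, 1, 10, 1, 30),  # an agent picked it up
  Time.utc(2017, 5, 1, 10, 7, 45)   # build finished
)

# Subtracting two Time objects gives a Float number of seconds.
runtime_seconds = (event.build_finished_at - event.build_started_at).to_i
total_seconds   = (event.build_finished_at - event.build_created_at).to_i
queue_seconds   = total_seconds - runtime_seconds

puts "runtime: #{runtime_seconds}s, total: #{total_seconds}s, queued: #{queue_seconds}s"
```

The runtime gauge covers only the time spent building, while the total time gauge also includes any wait for a free build server.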
In May 2017 we upgraded our standard operating environment from Ubuntu 14.04 to Ubuntu 16.04. When the upgrade was applied to our build servers we inadvertently changed a PostgreSQL setting that added over a minute to the build. Within a few days the increase became obvious on the trend line and we identified and then resolved the issue.
As an extra safety net we have Librato configured to alert us via Slack if the average build time passes a threshold.
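The alert itself is configured in the monitoring service rather than in code, but the check it performs amounts to something like the sketch below. The method name `build_time_alert`, the 420 second threshold and the sample durations are assumptions for illustration only.

```ruby
# Returns an alert message when the mean of the recent builds is over the
# threshold, otherwise nil. Hypothetical helper, not our real configuration.
def build_time_alert(recent_build_seconds, threshold_seconds:)
  return nil if recent_build_seconds.empty?

  average = recent_build_seconds.sum.to_f / recent_build_seconds.size
  return nil unless average > threshold_seconds

  format("Average build time %.0fs is over the %ds threshold", average, threshold_seconds)
end

# Made-up durations in seconds and an assumed threshold of 7 minutes.
slow = build_time_alert([400, 450, 460], threshold_seconds: 420)
fine = build_time_alert([350, 360], threshold_seconds: 420)

puts slow unless slow.nil?
puts "no alert" if fine.nil?
```

Alerting on an average rather than on individual builds avoids noise from the occasional slow run.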
We’ve had this setup in place for about a year and we’ve had at least three occasions where configuration or spec changes have impacted build time significantly and been reversed once discovered.
With a bit of luck, the bad old days of 28-minute builds will remain a bad memory.
This article was originally published on The Conversation.