Don't Shoot Jugglers; They are Ashamed Already: How a Series of Screw-Ups Has Led Us to Improved Quality Assurance

A good programmer is someone who always looks both ways before crossing a one-way street.
Doug Linder

There is a saying about programming: it is as easy as riding a bicycle. A burning bicycle. While you are burning too. And everything around is on fire. And you’re in hell.

The same is true for software development in general, with one extra detail: while cycling, the developers also juggle a dozen things, from rubber balls and running chainsaws to a couple of chimpanzees.

Nearby, other jugglers are circling the blazing arena: analysts, data scientists, testers, etc. They throw objects at each other while new chainsaws and geranium pots keep falling from the dome.

Ace jugglers have engraved the trajectories of objects in their minds and can track colleagues with peripheral vision by just a flare in the corner of their eye. They have developed animal instincts and superhuman senses. But as in any other craft, the path to mastery is paved with mistakes. I want to share how our team learned to juggle while developing one product.

I am talking about the quality assurance of a system for publishers that predicts the delivery of advertising campaigns on their websites. The system simulates campaigns with any custom targeting parameters and immediately calculates whether the publisher can fulfil the contract with the advertiser and how to adjust the targeting parameters to achieve the goal.

When I joined the team several years ago, I was pleasantly surprised by how well folks had set up quality control. They’d covered almost all UI functionality with autotests and would check non-automated cases manually using formalised test cases. The developers wrote unit and integration tests diligently and tested new features in the early stages of development.

Coming from a company where there was nothing of the kind, I looked at these wonders of progress with eyes wide open, like a Neanderthal at a spaceship. What issues could seep through such advanced quality control? The processes seemed flawless, and the future looked bright and untroubled.

January 2018. “Guys, we are in trouble: there is an issue in production!”

This simple phrase opened one sweaty workday. The team did well and restored the system quickly, though the cause was unexpected.

The client sells ad impressions via an ad server. We integrate with this server through the API, download data on sold impressions, and build a forecast based on them.

Every morning, before the workday starts, the system goes through a learning phase: it updates the data on impressions via the API and improves the calculations. During this phase, forecasts are unavailable by design.

However, this time, data synchronization took too long, so the duty support engineer started the training manually on incomplete data so that the system would be ready on time. While the algorithms were learning, the client data finally synchronised, and the system decided to run one more learning cycle, now on the complete data package. As a result, forecasting ended up being unavailable.

In response, we wrote automated health checks for each client’s production system. Immediately after the scheduled learning phase, they check whether forecasting is available through the API and the UI. If the system has not calculated the forecast on time, the health checks alert the support channel in Slack.
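A minimal sketch of such a post-training health check, in Python. The probe and alert callables are hypothetical stand-ins: in a real setup the probe would query the forecasting API or UI, and the alert would post to the Slack support channel. None of these names come from the original system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthResult:
    client: str
    healthy: bool
    detail: str


def run_health_check(
    client: str,
    probe: Callable[[], bool],    # hypothetical: returns True if a forecast can be obtained
    alert: Callable[[str], None], # hypothetical: e.g. posts a message to the support channel
) -> HealthResult:
    """Run one probe for one client and alert if forecasting is unavailable."""
    try:
        ok = probe()
        detail = "forecast available" if ok else "forecast not ready"
    except Exception as exc:
        # A probe that cannot even run counts as a failed check.
        ok = False
        detail = f"probe failed: {exc}"
    if not ok:
        alert(f"[{client}] health check failed: {detail}")
    return HealthResult(client, ok, detail)
```

Passing the probe and the alert in as functions keeps the check itself free of network code, so the same logic can be exercised in tests with simple stubs.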

Lesson learned: quality control does not end after the system is delivered to the customer. At every moment in time, we must be aware of the current state of the system.

March 2018. “Guys, the forecast on production isn’t working, again.”

What the hell? Health checks were silent, and apart from the client’s mood, everything seemed to be in order. We felt a touch of déjà vu, just two months after the previous incident.

It turned out that the synchronization with the ad server was abnormally delayed again, the learning process started late, and the system got back on track 40 minutes later.

Then why weren’t the health checks alerting, we asked ourselves. It turned out that our CI/CD server, TeamCity, which launched the health checks, was standing all gloomy without a single build agent (a machine that performs CI/CD server tasks), while the health checks had gathered cheerfully into a queue.

The scheduled automatic restart had simply decided not to bring the build agents back up, for whatever reason. The monitoring of the internal infrastructure was in its infancy, and we did not detect the problem in time.

After that, the developers taught the applications to send metrics to Graphite, and the admins set up Grafana and alerts. Now, if health checks suddenly stop coming, Grafana considerately taps us on the shoulder: “Folks, you have a problem.” Peace and tranquillity reigned again.
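The "alert when health checks stop coming" idea is often called a heartbeat or dead man's switch. Here is a minimal Python sketch of the staleness rule. The function name and the silence threshold are assumptions for illustration; the article does not describe how the Grafana alert was actually configured.

```python
from datetime import datetime, timedelta
from typing import Optional


def heartbeat_is_stale(
    last_seen: Optional[datetime],
    now: datetime,
    max_silence: timedelta,  # assumed threshold, chosen by the team
) -> bool:
    """Return True when the health checks have been silent for too long."""
    if last_seen is None:
        # No report has ever arrived: treat it as a problem, not as "all good".
        return True
    return now - last_seen > max_silence
```

The key design choice is that silence is treated as a failure: a health check that never runs can never report that something is broken.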

Lessons learned:

  1. No notification about a problem does not mean there is no problem.
  2. No notification about the absence of a problem definitely means there is a problem.

April 2018. “Guys, your system is not available,”

the client relayed to us. Blushing fiercely, we stared sternly at the health checks: everything was alright. We checked the system manually: it was in order. We checked the availability of the system from the external network and, voilà, the line was dead; the system was inaccessible.

The alarmed infrastructure team reported civilian casualties from the attempt by Roskomnadzor (the Russian Federal Service for Supervision of Communications, Information Technology and Mass Media) to block Telegram. In the heat of the fight, some of our machines’ IP addresses fell victim to the blocking.

Mentioning Roskomnadzor in restrained, not at all angry terms, the infrastructure team extinguished the fire. Our QA team concluded that it was unreasonable to keep the health checks and the system under test on the same network. That is how we started checking the system’s availability from different parts of the world.

Lesson learned: the fact that the system is available to us does not guarantee that it is available to the client. One should at least check the state of the system from the external network and, if possible, collect client-side metrics.
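To make the lesson concrete, here is a small Python sketch that combines probe results from several vantage points into one verdict. The vantage point names and the verdict strings are assumptions for illustration, not the team's actual setup.

```python
from typing import Dict


def classify_availability(probes: Dict[str, bool]) -> str:
    """Summarise probe results keyed by vantage point name."""
    if not probes:
        return "unknown"
    if all(probes.values()):
        return "up"
    if not any(probes.values()):
        return "down"
    # Mixed results: reachable from some networks but not from others.
    failed = sorted(name for name, ok in probes.items() if not ok)
    return "partial: unreachable from " + ", ".join(failed)
```

The "partial" verdict is the case from the incident above: the internal probe succeeds while external ones fail, which a single in-network health check cannot reveal.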

September 2018. The Ministry of Health warns: alerts are addictive

The Graphite integration was beckoning and enchanting, and the developers would add more and more metrics to it. And each new metric would bring more and more alerts ranging from “‘tis but a scratch” to “WE WILL DIE, WE’RE ALL GONNA DIE, FOR GOD’S SAKE DO SOMETHING, NOW!”.

Strangely, the never-ending flood of notifications started to overwhelm our vigilant support engineers. But this time, we did not wait until we overlooked something crucial; we divided the alerts into three groups:

  1. The most critical problems, which affect the client’s operation, forecast accuracy, etc., i.e. those that require an immediate response.
  2. Problems that do not require urgent attention but risk escalating to the first type in a few days.
  3. Notifications on data anomalies or atypical values of second-tier metrics. These help us detect the seeds of problems that could grow into an erupting volcano in weeks or months.

For each alert, we wrote a playbook on how to solve the problem and how to escalate it if the guide did not work.
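The three groups and the per-alert playbooks could be modelled roughly as follows. This is a Python sketch under stated assumptions: the tier names, the response for each tier, and the playbook paths are all hypothetical and are not taken from the team's configuration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    CRITICAL = 1    # affects the client's operation or forecast accuracy
    ESCALATING = 2  # not urgent, but may become critical within days
    ANOMALY = 3     # data anomalies, atypical second-tier metric values


# Hypothetical responses per tier; the article does not specify them.
RESPONSE = {
    Tier.CRITICAL: "page the duty engineer immediately",
    Tier.ESCALATING: "handle during working hours",
    Tier.ANOMALY: "log for periodic review",
}


@dataclass(frozen=True)
class Alert:
    name: str
    tier: Tier
    playbook: str  # hypothetical reference to the playbook for this alert


def route(alert: Alert) -> str:
    """Describe what should happen when this alert fires."""
    return f"{alert.name}: {RESPONSE[alert.tier]} (see {alert.playbook})"
```

Making the tier a required field means nobody can add a new alert without first deciding how urgent it is and where its playbook lives.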

We also implemented the Opsgenie system, which calls and messages the duty engineer with an issue description. If the engineer, for some reason, does not respond, Opsgenie hassles other people until someone takes care of the problem.
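The escalation behaviour described here boils down to walking a chain of people until someone acknowledges. The Python sketch below illustrates that logic only; it is not the Opsgenie API, and the function and parameter names are made up for this example.

```python
from typing import Callable, Optional, Sequence


def escalate(
    chain: Sequence[str],
    notify_and_wait: Callable[[str], bool],  # hypothetical: True if the person acknowledged
) -> Optional[str]:
    """Notify people in order until one acknowledges; return who took the issue."""
    for person in chain:
        if notify_and_wait(person):
            return person
    # Nobody responded: the caller must decide what happens next.
    return None
```

Returning None when the whole chain stays silent makes the unanswered case explicit instead of letting an incident quietly disappear.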

At the same time, we started an orderly withdrawal from old monitoring systems such as Zabbix and Sentry and custom email alerts as, to our surprise, support engineers were not stoked to monitor several different systems at once.

Lessons learned:

  1. System health metrics alert us to a problem at its very root.
  2. An uncontrolled stream of alerts causes us to miss critical issues.
  3. We should cut down the number of notification channels, structure the stream of alerts, and define procedures for emergencies.

Instead of a conclusion

What have we gained as a result (apart from paranoia)?

We’ve made the product ecosystem much more stable.

We’ve learned how to control the product quality after its delivery to the client, how to deploy high-quality check-ups on high-quality infrastructure, and how to think through situations like “what will happen if this test does not work?”

We’ve also learned to draw accurate conclusions from mistakes.

And most importantly, we’ve adopted a proactive approach: predicting problems in production and covering anticipated risks with automated tests.

It is impossible to foresee everything, but we continue to steadily gain ground against Her Majesty Uncertainty.

Pavel Kiselev, QA Lead