“Wisdom is not a product of schooling but of the lifelong attempt to acquire it.”
- Context Situation
- Issue Summary
- Lessons Learned
- Supporting Information
Our infrastructure runs on Amazon Elastic Kubernetes Service (EKS), which is managed (mostly) by AWS and scales automatically with the current load. There we deploy not only our customer-facing services but also internal self-hosted tooling.
One of those tools is Metabase, a key player in Business Intelligence and Data Analysis. This platform requires high availability (like the rest of the system, of course) so that our Data Team can interpret metrics (KPIs) and make the right business decisions.
One main event made the Metabase service unavailable (returning an HTTP 503 error):
- An accidental Kubernetes misconfiguration put a node in an undefined state; in Kubernetes terms, the node was tainted.
Even though the Metabase pod appeared healthy (status: Running), there was no way to access it. On top of that, no member of the infrastructure team was available, and there was no runbook with instructions to solve the problem.
- No alarm or notification was received about an error in a tool that is critical for the customer.
- When trying to restart the Metabase pod, there was no clear, specific documentation on how to proceed.
- If restarting the pod did not solve the issue, there were no clear instructions on how to reset the node hosting the pod.
- If restarting the node did not solve the issue, there was no further documentation either.
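A runbook for this kind of incident could start by listing which nodes carry taints, since the pod status alone did not reveal the problem. The following is a minimal sketch in Python, assuming the output of `kubectl get nodes -o json` has been captured as text. The function name `tainted_nodes` and the sample node names are hypothetical and used only for illustration.

```python
import json


def tainted_nodes(nodes_json: str) -> dict:
    """Map node name -> list of 'key=value:effect' strings for tainted nodes.

    Expects text shaped like the output of `kubectl get nodes -o json`.
    Nodes without taints are left out of the result.
    """
    result = {}
    for node in json.loads(nodes_json).get("items", []):
        taints = node.get("spec", {}).get("taints") or []
        if taints:
            name = node["metadata"]["name"]
            result[name] = [
                f"{t.get('key')}={t.get('value', '')}:{t.get('effect')}"
                for t in taints
            ]
    return result


# Hypothetical sample data, not taken from the real incident.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "node-a"},
            "spec": {
                "taints": [
                    {"key": "node.kubernetes.io/unschedulable",
                     "effect": "NoSchedule"}
                ]
            },
        },
        {"metadata": {"name": "node-b"}, "spec": {}},
    ]
})

print(tainted_nodes(SAMPLE))
```

A check like this only reports what the cluster says about each node. Deciding whether a given taint is expected still needs someone who knows the cluster.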
The tool described above was critical for understanding key business metrics, and as a result of this outage some decision-making had to be postponed, impacting our Customers and Finance areas.
In Kubernetes, specific situations like this one require removing a node from service. In other words, the node can be drained, which means that the containers running on it are gracefully terminated (and potentially rescheduled on another node).
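As an illustration of the drain step, here is a minimal Python sketch that builds the `kubectl` commands a runbook might list. It assumes `kubectl` is installed and pointed at the right cluster. The helper names and the node name are hypothetical, and the flags are one common choice, not necessarily the ones used during this incident.

```python
import subprocess


def drain_command(node: str) -> list:
    """Build the command that evicts pods from a node and marks it unschedulable."""
    return [
        "kubectl", "drain", node,
        "--ignore-daemonsets",      # DaemonSet pods cannot be evicted; skip them
        "--delete-emptydir-data",   # allow evicting pods that use emptyDir volumes
    ]


def uncordon_command(node: str) -> list:
    """Build the command that makes a repaired node schedulable again."""
    return ["kubectl", "uncordon", node]


def run(cmd: list, dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "node-a" is a placeholder; the default dry run only prints the commands.
    run(drain_command("node-a"))
    run(uncordon_command("node-a"))
```

Keeping the dry run as the default means the script can be reviewed or pasted into a runbook without touching the cluster.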
This action solved the problem. The lessons learned from the incident:
- Improve the alerting system so that alerts fire when tools are down, before a user, customer, or client finds out.
- Invest in robust tooling like Prometheus and Grafana, and configure it properly.
- Set up proper runbooks and protocols so that issues can be solved by following specific, clear instructions.
- On-call policies are a must in order to act quickly on a critical issue.
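Because the pod reported Running while users received a 503, an alert based on what users actually see would have caught this earlier than one based on pod status. The sketch below is a minimal external probe in Python, not a replacement for Prometheus or Grafana. The function names are hypothetical, and the health path `/api/health` is an assumption to verify against the deployed Metabase version.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 503.
        return exc.code
    except OSError:
        # DNS failure, refused connection, timeout, and similar errors.
        return None


def should_alert(status) -> bool:
    """Alert on anything other than a plain HTTP 200."""
    return status != 200
```

A scheduled job could call `should_alert(probe(url))` against the service's health URL and notify the on-call person whenever it returns `True`.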