Homepage, Query API, and 2 other services are down
Resolved
Aug 12 at 09:24am EDT
Detection
On August 8, 2024 at 16:32 CEST, our betteruptime.com monitors reported that open-vsx.org was no longer reachable.
Lead up
In Kubernetes-based cluster environments, services such as websites run in isolated container environments called pods. To route incoming traffic for a specific URL (e.g. open-vsx.org) to the right pods, so-called services are defined, which select pods based on a set of custom labels such as “app=open-vsx-org, environment=production”. These label sets need to be distinct so that a service does not also select unrelated pods. Open VSX uses Elasticsearch to index extensions and make them searchable, and Elasticsearch runs in separate pods. The label set mentioned above was assigned to both the website pods and the Elasticsearch pods. Despite the overlap, this setup worked for several years: before version 2.14 of the Elasticsearch ECK operator, the Elasticsearch pods did not reset connections coming from HAProxy. When a request was routed to an Elasticsearch pod, HAProxy would encounter a timeout, try a different pod, and eventually succeed.
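The overlap described above can be illustrated with a minimal sketch. It is a toy model of equality-based label selection, not the Kubernetes API; the pod names and the `matches` helper are hypothetical, and only the two labels are taken from this report.

```python
# Toy model of equality-based label selection (not the real Kubernetes API).
def matches(selector: dict, labels: dict) -> bool:
    """A pod is selected when it carries every key/value pair of the selector."""
    return all(labels.get(key) == value for key, value in selector.items())

# Hypothetical pod names; both pods carry the same label set, as in the incident.
pods = {
    "website-0": {"app": "open-vsx-org", "environment": "production"},
    "elasticsearch-0": {"app": "open-vsx-org", "environment": "production"},
}
service_selector = {"app": "open-vsx-org", "environment": "production"}

selected = sorted(name for name, labels in pods.items()
                  if matches(service_selector, labels))
print(selected)  # ['elasticsearch-0', 'website-0']
```

Because both pods carry an identical label set, the service selects the Elasticsearch pod as a backend for website traffic as well.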
Fault
An automatic update of the so-called “Elasticsearch ECK operator”, which provides specific resources for Elasticsearch pods, triggered a restart of all Elasticsearch pods (3 production pods, 2 staging pods). The update included a change that caused the Elasticsearch pods to reset connection requests from HAProxy, which resulted in 502 errors.
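The difference between the two behaviours can be sketched as a toy model. This is an illustration of what this report describes, not HAProxy's actual retry logic; the `proxy` function, the behaviour names, and the 503 fallback are assumptions made for the example.

```python
# Toy model of the behaviour described in this report (not HAProxy's real logic).
def proxy(backends: list) -> int:
    """Return an HTTP status for one request, trying backends in order."""
    for behaviour in backends:
        if behaviour == "ok":
            return 200   # a website pod answered
        if behaviour == "timeout":
            continue     # old behaviour: wait, then try a different pod
        if behaviour == "reset":
            return 502   # new behaviour: connection reset surfaces as an error
    return 503           # assumption: no backend could serve the request

before_update = ["timeout", "ok"]  # Elasticsearch pod first, then a website pod
after_update = ["reset", "ok"]

print(proxy(before_update))  # 200
print(proxy(after_update))   # 502
```

In this model the misrouted request is slow but succeeds before the update, and fails immediately after it, even though a healthy website pod is available in both cases.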
Root causes
Use of a non-distinct set of labels for the website pods and the Elasticsearch pods.
Mitigation and resolution
An initial analysis concluded that the website pods were serving content correctly when accessed directly. Further investigation at the network level followed. After several possible causes had been ruled out, the non-distinct set of labels was identified. Once a distinct label had been added to the website pods and to the service selector, the open-vsx.org website became reachable again, which confirmed the assumption.
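The effect of the fix can be sketched in the same toy model of label selection. The label `component=website`, the pod names, and the `matches` helper are hypothetical; the report does not state which label was actually added.

```python
# Toy model of equality-based label selection (not the real Kubernetes API).
def matches(selector: dict, labels: dict) -> bool:
    """A pod is selected when it carries every key/value pair of the selector."""
    return all(labels.get(key) == value for key, value in selector.items())

shared = {"app": "open-vsx-org", "environment": "production"}

pods = {
    # Hypothetical distinct label added to the website pods only.
    "website-0": {**shared, "component": "website"},
    "elasticsearch-0": dict(shared),
}

# The service selector is extended with the same distinct label.
fixed_selector = {**shared, "component": "website"}

selected = [name for name, labels in pods.items()
            if matches(fixed_selector, labels)]
print(selected)  # ['website-0']
```

With the extra label in both places, the Elasticsearch pod no longer satisfies the selector, so only website pods receive website traffic.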
Lessons learnt
Deployments with multiple pods require distinct labels to avoid faulty service routing. Persistent logs are essential for analyzing such incidents. Reaching out to fellow IT colleagues for help brings in different domain-specific knowledge and increases the chance of resolving the issue faster.
Timeline
2024-08-08 16:27 (CEST) Elasticsearch Operator automatic update
2024-08-08 16:32 (CEST) open-vsx.org (production instance) becomes unreachable (betteruptime alerts start)
Monitors recover intermittently in the following hours
2024-08-08 21:11 (CEST) While analyzing the issue, Denis asks for assistance in the Slack chat
Several members of the IT team try to analyze the issue
2024-08-08 23:37 (CEST) Denis and Thomas identify and fix the label/selector issue, and open-vsx.org becomes reachable again
2024-08-09 14:29 (CEST) Fred commits changes that add specific labels, to prevent Elasticsearch pods from being selected to serve open-vsx.org
Affected services
Query API
VSCode Assets API
Updated
Aug 09 at 12:45am EDT
Query API and VSCode Assets API recovered.
Affected services
Query API
VSCode Assets API
Updated
Aug 09 at 12:40am EDT
Homepage and Search API recovered.
Affected services
Search API
Homepage
Updated
Aug 08 at 10:47am EDT
VSCode Assets API went down.
Affected services
VSCode Assets API
Updated
Aug 08 at 10:47am EDT
Query API went down.
Affected services
Query API
Updated
Aug 08 at 10:42am EDT
Search API went down.
Affected services
Search API
Created
Aug 08 at 10:38am EDT
Homepage went down.
Affected services
Homepage