Downtime

Homepage, Query API, and 2 other services are down

Aug 8, 2024 at 2:38pm UTC
Affected services
Homepage
Query API
Search API
VSCode Assets API

Resolved
Aug 12, 2024 at 1:24pm UTC

Detection

On August 8, 2024 at 16:32 CEST, our betteruptime.com monitors reported that open-vsx.org was no longer reachable.

Lead up

In Kubernetes-based cluster environments, services such as websites run in isolated container environments called pods. To route incoming traffic for a specific URL (e.g. open-vsx.org) to the right pods, so-called services are defined, which select pods based on a set of custom labels such as “app=open-vsx-org, environment=production”. Each workload's set of labels needs to be distinct so that a service does not also select pods belonging to another workload. Open VSX uses Elasticsearch to index extensions and make them searchable. Elasticsearch runs in separate pods. The set of labels mentioned above was assigned both to the pods serving the website and to the Elasticsearch pods, so the service for open-vsx.org also selected the Elasticsearch pods. Despite the overlap, this setup worked for several years: before version 2.14 of the Elasticsearch ECK operator, the Elasticsearch pods did not reset connections coming from haproxy. Haproxy would instead encounter a timeout, try a different pod, and eventually succeed.
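The label overlap can be illustrated with a minimal sketch of equality-based selection: a service picks every pod whose labels contain all key/value pairs of its selector, and extra labels on a pod do not prevent a match. The pod names and the "role" label below are hypothetical and only serve the illustration; the two shared labels are the ones quoted above.

```python
def selects(selector, pod_labels):
    """Return True if every key/value pair of the selector is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Hypothetical pods; only the "app" and "environment" values come from the report.
pods = {
    "website-0": {"app": "open-vsx-org", "environment": "production"},
    "website-1": {"app": "open-vsx-org", "environment": "production"},
    # An extra label (hypothetical "role") does not stop the selector from matching.
    "elasticsearch-0": {
        "app": "open-vsx-org",
        "environment": "production",
        "role": "search",
    },
    "staging-website-0": {"app": "open-vsx-org", "environment": "staging"},
}

service_selector = {"app": "open-vsx-org", "environment": "production"}

# Pods the service would route traffic to.
endpoints = sorted(
    name for name, labels in pods.items() if selects(service_selector, labels)
)
print(endpoints)  # ['elasticsearch-0', 'website-0', 'website-1']
```

In this sketch the Elasticsearch pod ends up among the endpoints of the website service, which is the situation the paragraph above describes.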

Fault

An automatic update of the so-called “Elasticsearch ECK operator”, which provides specific resources for Elasticsearch pods, triggered a restart of all Elasticsearch pods (3 production pods, 2 staging pods). The update included a change that caused the Elasticsearch pods to reset connection requests from haproxy, which resulted in 502 errors.
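Why the same misrouting was harmless before the update and fatal after it can be shown with a toy model. This is not haproxy's actual algorithm or configuration; it only encodes the behaviour described in this report: a timeout makes the proxy try the next pod, while a connection reset is returned to the client as a 502. The pod names, the `route` function, and the 504 fallback are assumptions made for the illustration.

```python
def route(backends, start=0):
    """Toy model: try backends in order from `start` and return (status, pod).

    Each backend is a (name, behaviour) pair where behaviour is
    "ok", "timeout" or "reset". Not haproxy's real logic.
    """
    for offset in range(len(backends)):
        name, behaviour = backends[(start + offset) % len(backends)]
        if behaviour == "ok":
            return 200, name
        if behaviour == "reset":
            # A reset connection is surfaced to the client as a 502.
            return 502, name
        # behaviour == "timeout": fall through and try the next backend.
    return 504, None  # assumed outcome if every backend times out


# Before the operator update: the wrongly selected pod only times out.
before = [("elasticsearch-0", "timeout"), ("website-0", "ok")]
# After the update: the same pod resets the connection.
after = [("elasticsearch-0", "reset"), ("website-0", "ok")]

print(route(before))          # (200, 'website-0')
print(route(after))           # (502, 'elasticsearch-0')
print(route(after, start=1))  # (200, 'website-0')
```

In this model, requests that happen to reach a website pod first still succeed after the update, which is consistent with monitors recovering intermittently.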

Root causes

The website pods and the Elasticsearch pods used the same, non-distinct set of labels.

Mitigation and resolution

An initial analysis concluded that the website pods were serving content correctly when accessed directly. Further investigation at the network level followed. After several possible causes had been ruled out, the non-distinct set of labels was identified. After a distinct label was added to the website pods and to the service selector, the open-vsx.org website became reachable again, which confirmed the assumption.
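The effect of the fix can be sketched as follows. The label key and value "component=website" are hypothetical, since the report does not name the label that was added; the pod names are illustrative as well.

```python
def selects(selector, pod_labels):
    """Return True if every key/value pair of the selector is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


shared = {"app": "open-vsx-org", "environment": "production"}

pods = {
    # Website pod carries the additional, distinct label (hypothetical name).
    "website-0": {**shared, "component": "website"},
    # Elasticsearch pod still carries only the shared labels.
    "elasticsearch-0": dict(shared),
}

old_selector = dict(shared)
new_selector = {**shared, "component": "website"}


def endpoints(selector):
    """Names of all pods the given selector would match."""
    return sorted(name for name, labels in pods.items() if selects(selector, labels))


print(endpoints(old_selector))  # ['elasticsearch-0', 'website-0']
print(endpoints(new_selector))  # ['website-0']
```

The distinct label has to be present on both sides: on the pods alone it changes nothing, because the old selector still matches every pod that carries the shared labels.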

Lessons learnt

Each group of pods in a deployment requires distinct labels to avoid faulty service routing. Persistent logs are essential for analyzing such incidents. Reaching out to fellow IT colleagues for help brings in different domain-specific knowledge and increases the chance of resolving the issue faster.

Timeline

2024-08-08 16:27 (CEST) Elasticsearch Operator automatic update
2024-08-08 16:32 (CEST) open-vsx.org (production instance) becomes unreachable (betteruptime alerts start)
Monitors recover intermittently in the following hours
2024-08-08 21:11 (CEST) Denis asks for assistance in Slack chat, while analyzing the issue
Several members of the IT team try to analyze the issue
2024-08-08 23:37 (CEST) Denis and Thomas identify and fix the label/selector issue and open-vsx.org becomes reachable again
2024-08-09 14:29 (CEST) Fred commits changes that add specific labels, to prevent Elasticsearch pods from being selected to serve open-vsx.org

Updated
Aug 9, 2024 at 4:45am UTC

Query API and VSCode Assets API recovered.

Updated
Aug 9, 2024 at 4:40am UTC

Homepage and Search API recovered.

Updated
Aug 8, 2024 at 2:47pm UTC

VSCode Assets API went down.

Updated
Aug 8, 2024 at 2:47pm UTC

Query API went down.

Updated
Aug 8, 2024 at 2:42pm UTC

Search API went down.

Created
Aug 8, 2024 at 2:38pm UTC

Homepage went down.