Downtime

Homepage, Query API, and 2 other services are down

Aug 08 at 10:38am EDT
Affected services
Homepage
Query API
Search API
VSCode Assets API

Resolved
Aug 12 at 09:24am EDT

Detection

On August 8, 2024 at 16:32 CEST, our betteruptime.com monitors reported that open-vsx.org was no longer reachable.

Lead up

In Kubernetes-based cluster environments, services such as websites run in isolated container environments called pods. To route incoming traffic for a specific URL (e.g. open-vsx.org) to the right pod, so-called services are defined that select pods based on a set of custom labels such as “app=open-vsx-org, environment=production”. Such label sets need to be distinct so that a service does not also select unrelated pods.

Open-vsx uses Elasticsearch to index extensions and make them searchable. Elasticsearch runs in separate pods. The above-mentioned set of labels was assigned both to the pods serving the website and to the Elasticsearch pods. Despite the overlap, this worked for several years: before version 2.14 of the Elasticsearch ECK operator, the Elasticsearch pods did not reset connections coming from HAProxy. HAProxy would encounter a timeout, try a different pod, and eventually succeed.
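The overlap described above can be illustrated with a minimal sketch of equality-based label matching. This is a simplified model, not the actual cluster configuration: the pod names are hypothetical, and the labels are the example values quoted in the text.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Equality-based matching: every key/value pair in the selector
    must be present in the pod's labels. Extra pod labels are ignored."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def select_backends(selector: dict[str, str], pods: dict[str, dict[str, str]]) -> list[str]:
    """Return the sorted names of all pods the selector would route traffic to."""
    return sorted(name for name, labels in pods.items() if selector_matches(selector, labels))


# Hypothetical pod names; all pods carry the same label set, as in the incident.
pods = {
    "website-0": {"app": "open-vsx-org", "environment": "production"},
    "website-1": {"app": "open-vsx-org", "environment": "production"},
    "elasticsearch-0": {"app": "open-vsx-org", "environment": "production"},
}
service_selector = {"app": "open-vsx-org", "environment": "production"}

backends = select_backends(service_selector, pods)
print(backends)  # the Elasticsearch pod is selected alongside the website pods
```

Because the selector only checks the labels it names, any pod carrying the same pairs becomes a backend for the website service, whatever it actually runs.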

Fault

An automatic update of the so-called “Elasticsearch ECK operator”, which provides specific resources for Elasticsearch pods, triggered a restart of all Elasticsearch pods (3 production pods, 2 staging pods). The update included a change that caused the Elasticsearch pods to reset connection requests from HAProxy, which resulted in 502 errors.

Root causes

Use of a non-distinct set of labels for the website and the Elasticsearch pods.

Mitigation and resolution

An initial analysis concluded that the website pods were serving content correctly when accessed directly. Further investigation at the network level followed. After several possible causes were ruled out, the non-distinct set of labels was identified. Once a distinct label was added to the website pods and to the service selector, the open-vsx.org website became reachable again, which confirmed the assumption.
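The effect of the fix can be sketched in the same simplified model of label matching. The label key "component" and its value "website" are assumptions chosen for illustration; the report does not name the label that was actually added.

```python
def selector_matches(selector: dict[str, str], pod_labels: dict[str, str]) -> bool:
    """Equality-based matching: every selector pair must appear in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Shared label set from the report.
shared = {"app": "open-vsx-org", "environment": "production"}

# Hypothetical distinct label added to the website pods only.
website_labels = {**shared, "component": "website"}
elasticsearch_labels = dict(shared)

old_selector = dict(shared)
new_selector = {**shared, "component": "website"}

# Before the fix the service also selected Elasticsearch pods.
old_selects_es = selector_matches(old_selector, elasticsearch_labels)
# After the fix only pods with the extra label are selected.
new_selects_es = selector_matches(new_selector, elasticsearch_labels)
new_selects_web = selector_matches(new_selector, website_labels)

print(old_selects_es, new_selects_es, new_selects_web)
```

Both the pod labels and the service selector have to change together: a selector that requires a label the website pods do not yet carry would select no pods at all.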

Lessons learnt

Each deployment's pods need a distinct set of labels so that service selectors do not route traffic to the wrong pods. Persistent logs are essential for analyzing such incidents. Reaching out to fellow IT colleagues for help brings in different domain-specific knowledge and increases the chance of resolving the issue faster.

Timeline

2024-08-08 16:27 (CEST) Elasticsearch Operator automatic update
2024-08-08 16:32 (CEST) open-vsx.org (production instance) becomes unreachable (betteruptime alerts start)
Monitors recover intermittently in the following hours
2024-08-08 21:11 (CEST) Denis asks for assistance in the Slack chat while analyzing the issue
Several members of the IT team try to analyze the issue
2024-08-08 23:37 (CEST) Denis and Thomas identify and fix the label/selector issue and open-vsx.org becomes reachable again
2024-08-09 14:29 (CEST) Fred commits changes that add specific labels, to prevent Elasticsearch pods from being selected to serve open-vsx.org

Updated
Aug 09 at 12:45am EDT

Query API and VSCode Assets API recovered.

Updated
Aug 09 at 12:40am EDT

Homepage and Search API recovered.

Updated
Aug 08 at 10:47am EDT

VSCode Assets API went down.

Updated
Aug 08 at 10:47am EDT

Query API went down.

Updated
Aug 08 at 10:42am EDT

Search API went down.

Created
Aug 08 at 10:38am EDT

Homepage went down.