diff --git a/doc/admin/install/kubernetes/configure.md b/doc/admin/install/kubernetes/configure.md
index b67b6716d2c..b639f8e203a 100644
--- a/doc/admin/install/kubernetes/configure.md
+++ b/doc/admin/install/kubernetes/configure.md
@@ -800,6 +800,35 @@ spec:
 value: bob
 ```
 
+## Filtering cAdvisor metrics
+
+Due to how cAdvisor works, Sourcegraph's cAdvisor deployment can pick up metrics for services unrelated to the Sourcegraph deployment running on the same nodes as Sourcegraph services.
+[Learn more](../../../dev/background-information/observability/cadvisor.md#identifying-containers).
+
+To work around this, update your `prometheus.ConfigMap.yaml` to target your [namespaced Sourcegraph deployment](#namespaced-overlay) by uncommenting the below `metric_relabel_configs` entry and updating it with the appropriate namespace.
+This will cause Prometheus to drop all metrics *from cAdvisor* that are not from services in the desired namespace.
+
+```yaml
+apiVersion: v1
+data:
+  prometheus.yml: |
+    # ...
+
+    metric_relabel_configs:
+    # cAdvisor-specific customization. Drop container metrics exported by cAdvisor
+    # not in the same namespace as Sourcegraph.
+    # Uncomment this if you have problems with certain dashboards or cAdvisor itself
+    # picking up non-Sourcegraph services. Ensure all Sourcegraph services are running
+    # within the Sourcegraph namespace you have defined.
+    # The regex must keep matches on '^$' (empty string) to ensure other metrics do not
+    # get dropped.
+    - source_labels: [container_label_io_kubernetes_pod_namespace]
+      regex: ^$|ns-sourcegraph # ensure this matches with namespace declarations
+      action: keep
+
+    # ...
+```
+
 ## Troubleshooting
 
 See the [Troubleshooting docs](troubleshoot.md).
diff --git a/doc/admin/install/kubernetes/troubleshoot.md b/doc/admin/install/kubernetes/troubleshoot.md
index 9eab1a411e7..921d92aa477 100644
--- a/doc/admin/install/kubernetes/troubleshoot.md
+++ b/doc/admin/install/kubernetes/troubleshoot.md
@@ -65,11 +65,10 @@ This indicates the instance is getting rate-limited by Docker Hub([link](https:/
 
 - [**OPTIONAL**] Upgrade your account to a Docker Pro or Team subscription ([See Docker Hub for more information](https://www.docker.com/increase-rate-limits))
 
-### Prometheus Pod is constantly down when using the namespace overlays.
-
-This is most likely due to cadvisor picking up other metrics from the cluster.
-You can confirm this theory by checking your [prometheus.ConfigMap.yaml](https://sourcegraph.com/github.com/sourcegraph/deploy-sourcegraph@3.27/-/blob/base/prometheus/prometheus.ConfigMap.yaml#L248-250) file, where the `source_labels: [container_label_io_kubernetes_pod_namespace]` fields under `metric_relabel_configs` should be commented out and the `regex` field must be updated with your namespace.
+### Irrelevant cAdvisor metrics are causing strange alerts and performance issues.
 
+This is most likely due to cAdvisor picking up other metrics from the cluster.
+A workaround is available: [Filtering cAdvisor metrics](./configure.md#filtering-cadvisor-metrics).
 
 ### I don't see any metrics on my Grafana Dashboard.
 
diff --git a/doc/dev/background-information/observability/cadvisor.md b/doc/dev/background-information/observability/cadvisor.md
index 40bdf5d284c..b861b199fbf 100644
--- a/doc/dev/background-information/observability/cadvisor.md
+++ b/doc/dev/background-information/observability/cadvisor.md
@@ -18,7 +18,8 @@ How relevant containers are identified from exported cAdvisor metrics is documen
 Because cAdvisor run on a *machine* and exports *container* metrics, standard strategies for identifying what container a metric belongs to (such as Prometheus scrape target labels) cannot be used, because all the metrics look like they belong to cAdvisor.
 
 Making things complicated is how containers are identified on various environments (namely Kubernetes and docker-compose) varies, sometimes due to characteristics of the environments and sometimes due to naming inconsistencies within Sourcegraph.
-Variations in how cAdvisor generates the `name` label it provides also makes things difficult (in some environments, it cannot generate one at all!), so we might have to create a custom naming strategy.
+Variations in how cAdvisor generates the `name` label it provides also makes things difficult (in some environments, it cannot generate one at all!).
+This means that cAdvisor can pick up non-Sourcegraph metrics, which can be problematic - see [known issues](#known-issues) for more details and current workarounds.
 
 ## Available metrics
 
@@ -28,7 +29,9 @@ In the list, the column `-disable_metrics parameter` indicates the "group" the m
 Container runtime and deployment environment compatability for various metrics seem to be grouped by these groups - before using a metric, ensure that the metric is supported in all relevant environments (for example, both Docker and `containerd` container runtimes).
 Support is generally poorly documented, but a search through the [cAdvisor repository issues](https://github.com/google/cadvisor/issues) might provide some hints.
 
-### Known issues
+## Known issues
 
-- `disk` metrics are not available in `containerd`: [cadvisor#2785](https://github.com/google/cadvisor/issues/2785)
-- `diskIO` metrics do not seem to be available in Kubernetes: [sourcegraph#12163](https://github.com/sourcegraph/sourcegraph/issues/12163)
+- cAdvisor can pick up non-Sourcegraph metrics (can cause issues with [our built-in observability](../../../admin/observability/index.md) and, in extreme cases, cause cAdvisor and Prometheus performance issues if the number of metrics is very large) due to how we currently [identify containers](#identifying-containers): [sourcegraph#17365](https://github.com/sourcegraph/sourcegraph/issues/17365) ([Kubernetes workaround](../../../admin/install/kubernetes/configure.md#filtering-cadvisor-metrics))
+- Metrics issues
+  - `disk` metrics are not available in `containerd`: [cadvisor#2785](https://github.com/google/cadvisor/issues/2785)
+  - `diskIO` metrics do not seem to be available in Kubernetes: [sourcegraph#12163](https://github.com/sourcegraph/sourcegraph/issues/12163)