diff --git a/dev/prometheus/all/prometheus_targets.yml b/dev/prometheus/all/prometheus_targets.yml
index 2c9115b600b..ffc19646647 100644
--- a/dev/prometheus/all/prometheus_targets.yml
+++ b/dev/prometheus/all/prometheus_targets.yml
@@ -63,3 +63,8 @@
targets:
# github proxy
- host.docker.internal:6090
+- labels:
+ job: otel-collector
+ targets:
+ # opentelemetry collector
+ - host.docker.internal:8888
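Port 8888 here is assumed to be the OpenTelemetry collector's own internal telemetry endpoint, which is what this new Prometheus target scrapes. A minimal sketch of the corresponding collector-side setting (illustrative only, not part of this change):

```yaml
# Example values — key names per the upstream collector docs, not Sourcegraph's shipped config.
service:
  telemetry:
    metrics:
      # Expose the collector's internal metrics so Prometheus can scrape them.
      level: basic
      address: "0.0.0.0:8888"
```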
diff --git a/dev/prometheus/linux/prometheus_targets.yml b/dev/prometheus/linux/prometheus_targets.yml
index ca635c0ca3f..8f94e01133c 100644
--- a/dev/prometheus/linux/prometheus_targets.yml
+++ b/dev/prometheus/linux/prometheus_targets.yml
@@ -63,3 +63,8 @@
targets:
# github proxy
- 127.0.0.1:6090
+- labels:
+ job: otel-collector
+ targets:
+ # opentelemetry collector
+ - 127.0.0.1:8888
diff --git a/doc/admin/observability/alerts.md b/doc/admin/observability/alerts.md
index df74195a91d..4c61c4420d8 100644
--- a/doc/admin/observability/alerts.md
+++ b/doc/admin/observability/alerts.md
@@ -7851,3 +7851,161 @@ Generated query for warning alert: `max((rate(src_telemetry_job_total{op="SendEv
+## otel-collector: otel_span_refused
+
+spans refused per receiver
+
+**Descriptions**
+
+- warning otel-collector: 1+ spans refused per receiver for 5m0s
+
+**Next steps**
+
+- Check the logs of the collector and the configuration of the receiver.
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otel-span-refused).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "warning_otel-collector_otel_span_refused"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+## otel-collector: otel_span_export_failures
+
+span export failures by exporter
+
+**Descriptions**
+
+- warning otel-collector: 1+ span export failures by exporter for 5m0s
+
+**Next steps**
+
+- Check the configuration of the exporter and whether the service being exported to is up.
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otel-span-export-failures).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "warning_otel-collector_otel_span_export_failures"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+## otel-collector: container_cpu_usage
+
+container cpu usage total (1m average) across all cores by instance
+
+**Descriptions**
+
+- warning otel-collector: 99%+ container cpu usage total (1m average) across all cores by instance
+
+**Next steps**
+
+- **Kubernetes:** Consider increasing CPU limits in the relevant `Deployment.yaml`.
+- **Docker Compose:** Consider increasing `cpus:` of the otel-collector container in `docker-compose.yml`.
+- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#otel-collector-container-cpu-usage).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "warning_otel-collector_container_cpu_usage"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+## otel-collector: container_memory_usage
+
+container memory usage by instance
+
+**Descriptions**
+
+- warning otel-collector: 99%+ container memory usage by instance
+
+**Next steps**
+
+- **Kubernetes:** Consider increasing the memory limit in the relevant `Deployment.yaml`.
+- **Docker Compose:** Consider increasing `memory:` of the otel-collector container in `docker-compose.yml`.
+- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#otel-collector-container-memory-usage).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "warning_otel-collector_container_memory_usage"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
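+
+For the CPU and memory alerts above, a hypothetical `resources` stanza for the otel-collector `Deployment.yaml` might look like the following (example values only — size to your workload):
+
+```yaml
+# Example values, not a recommendation.
+resources:
+  requests:
+    cpu: "1"
+    memory: 1Gi
+  limits:
+    cpu: "2"
+    memory: 2Gi
+```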
+
+## otel-collector: pods_available_percentage
+
+percentage pods available
+
+**Descriptions**
+
+- critical otel-collector: less than 90% percentage pods available for 10m0s
+
+**Next steps**
+
+- Determine if the pod was OOM killed using `kubectl describe pod otel-collector` (look for `OOMKilled: true`) and, if so, consider increasing the memory limit in the relevant `Deployment.yaml`.
+- Check the logs before the container restarted to see if there are `panic:` messages or similar using `kubectl logs -p otel-collector`.
+- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#otel-collector-pods-available-percentage).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "critical_otel-collector_pods_available_percentage"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
diff --git a/doc/admin/observability/dashboards.md b/doc/admin/observability/dashboards.md
--- a/doc/admin/observability/dashboards.md
+++ b/doc/admin/observability/dashboards.md
+## OpenTelemetry Collector
+
+The OpenTelemetry collector ingests OpenTelemetry data from Sourcegraph and exports it to the configured backends.
+
+To see this dashboard, visit `/-/debug/grafana/d/otel-collector/otel-collector` on your Sourcegraph instance.
+
+### OpenTelemetry Collector: Receivers
+
+#### otel-collector: otel_span_receive_rate
+
+Spans received per receiver per minute
+
+ Shows the rate of spans accepted by the configured receiver.
+
+ A Trace is a collection of spans, and a span represents a unit of work or operation. Spans are the building blocks of Traces.
+ The spans have only been accepted by the receiver, which means they still have to move through the configured pipeline to be exported.
+ For more information on tracing and the configuration of an OpenTelemetry receiver, see https://opentelemetry.io/docs/collector/configuration/#receivers.
+
+ See the Exporters section for spans that have made it through the pipeline and have been exported.
+
+ Depending on the configured processors, received spans might be dropped and not exported. For more information on configuring processors, see
+ https://opentelemetry.io/docs/collector/configuration/#processors.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100000` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
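+
+For illustration, a minimal OTLP receiver configuration might look like the following (example values, not necessarily the configuration shipped with Sourcegraph):
+
+```yaml
+# Example values only.
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: "0.0.0.0:4317"
+      http:
+        endpoint: "0.0.0.0:4318"
+```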
+
+#### otel-collector: otel_span_refused
+
+Spans refused per receiver
+
+ Shows the number of spans that have been refused by a receiver.
+
+ A Trace is a collection of spans. A span represents a unit of work or operation. Spans are the building blocks of Traces.
+
+ Spans can be rejected either due to a misconfigured receiver or because spans were received in the wrong format. The collector's log has more information on why a span was rejected.
+ For more information on tracing and the configuration of an OpenTelemetry receiver, see https://opentelemetry.io/docs/collector/configuration/#receivers.
+
+Refer to the [alerts reference](./alerts.md#otel-collector-otel-span-refused) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100001` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+### OpenTelemetry Collector: Exporters
+
+Spans exported per exporter per minute
+
+ Shows the rate of spans being sent by the exporter.
+
+ A Trace is a collection of spans. A span represents a unit of work or operation. Spans are the building blocks of Traces.
+ The rate of spans here indicates spans that have made it through the configured pipeline and have been sent to the configured export destination.
+
+ For more information on configuring an exporter for the OpenTelemetry collector, see https://opentelemetry.io/docs/collector/configuration/#exporters.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100100` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
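+
+For illustration, a minimal exporter and trace pipeline might look like the following (the `jaeger:4317` endpoint is a placeholder for whichever backend you export to):
+
+```yaml
+# Example values only.
+exporters:
+  otlp:
+    endpoint: "jaeger:4317"
+    tls:
+      insecure: true
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      exporters: [otlp]
+```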
+
+#### otel-collector: otel_span_export_failures
+
+Span export failures by exporter
+
+ Shows the rate of spans that failed to be sent by the configured exporter. A number higher than 0 for an extended period can indicate a problem with the exporter configuration, or with the service that is being exported to.
+
+ For more information on configuring an exporter for the OpenTelemetry collector, see https://opentelemetry.io/docs/collector/configuration/#exporters.
+
+Refer to the [alerts reference](./alerts.md#otel-collector-otel-span-export-failures) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100101` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
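+
+Export failures are typically handled by the exporter's retry and queue settings. A hypothetical sketch (example values only):
+
+```yaml
+# Example values only — the endpoint is a placeholder backend.
+exporters:
+  otlp:
+    endpoint: "jaeger:4317"
+    retry_on_failure:
+      enabled: true
+      initial_interval: 5s
+      max_elapsed_time: 300s
+    sending_queue:
+      enabled: true
+      queue_size: 5000
+```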
+
+CPU usage of the collector
+
+ Shows the CPU usage of the OpenTelemetry collector.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+Memory allocated to the otel collector
+
+ Shows the memory Resident Set Size (RSS) allocated to the OpenTelemetry collector.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+Memory used by the collector
+
+ Shows how much memory is being used by the otel collector.
+
+ * High memory usage might indicate that the configured pipeline is keeping a lot of spans in memory for processing
+ * Spans failing to be sent while the exporter is configured to retry
+ * A high batch count when using a batch processor
+
+ For more information on configuring processors for the OpenTelemetry collector, see https://opentelemetry.io/docs/collector/configuration/#processors.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
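+
+To bound memory usage, a collector is commonly configured with a `memory_limiter` processor ahead of the `batch` processor. A hypothetical sketch (example values only):
+
+```yaml
+# Example values only.
+processors:
+  memory_limiter:
+    check_interval: 1s
+    limit_mib: 1500
+  batch:
+    send_batch_size: 512
+    timeout: 5s
+
+service:
+  pipelines:
+    traces:
+      receivers: [otlp]
+      processors: [memory_limiter, batch]
+      exporters: [otlp]
+```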
+
+Container missing
+
+This value is the number of times a container has not been seen for more than one minute. If you observe this
+value change independent of deployment events (such as an upgrade), it could indicate pods are being OOM killed or terminated for some other reason.
+
+- **Kubernetes:**
+  - Determine if the pod was OOM killed using `kubectl describe pod otel-collector` (look for `OOMKilled: true`) and, if so, consider increasing the memory limit in the relevant `Deployment.yaml`.
+  - Check the logs before the container restarted to see if there are `panic:` messages or similar using `kubectl logs -p otel-collector`.
+- **Docker Compose:**
+  - Determine if the container was OOM killed using `docker inspect -f '{{json .State}}' otel-collector` (look for `"OOMKilled":true`) and, if so, consider increasing the memory limit of the otel-collector container in `docker-compose.yml`.
+  - Check the logs before the container restarted to see if there are `panic:` messages or similar using `docker logs otel-collector` (note this will include logs from the previous and currently running container).
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+#### otel-collector: container_cpu_usage
+
+Container cpu usage total (1m average) across all cores by instance
+
+Refer to the [alerts reference](./alerts.md#otel-collector-container-cpu-usage) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100301` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+#### otel-collector: container_memory_usage
+
+Container memory usage by instance
+
+Refer to the [alerts reference](./alerts.md#otel-collector-container-memory-usage) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100302` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+Filesystem reads and writes rate by instance over 1h
+
+This value indicates the number of filesystem read and write operations by containers of this service.
+When extremely high, this can indicate a resource usage problem, or can cause problems with the service itself, especially if high values or spikes correlate with otel-collector issues.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100303` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+#### otel-collector: pods_available_percentage
+
+Percentage pods available
+
+Refer to the [alerts reference](./alerts.md#otel-collector-pods-available-percentage) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+