diff --git a/doc/admin/observability/alerts.md b/doc/admin/observability/alerts.md
index 25b9c5262ab..2b44cfbcc04 100644
--- a/doc/admin/observability/alerts.md
+++ b/doc/admin/observability/alerts.md
@@ -7792,6 +7792,68 @@ Generated query for warning alert: `max((sum by (exporter) (rate(otelcol_exporte
+## otel-collector: otelcol_exporter_enqueue_failed_spans
+
+<p class="subtitle">exporter enqueue failed spans</p>
+
+**Descriptions**
+
+- warning otel-collector: 0+ exporter enqueue failed spans for 5m0s
+
+**Next steps**
+
+- Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otelcol-exporter-enqueue-failed-spans).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "warning_otel-collector_otelcol_exporter_enqueue_failed_spans"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+<details>
+<summary>Technical details</summary>
+
+Generated query for warning alert: `max((sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))) > 0)`
+
+</details>
+
+<br />
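The first next step above is to work out which exporter is failing and whether its target service is reachable. Below is a minimal sketch for running the generated query directly against Prometheus, assuming the `github.com/prometheus/client_golang` API client and a reachable Prometheus address (both are assumptions; adjust for your deployment):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// The Prometheus address is an assumption; point this at the Prometheus
	// instance that scrapes your otel-collector (e.g. via a port-forward).
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// The same expression the warning alert is built from, grouped by exporter.
	query := `sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	// Each returned sample is labelled with the exporter that is failing to
	// enqueue spans, which tells you which exporter configuration to inspect.
	fmt.Println(result)
}
```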
+## otel-collector: otelcol_processor_dropped_spans
+
+<p class="subtitle">spans dropped per processor per minute</p>
+
+**Descriptions**
+
+- warning otel-collector: 0+ spans dropped per processor per minute for 5m0s
+
+**Next steps**
+
+- Check the configuration of the processor
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otelcol-processor-dropped-spans).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+  "warning_otel-collector_otelcol_processor_dropped_spans"
+]
+```
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+<details>
+<summary>Technical details</summary>
+
+Generated query for warning alert: `max((sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))) > 0)`
+
+</details>
+
+<br />
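To see which processor is dropping spans before changing its configuration, one option is to read the collector's self-telemetry endpoint, which serves metrics in Prometheus text format. A sketch, assuming the commonly used `:8888` metrics address (an assumption; the host and port depend on your deployment and the collector's `service.telemetry` settings):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Hostname and port are assumptions; use the address your collector
	// exposes its own metrics on.
	resp, err := http.Get("http://otel-collector:8888/metrics")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the dropped-span series so the offending processor label is visible.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "otelcol_processor_dropped_spans") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		panic(err)
	}
}
```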
+
 
 ## otel-collector: container_cpu_usage
 
 <p class="subtitle">container cpu usage total (1m average) across all cores by instance</p>
 
diff --git a/doc/admin/observability/dashboards.md b/doc/admin/observability/dashboards.md
index 16d8e3f7f16..b93a8443e3e 100644
--- a/doc/admin/observability/dashboards.md
+++ b/doc/admin/observability/dashboards.md
@@ -30164,6 +30164,94 @@ Query: `sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))`
+### OpenTelemetry Collector: Queue Length
+
+#### otel-collector: otelcol_exporter_queue_capacity
+
+<p class="subtitle">Exporter queue capacity</p>
+
+Shows the capacity of the retry queue (in batches).
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+<details>
+<summary>Technical details</summary>
+
+Query: `sum by (exporter) (rate(otelcol_exporter_queue_capacity{job=~"^.*"}[1m]))`
+
+</details>
+
+<br />
+#### otel-collector: otelcol_exporter_queue_size
+
+<p class="subtitle">Exporter queue size</p>
+
+Shows the current size of the retry queue.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+<details>
+<summary>Technical details</summary>
+
+Query: `sum by (exporter) (rate(otelcol_exporter_queue_size{job=~"^.*"}[1m]))`
+
+</details>
+
+<br />
+#### otel-collector: otelcol_exporter_enqueue_failed_spans
+
+<p class="subtitle">Exporter enqueue failed spans</p>
+
+Shows the rate of spans that failed to be enqueued by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration.
+
+Refer to the [alerts reference](./alerts.md#otel-collector-otelcol-exporter-enqueue-failed-spans) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+<details>
+<summary>Technical details</summary>
+
+Query: `sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))`
+
+</details>
+
+<br />
+### OpenTelemetry Collector: Processors
+
+#### otel-collector: otelcol_processor_dropped_spans
+
+<p class="subtitle">Spans dropped per processor per minute</p>
+
+Shows the rate of spans dropped by the configured processor.
+
+Refer to the [alerts reference](./alerts.md#otel-collector-otelcol-processor-dropped-spans) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+<details>
+<summary>Technical details</summary>
+
+Query: `sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))`
+
+</details>
+
+<br />
+
 
 ### OpenTelemetry Collector: Collector resource usage
 
 #### otel-collector: otel_cpu_usage
 
@@ -30174,7 +30262,7 @@ Shows CPU usage as reported by the OpenTelemetry collector.
 
 This panel has no related alerts.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30195,7 +30283,7 @@ Shows the allocated memory Resident Set Size (RSS) as reported by the OpenTeleme
 
 This panel has no related alerts.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100401` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30222,7 +30310,7 @@ For more information on configuring processors for the OpenTelemetry collector s
 
 This panel has no related alerts.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100402` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30253,7 +30341,7 @@ value change independent of deployment events (such as an upgrade), it could ind
 
 This panel has no related alerts.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100500` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30272,7 +30360,7 @@ Query: `count by(name) ((time() - container_last_seen{name=~"^otel-collector.*"}
 
 Refer to the [alerts reference](./alerts.md#otel-collector-container-cpu-usage) for 1 alert related to this panel.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100301` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100501` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30291,7 +30379,7 @@ Query: `cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}
 
 Refer to the [alerts reference](./alerts.md#otel-collector-container-memory-usage) for 1 alert related to this panel.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100302` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100502` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30313,7 +30401,7 @@ When extremely high, this can indicate a resource usage problem, or can cause pr
 
 This panel has no related alerts.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100303` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100503` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
@@ -30334,7 +30422,7 @@ Query: `sum by(name) (rate(container_fs_reads_total{name=~"^otel-collector.*"}[1
 
 Refer to the [alerts reference](./alerts.md#otel-collector-pods-available-percentage) for 1 alert related to this panel.
 
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100600` on your Sourcegraph instance.
 
 *Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
 
diff --git a/monitoring/definitions/otel_collector.go b/monitoring/definitions/otel_collector.go
index 511c733560a..5054078b964 100644
--- a/monitoring/definitions/otel_collector.go
+++ b/monitoring/definitions/otel_collector.go
@@ -98,6 +98,61 @@ func OtelCollector() *monitoring.Dashboard {
 				},
 			},
 		},
+		{
+			Title:  "Queue Length",
+			Hidden: false,
+			Rows: []monitoring.Row{
+				{
+					{
+						Name:           "otelcol_exporter_queue_capacity",
+						Description:    "exporter queue capacity",
+						Panel:          monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
+						Owner:          monitoring.ObservableOwnerDevOps,
+						Query:          "sum by (exporter) (rate(otelcol_exporter_queue_capacity{job=~\"^.*\"}[1m]))",
+						NoAlert:        true,
+						Interpretation: `Shows the capacity of the retry queue (in batches).`,
+					},
+					{
+						Name:           "otelcol_exporter_queue_size",
+						Description:    "exporter queue size",
+						Panel:          monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
+						Owner:          monitoring.ObservableOwnerDevOps,
+						Query:          "sum by (exporter) (rate(otelcol_exporter_queue_size{job=~\"^.*\"}[1m]))",
+						NoAlert:        true,
+						Interpretation: `Shows the current size of the retry queue.`,
+					},
+					{
+						Name:           "otelcol_exporter_enqueue_failed_spans",
+						Description:    "exporter enqueue failed spans",
+						Panel:          monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
+						Owner:          monitoring.ObservableOwnerDevOps,
+						Query:          "sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~\"^.*\"}[1m]))",
+						Warning:        monitoring.Alert().Greater(0).For(5 * time.Minute),
+						NextSteps:      "Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.",
+						Interpretation: `Shows the rate of spans that failed to be enqueued by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration.`,
+					},
+				},
+			},
+		},
+		{
+			Title:  "Processors",
+			Hidden: false,
+			Rows: []monitoring.Row{
+				{
+					{
+						Name:           "otelcol_processor_dropped_spans",
+						Description:    "spans dropped per processor per minute",
+						Panel:          monitoring.Panel().Unit(monitoring.Number).LegendFormat("processor: {{processor}}"),
+						Owner:          monitoring.ObservableOwnerDevOps,
+						Query:          "sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))",
+						Warning:        monitoring.Alert().Greater(0).For(5 * time.Minute),
+						NextSteps:      "Check the configuration of the processor",
+						Interpretation: `Shows the rate of spans dropped by the configured processor.`,
+					},
+				},
+			},
+		},
 		{
 			Title:  "Collector resource usage",
 			Hidden: false,
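The queue-capacity and queue-size panels added above are most useful read together: the closer the size gets to the capacity, the sooner enqueue failures appear. As an illustration only, not part of this change, a derived observable along these lines could be added to the same "Queue Length" row in `otel_collector.go`; it assumes the row element type is `monitoring.Observable` and that the two series behave as plain gauges in your collector version:

```go
// Illustrative sketch, not part of the change above. It reuses the Observable
// fields introduced by the diff; verify that otelcol_exporter_queue_size and
// otelcol_exporter_queue_capacity are exposed as gauges before adopting it.
var exporterQueueUtilization = monitoring.Observable{
	Name:        "otelcol_exporter_queue_utilization",
	Description: "exporter queue utilization",
	Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("exporter: {{exporter}}"),
	Owner:       monitoring.ObservableOwnerDevOps,
	// Ratio of current queue size to its capacity, per exporter.
	Query:          "sum by (exporter) (otelcol_exporter_queue_size) / sum by (exporter) (otelcol_exporter_queue_capacity)",
	NoAlert:        true,
	Interpretation: `Approximate fullness of each exporter sending queue; values approaching 1 precede enqueue failures.`,
}
```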