diff --git a/doc/admin/observability/alerts.md b/doc/admin/observability/alerts.md
index 25b9c5262ab..2b44cfbcc04 100644
--- a/doc/admin/observability/alerts.md
+++ b/doc/admin/observability/alerts.md
@@ -7792,6 +7792,68 @@ Generated query for warning alert: `max((sum by (exporter) (rate(otelcol_exporte
+## otel-collector: otelcol_exporter_enqueue_failed_spans
+
+
+exporter enqueue failed spans
+
+**Descriptions**
+
+- warning otel-collector: 0+ exporter enqueue failed spans for 5m0s
+
+**Next steps**
+
+- Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors. An example query for narrowing down the affected exporter is shown below.
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otelcol-exporter-enqueue-failed-spans).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+ "warning_otel-collector_otelcol_exporter_enqueue_failed_spans"
+]
+```
+
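+As a starting point for triage (an illustrative query, not one of the generated panels), the following breaks the failures down by exporter over a longer window:
+
+```promql
+# Spans that failed to be enqueued, per exporter, over the last hour.
+sum by (exporter) (increase(otelcol_exporter_enqueue_failed_spans[1h]))
+```
+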
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+
+Technical details
+
+Generated query for warning alert: `max((sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))) > 0)`
+
+
+
+
+
+## otel-collector: otelcol_processor_dropped_spans
+
+spans dropped per processor per minute
+
+**Descriptions**
+
+- warning otel-collector: 0+ spans dropped per processor per minute for 5m0s
+
+**Next steps**
+
+- Check the configuration of the processor. An example query for identifying the affected processor is shown below.
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otelcol-processor-dropped-spans).
+- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
+
+```json
+"observability.silenceAlerts": [
+ "warning_otel-collector_otelcol_processor_dropped_spans"
+]
+```
+
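+To narrow down which processor is dropping spans, an illustrative query (not one of the generated panels) is:
+
+```promql
+# Spans dropped, per processor, over the last hour.
+sum by (processor) (increase(otelcol_processor_dropped_spans[1h]))
+```
+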
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+
+Technical details
+
+Generated query for warning alert: `max((sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))) > 0)`
+
+
+
+
+
## otel-collector: container_cpu_usage
container cpu usage total (1m average) across all cores by instance
diff --git a/doc/admin/observability/dashboards.md b/doc/admin/observability/dashboards.md
index 16d8e3f7f16..b93a8443e3e 100644
--- a/doc/admin/observability/dashboards.md
+++ b/doc/admin/observability/dashboards.md
@@ -30164,6 +30164,94 @@ Query: `sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))`
+### OpenTelemetry Collector: Queue Length
+
+#### otel-collector: otelcol_exporter_queue_capacity
+
+Exporter queue capacity
+
+Shows the capacity of the retry queue (in batches).
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+
+Technical details
+
+Query: `sum by (exporter) (otelcol_exporter_queue_capacity{job=~"^.*"})`
+
+
+
+
+
+#### otel-collector: otelcol_exporter_queue_size
+
+Exporter queue size
+
+Shows the current size of the retry queue.
+
+This panel has no related alerts.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+
+Technical details
+
+Query: `sum by (exporter) (otelcol_exporter_queue_size{job=~"^.*"})`
+
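+Comparing the queue size against its capacity gives a rough utilization figure. A sketch of such a query (not one of the generated panels):
+
+```promql
+# Fraction of the sending queue currently in use, per exporter.
+sum by (exporter) (otelcol_exporter_queue_size)
+  / sum by (exporter) (otelcol_exporter_queue_capacity)
+```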
+
+
+
+
+#### otel-collector: otelcol_exporter_enqueue_failed_spans
+
+Exporter enqueue failed spans
+
+Shows the rate of spans that failed to be enqueued by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration.
+
+Refer to the [alerts reference](./alerts.md#otel-collector-otelcol-exporter-enqueue-failed-spans) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+
+Technical details
+
+Query: `sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))`
+
+
+
+
+
+### OpenTelemetry Collector: Processors
+
+#### otel-collector: otelcol_processor_dropped_spans
+
+Spans dropped per processor per minute
+
+Shows the rate of spans dropped by the configured processor.
+
+Refer to the [alerts reference](./alerts.md#otel-collector-otelcol-processor-dropped-spans) for 1 alert related to this panel.
+
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.
+
+*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
+
+
+Technical details
+
+Query: `sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))`
+
+
+
+
+
### OpenTelemetry Collector: Collector resource usage
#### otel-collector: otel_cpu_usage
@@ -30174,7 +30262,7 @@ Shows CPU usage as reported by the OpenTelemetry collector.
This panel has no related alerts.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30195,7 +30283,7 @@ Shows the allocated memory Resident Set Size (RSS) as reported by the OpenTeleme
This panel has no related alerts.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100401` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30222,7 +30310,7 @@ For more information on configuring processors for the OpenTelemetry collector s
This panel has no related alerts.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100402` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30253,7 +30341,7 @@ value change independent of deployment events (such as an upgrade), it could ind
This panel has no related alerts.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100500` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30272,7 +30360,7 @@ Query: `count by(name) ((time() - container_last_seen{name=~"^otel-collector.*"}
Refer to the [alerts reference](./alerts.md#otel-collector-container-cpu-usage) for 1 alert related to this panel.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100301` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100501` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30291,7 +30379,7 @@ Query: `cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}
Refer to the [alerts reference](./alerts.md#otel-collector-container-memory-usage) for 1 alert related to this panel.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100302` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100502` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30313,7 +30401,7 @@ When extremely high, this can indicate a resource usage problem, or can cause pr
This panel has no related alerts.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100303` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100503` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
@@ -30334,7 +30422,7 @@ Query: `sum by(name) (rate(container_fs_reads_total{name=~"^otel-collector.*"}[1
Refer to the [alerts reference](./alerts.md#otel-collector-pods-available-percentage) for 1 alert related to this panel.
-To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.
+To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100600` on your Sourcegraph instance.
*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*
diff --git a/monitoring/definitions/otel_collector.go b/monitoring/definitions/otel_collector.go
index 511c733560a..5054078b964 100644
--- a/monitoring/definitions/otel_collector.go
+++ b/monitoring/definitions/otel_collector.go
@@ -98,6 +98,61 @@ func OtelCollector() *monitoring.Dashboard {
},
},
},
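+			// Sending-queue health and enqueue failures for each configured exporter.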
+ {
+ Title: "Queue Length",
+ Hidden: false,
+ Rows: []monitoring.Row{
+ {
+ {
+ Name: "otelcol_exporter_queue_capacity",
+ Description: "exporter queue capacity",
+ Panel: monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
+ Owner: monitoring.ObservableOwnerDevOps,
+							Query:          "sum by (exporter) (otelcol_exporter_queue_capacity{job=~\"^.*\"})",
+ NoAlert: true,
+							Interpretation: `Shows the capacity of the retry queue (in batches).`,
+ },
+ {
+ Name: "otelcol_exporter_queue_size",
+ Description: "exporter queue size",
+ Panel: monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
+ Owner: monitoring.ObservableOwnerDevOps,
+							Query:          "sum by (exporter) (otelcol_exporter_queue_size{job=~\"^.*\"})",
+ NoAlert: true,
+							Interpretation: `Shows the current size of the retry queue.`,
+ },
+ {
+ Name: "otelcol_exporter_enqueue_failed_spans",
+ Description: "exporter enqueue failed spans",
+ Panel: monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
+ Owner: monitoring.ObservableOwnerDevOps,
+ Query: "sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~\"^.*\"}[1m]))",
+ Warning: monitoring.Alert().Greater(0).For(5 * time.Minute),
+							NextSteps:      "Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.",
+							Interpretation: `Shows the rate of spans that failed to be enqueued by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration.`,
+ },
+ },
+ },
+ },
+ {
+ Title: "Processors",
+ Hidden: false,
+ Rows: []monitoring.Row{
+ {
+ {
+ Name: "otelcol_processor_dropped_spans",
+ Description: "spans dropped per processor per minute",
+ Panel: monitoring.Panel().Unit(monitoring.Number).LegendFormat("processor: {{processor}}"),
+ Owner: monitoring.ObservableOwnerDevOps,
+ Query: "sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))",
+ Warning: monitoring.Alert().Greater(0).For(5 * time.Minute),
+							NextSteps:      "Check the configuration of the processor.",
+							Interpretation: `Shows the rate of spans dropped by the configured processor.`,
+ },
+ },
+ },
+ },
+
{
Title: "Collector resource usage",
Hidden: false,