monitoring: update otel dashboard metrics (#55899)
* add metrics to OTEL
* monitoring: update otel dashboard metrics
* create a separate group for Queue Length
* Update monitoring/definitions/otel_collector.go
* update auto gen doc
* update auto gen alerts doc

Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
This commit is contained in: parent edfe22dad6, commit 2892eba932
@@ -7792,6 +7792,68 @@ Generated query for warning alert: `max((sum by (exporter) (rate(otelcol_exporte

<br />

## otel-collector: otelcol_exporter_enqueue_failed_spans

<p class="subtitle">exporter enqueue failed spans</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> otel-collector: 0+ exporter enqueue failed spans for 5m0s

**Next steps**

- Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.
- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otelcol-exporter-enqueue-failed-spans).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_otel-collector_otelcol_exporter_enqueue_failed_spans"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))) > 0)`

</details>
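
If this alert fires, a quick way to check whether the retry queue is the bottleneck is to compare its current size against its capacity. This is a hedged sketch rather than part of the generated dashboard; it assumes the collector also exposes the `otelcol_exporter_queue_size` and `otelcol_exporter_queue_capacity` gauges (the same metrics charted in the Queue Length panels added by this change):

```promql
# Fraction of the exporter retry queue in use (1 = full), per exporter.
sum by (exporter) (otelcol_exporter_queue_size)
  /
sum by (exporter) (otelcol_exporter_queue_capacity)
```

A ratio near 1 alongside enqueue failures supports the advice above: lower the sending rate or scale collectors horizontally.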

<br />

## otel-collector: otelcol_processor_dropped_spans

<p class="subtitle">spans dropped per processor per minute</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> otel-collector: 0+ spans dropped per processor per minute for 5m0s

**Next steps**

- Check the configuration of the processor.
- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otelcol-processor-dropped-spans).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_otel-collector_otelcol_processor_dropped_spans"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))) > 0)`

</details>
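
To judge how significant the drops are, it can help to relate them to the processor's overall throughput. This is a hedged sketch, assuming the collector also emits the companion `otelcol_processor_accepted_spans` counter (not charted on this dashboard):

```promql
# Fraction of spans dropped per processor, relative to spans accepted.
sum by (processor) (rate(otelcol_processor_dropped_spans[5m]))
  /
sum by (processor) (rate(otelcol_processor_accepted_spans[5m]))
```

A small but non-zero ratio may point at an overloaded processor (for example a memory limiter under pressure), while a ratio near 1 more likely indicates a misconfiguration.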

<br />

## otel-collector: container_cpu_usage

<p class="subtitle">container cpu usage total (1m average) across all cores by instance</p>

doc/admin/observability/dashboards.md (generated, 104 changes)
@@ -30164,6 +30164,94 @@ Query: `sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))`

<br />

### OpenTelemetry Collector: Queue Length

#### otel-collector: otelcol_exporter_queue_capacity

<p class="subtitle">Exporter queue capacity</p>

Shows the capacity of the retry queue (in batches).

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (exporter) (rate(otelcol_exporter_queue_capacity{job=~"^.*"}[1m]))`

</details>

<br />

#### otel-collector: otelcol_exporter_queue_size

<p class="subtitle">Exporter queue size</p>

Shows the current size of the retry queue.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (exporter) (rate(otelcol_exporter_queue_size{job=~"^.*"}[1m]))`

</details>
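
Note that the panel queries above take `rate()` over the queue metrics; to read the instantaneous queue depth instead (these metrics are typically exported as gauges), a raw query along the following lines can be used. This is a hedged sketch, not one of the generated panel queries:

```promql
# Current retry-queue depth per exporter, sampled directly rather than as a rate.
sum by (exporter) (otelcol_exporter_queue_size)
```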

<br />

#### otel-collector: otelcol_exporter_enqueue_failed_spans

<p class="subtitle">Exporter enqueue failed spans</p>

Shows the rate of spans that failed to be enqueued by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration.

Refer to the [alerts reference](./alerts.md#otel-collector-otelcol-exporter-enqueue-failed-spans) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~"^.*"}[1m]))`

</details>
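
Enqueue failures happen before a span ever reaches the exporter's send path, so they are separate from the send failures charted elsewhere on this dashboard (`otelcol_exporter_send_failed_spans`, visible in the hunk context above). A hedged sketch for estimating total spans lost per exporter, assuming both counters are present:

```promql
# Approximate total spans lost per exporter: failed to enqueue plus failed to send.
  sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans[5m]))
+ sum by (exporter) (rate(otelcol_exporter_send_failed_spans[5m]))
```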

<br />

### OpenTelemetry Collector: Processors

#### otel-collector: otelcol_processor_dropped_spans

<p class="subtitle">Spans dropped per processor per minute</p>

Shows the rate of spans dropped by the configured processor.

Refer to the [alerts reference](./alerts.md#otel-collector-otelcol-processor-dropped-spans) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))`

</details>

<br />

### OpenTelemetry Collector: Collector resource usage

#### otel-collector: otel_cpu_usage

@@ -30174,7 +30262,7 @@ Shows CPU usage as reported by the OpenTelemetry collector.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30195,7 +30283,7 @@ Shows the allocated memory Resident Set Size (RSS) as reported by the OpenTeleme

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100401` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30222,7 +30310,7 @@ For more information on configuring processors for the OpenTelemetry collector s

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100402` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30253,7 +30341,7 @@ value change independent of deployment events (such as an upgrade), it could ind

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100500` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30272,7 +30360,7 @@ Query: `count by(name) ((time() - container_last_seen{name=~"^otel-collector.*"}

Refer to the [alerts reference](./alerts.md#otel-collector-container-cpu-usage) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100301` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100501` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30291,7 +30379,7 @@ Query: `cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}

Refer to the [alerts reference](./alerts.md#otel-collector-container-memory-usage) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100302` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100502` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30313,7 +30401,7 @@ When extremely high, this can indicate a resource usage problem, or can cause pr

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100303` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100503` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

@@ -30334,7 +30422,7 @@ Query: `sum by(name) (rate(container_fs_reads_total{name=~"^otel-collector.*"}[1

Refer to the [alerts reference](./alerts.md#otel-collector-pods-available-percentage) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.
To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100600` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>
monitoring/definitions/otel_collector.go

@@ -98,6 +98,61 @@ func OtelCollector() *monitoring.Dashboard {
				},
			},
		},
		{
			Title:  "Queue Length",
			Hidden: false,
			Rows: []monitoring.Row{
				{
					{
						Name:           "otelcol_exporter_queue_capacity",
						Description:    "exporter queue capacity",
						Panel:          monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
						Owner:          monitoring.ObservableOwnerDevOps,
						Query:          "sum by (exporter) (rate(otelcol_exporter_queue_capacity{job=~\"^.*\"}[1m]))",
						NoAlert:        true,
						Interpretation: `Shows the capacity of the retry queue (in batches).`,
					},
					{
						Name:           "otelcol_exporter_queue_size",
						Description:    "exporter queue size",
						Panel:          monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
						Owner:          monitoring.ObservableOwnerDevOps,
						Query:          "sum by (exporter) (rate(otelcol_exporter_queue_size{job=~\"^.*\"}[1m]))",
						NoAlert:        true,
						Interpretation: `Shows the current size of the retry queue.`,
					},
					{
						Name:           "otelcol_exporter_enqueue_failed_spans",
						Description:    "exporter enqueue failed spans",
						Panel:          monitoring.Panel().LegendFormat("exporter: {{exporter}}"),
						Owner:          monitoring.ObservableOwnerDevOps,
						Query:          "sum by (exporter) (rate(otelcol_exporter_enqueue_failed_spans{job=~\"^.*\"}[1m]))",
						Warning:        monitoring.Alert().Greater(0).For(5 * time.Minute),
						NextSteps:      "Check the configuration of the exporter and whether the service being exported to is up. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale collectors.",
						Interpretation: `Shows the rate of spans that failed to be enqueued by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration.`,
					},
				},
			},
		},
		{
			Title:  "Processors",
			Hidden: false,
			Rows: []monitoring.Row{
				{
					{
						Name:           "otelcol_processor_dropped_spans",
						Description:    "spans dropped per processor per minute",
						Panel:          monitoring.Panel().Unit(monitoring.Number).LegendFormat("processor: {{processor}}"),
						Owner:          monitoring.ObservableOwnerDevOps,
						Query:          "sum by (processor) (rate(otelcol_processor_dropped_spans[1m]))",
						Warning:        monitoring.Alert().Greater(0).For(5 * time.Minute),
						NextSteps:      "Check the configuration of the processor.",
						Interpretation: `Shows the rate of spans dropped by the configured processor.`,
					},
				},
			},
		},
		{
			Title:  "Collector resource usage",
			Hidden: false,