otel: add collector dashboard (#45009)
* add initial dashboard for otel
* add failed sent dashboard
* extra panels
* use sum and rate for resource queries
* review comments
* add warning alerts
* Update monitoring/definitions/otel_collector.go
* review comments
* run go generate
* Update monitoring/definitions/otel_collector.go (6 suggested changes, co-authored by Robert Lin)
* review comments
* review feedback; also drop two panels
* remove brackets in metrics
* update docs
* fix goimport
* go generate

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
Parent: 61251ab989
Commit: 6c7389f37c
Two Prometheus scrape-target files gain an `otel-collector` job pointing at the collector's self-monitoring metrics endpoint on port 8888, alongside the existing github-proxy target:

@@ -63,3 +63,8 @@
```yaml
  targets:
    # github proxy
    - host.docker.internal:6090
- labels:
    job: otel-collector
  targets:
    # opentelemetry collector
    - host.docker.internal:8888
```
@@ -63,3 +63,8 @@
```yaml
  targets:
    # github proxy
    - 127.0.0.1:6090
- labels:
    job: otel-collector
  targets:
    # opentelemetry collector
    - host.docker.internal:8888
```
@@ -7851,3 +7851,161 @@ Generated query for warning alert: `max((rate(src_telemetry_job_total{op="SendEv

<br />

## otel-collector: otel_span_refused

<p class="subtitle">spans refused per receiver</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> otel-collector: 1+ spans refused per receiver for 5m0s

**Next steps**

- Check logs of the collector and configuration of the receiver
- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otel-span-refused).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_otel-collector_otel_span_refused"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum by(receiver) (rate(otelcol_receiver_refused_spans[1m]))) > 1)`

</details>

<br />

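For context, this warning comes from the observable added in `monitoring/definitions/otel_collector.go` later in this commit. A trimmed sketch of that definition (assuming `monitoring.Observable` is the row element type, which the diff itself does not name) shows how the `Warning` builder maps to the generated query and the 5m0s condition above:

```go
package definitions

import (
	"time"

	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)

// Trimmed sketch of the otel_span_refused observable from this commit.
// Alert().Greater(1).For(5*time.Minute) is what yields the generated
// `max((sum by(receiver) (...)) > 1)` query that fires after 5m0s.
var otelSpanRefused = monitoring.Observable{
	Name:        "otel_span_refused",
	Description: "spans refused per receiver",
	Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("receiver: {{receiver}}"),
	Owner:       monitoring.ObservableOwnerDevOps,
	Query:       "sum by (receiver) (rate(otelcol_receiver_refused_spans[1m]))",
	Warning:     monitoring.Alert().Greater(1).For(5 * time.Minute),
	NextSteps:   "Check logs of the collector and configuration of the receiver",
}
```
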
## otel-collector: otel_span_export_failures

<p class="subtitle">span export failures by exporter</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> otel-collector: 1+ span export failures by exporter for 5m0s

**Next steps**

- Check the configuration of the exporter and if the service being exported is up
- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#otel-collector-otel-span-export-failures).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_otel-collector_otel_span_export_failures"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum by(exporter) (rate(otelcol_exporter_send_failed_spans[1m]))) > 1)`

</details>

<br />

## otel-collector: container_cpu_usage

<p class="subtitle">container cpu usage total (1m average) across all cores by instance</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> otel-collector: 99%+ container cpu usage total (1m average) across all cores by instance

**Next steps**

- **Kubernetes:** Consider increasing CPU limits in the relevant `Deployment.yaml`.
- **Docker Compose:** Consider increasing `cpus:` of the otel-collector container in `docker-compose.yml`.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#otel-collector-container-cpu-usage).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_otel-collector_container_cpu_usage"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}) >= 99)`

</details>

<br />

## otel-collector: container_memory_usage

<p class="subtitle">container memory usage by instance</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> otel-collector: 99%+ container memory usage by instance

**Next steps**

- **Kubernetes:** Consider increasing the memory limit in the relevant `Deployment.yaml`.
- **Docker Compose:** Consider increasing `memory:` of the otel-collector container in `docker-compose.yml`.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#otel-collector-container-memory-usage).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_otel-collector_container_memory_usage"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((cadvisor_container_memory_usage_percentage_total{name=~"^otel-collector.*"}) >= 99)`

</details>

<br />

## otel-collector: pods_available_percentage

<p class="subtitle">percentage pods available</p>

**Descriptions**

- <span class="badge badge-critical">critical</span> otel-collector: less than 90% percentage pods available for 10m0s

**Next steps**

- Determine if the pod was OOM killed using `kubectl describe pod otel-collector` (look for `OOMKilled: true`) and, if so, consider increasing the memory limit in the relevant `Deployment.yaml`.
- Check the logs before the container restarted to see if there are `panic:` messages or similar using `kubectl logs -p otel-collector`.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#otel-collector-pods-available-percentage).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "critical_otel-collector_pods_available_percentage"
]
```

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Generated query for critical alert: `min((sum by(app) (up{app=~".*otel-collector"}) / count by(app) (up{app=~".*otel-collector"}) * 100) <= 90)`

</details>

<br />

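The container and Kubernetes alerts above (`container_cpu_usage`, `container_memory_usage`, `pods_available_percentage`) are not hand-written for this service; they come from the shared monitoring groups wired into the new dashboard. The relevant lines from `monitoring/definitions/otel_collector.go` in this commit, shown as a trimmed excerpt:

```go
// Trimmed excerpt: the shared groups below generate the container and
// Kubernetes panels/alerts documented above, parameterized by container name.
Groups: []monitoring.Group{
	// ... collector-specific groups elided ...
	shared.NewContainerMonitoringGroup("otel-collector", monitoring.ObservableOwnerDevOps, nil),
	shared.NewKubernetesMonitoringGroup("otel-collector", monitoring.ObservableOwnerDevOps, nil),
},
```
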
doc/admin/observability/dashboards.md (generated, +298)
@@ -24485,3 +24485,301 @@ Query: `rate(src_telemetry_job_total{op="SendEvents"}[1h]) / on() group_right()

<br />

## OpenTelemetry Collector

<p class="subtitle">The OpenTelemetry collector ingests OpenTelemetry data from Sourcegraph and exports it to the configured backends.</p>

To see this dashboard, visit `/-/debug/grafana/d/otel-collector/otel-collector` on your Sourcegraph instance.

### OpenTelemetry Collector: Receivers

#### otel-collector: otel_span_receive_rate

<p class="subtitle">Spans received per receiver per minute</p>

Shows the rate of spans accepted by the configured receiver.

A Trace is a collection of spans and a span represents a unit of work or operation. Spans are the building blocks of Traces.
The spans have only been accepted by the receiver, which means they still have to move through the configured pipeline to be exported.
For more information on tracing and configuration of an OpenTelemetry receiver see https://opentelemetry.io/docs/collector/configuration/#receivers.

See the Exporters section to see spans that have made it through the pipeline and have been exported.

Depending on the configured processors, received spans might be dropped and not exported. For more information on configuring processors see
https://opentelemetry.io/docs/collector/configuration/#processors.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100000` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (receiver) (rate(otelcol_receiver_accepted_spans[1m]))`

</details>

<br />

#### otel-collector: otel_span_refused

<p class="subtitle">Spans refused per receiver</p>

Shows the number of spans that have been refused by a receiver.

A Trace is a collection of spans. A Span represents a unit of work or operation. Spans are the building blocks of Traces.

Spans can be rejected either due to a misconfigured receiver or due to receiving spans in the wrong format. The logs of the collector will have more information on why a span was rejected.
For more information on tracing and configuration of an OpenTelemetry receiver see https://opentelemetry.io/docs/collector/configuration/#receivers.

Refer to the [alerts reference](./alerts.md#otel-collector-otel-span-refused) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100001` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (receiver) (rate(otelcol_receiver_refused_spans[1m]))`

</details>

<br />

### OpenTelemetry Collector: Exporters

#### otel-collector: otel_span_export_rate

<p class="subtitle">Spans exported per exporter per minute</p>

Shows the rate of spans being sent by the exporter.

A Trace is a collection of spans. A Span represents a unit of work or operation. Spans are the building blocks of Traces.
The rate of spans here indicates spans that have made it through the configured pipeline and have been sent to the configured export destination.

For more information on configuring an exporter for the OpenTelemetry collector see https://opentelemetry.io/docs/collector/configuration/#exporters.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100100` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (exporter) (rate(otelcol_exporter_sent_spans[1m]))`

</details>

<br />

#### otel-collector: otel_span_export_failures

<p class="subtitle">Span export failures by exporter</p>

Shows the rate of spans that failed to be sent by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration or with the service that is being exported to.

For more information on configuring an exporter for the OpenTelemetry collector see https://opentelemetry.io/docs/collector/configuration/#exporters.

Refer to the [alerts reference](./alerts.md#otel-collector-otel-span-export-failures) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100101` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))`

</details>

<br />

### OpenTelemetry Collector: Collector resource usage

#### otel-collector: otel_cpu_usage

<p class="subtitle">Cpu usage of the collector</p>

Shows the CPU usage of the OpenTelemetry collector.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100200` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (job) (rate(otelcol_process_cpu_seconds{job=~"^.*"}[1m]))`

</details>

<br />

#### otel-collector: otel_memory_resident_set_size

<p class="subtitle">Memory allocated to the otel collector</p>

Shows the memory Resident Set Size (RSS) allocated to the OpenTelemetry collector.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100201` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (job) (rate(otelcol_process_memory_rss{job=~"^.*"}[1m]))`

</details>

<br />

#### otel-collector: otel_memory_usage

<p class="subtitle">Memory used by the collector</p>

Shows how much memory is being used by the otel collector.

* High memory usage might indicate that the configured pipeline is keeping a lot of spans in memory for processing
* Spans failing to be sent while the exporter is configured to retry
* A high batch count when using a batch processor

For more information on configuring processors for the OpenTelemetry collector see https://opentelemetry.io/docs/collector/configuration/#processors.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100202` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (job) (rate(otelcol_process_runtime_total_alloc_bytes{job=~"^.*"}[1m]))`

</details>

<br />

### OpenTelemetry Collector: Container monitoring (not available on server)

#### otel-collector: container_missing

<p class="subtitle">Container missing</p>

This value is the number of times a container has not been seen for more than one minute. If you observe this value change independent of deployment events (such as an upgrade), it could indicate pods are being OOM killed or terminated for some other reason.

- **Kubernetes:**
  - Determine if the pod was OOM killed using `kubectl describe pod otel-collector` (look for `OOMKilled: true`) and, if so, consider increasing the memory limit in the relevant `Deployment.yaml`.
  - Check the logs before the container restarted to see if there are `panic:` messages or similar using `kubectl logs -p otel-collector`.
- **Docker Compose:**
  - Determine if the pod was OOM killed using `docker inspect -f '{{json .State}}' otel-collector` (look for `"OOMKilled":true`) and, if so, consider increasing the memory limit of the otel-collector container in `docker-compose.yml`.
  - Check the logs before the container restarted to see if there are `panic:` messages or similar using `docker logs otel-collector` (note this will include logs from the previous and currently running container).

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100300` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `count by(name) ((time() - container_last_seen{name=~"^otel-collector.*"}) > 60)`

</details>

<br />

#### otel-collector: container_cpu_usage

<p class="subtitle">Container cpu usage total (1m average) across all cores by instance</p>

Refer to the [alerts reference](./alerts.md#otel-collector-container-cpu-usage) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100301` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `cadvisor_container_cpu_usage_percentage_total{name=~"^otel-collector.*"}`

</details>

<br />

#### otel-collector: container_memory_usage

<p class="subtitle">Container memory usage by instance</p>

Refer to the [alerts reference](./alerts.md#otel-collector-container-memory-usage) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100302` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `cadvisor_container_memory_usage_percentage_total{name=~"^otel-collector.*"}`

</details>

<br />

#### otel-collector: fs_io_operations

<p class="subtitle">Filesystem reads and writes rate by instance over 1h</p>

This value indicates the number of filesystem read and write operations by containers of this service.
When extremely high, this can indicate a resource usage problem, or can cause problems with the service itself, especially if high values or spikes correlate with otel-collector issues.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100303` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by(name) (rate(container_fs_reads_total{name=~"^otel-collector.*"}[1h]) + rate(container_fs_writes_total{name=~"^otel-collector.*"}[1h]))`

</details>

<br />

### OpenTelemetry Collector: Kubernetes monitoring (only available on Kubernetes)

#### otel-collector: pods_available_percentage

<p class="subtitle">Percentage pods available</p>

Refer to the [alerts reference](./alerts.md#otel-collector-pods-available-percentage) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/otel-collector/otel-collector?viewPanel=100400` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Cloud DevOps team](https://handbook.sourcegraph.com/departments/engineering/teams/devops).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by(app) (up{app=~".*otel-collector"}) / count by (app) (up{app=~".*otel-collector"}) * 100`

</details>

<br />

The collector configuration exposes the collector's own telemetry metrics on `:8888`, which is the endpoint the new Prometheus scrape targets point at:

@@ -24,6 +24,9 @@ extensions:
```yaml
    endpoint: ":55679"

service:
  telemetry:
    metrics:
      address: ":8888"
  extensions: [health_check,zpages]
  pipelines:
    traces:
```
@@ -31,6 +31,7 @@ func Default() Dashboards {
```go
		CodeIntelRanking(),
		CodeIntelUploads(),
		Telemetry(),
		OtelCollector(),
	}
}
```
monitoring/definitions/otel_collector.go (new file, 145 lines)
@@ -0,0 +1,145 @@
```go
package definitions

import (
	"time"

	"github.com/sourcegraph/sourcegraph/monitoring/definitions/shared"
	"github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)

func OtelCollector() *monitoring.Dashboard {
	containerName := "otel-collector"

	return &monitoring.Dashboard{
		Name:        containerName,
		Title:       "OpenTelemetry Collector",
		Description: "The OpenTelemetry collector ingests OpenTelemetry data from Sourcegraph and exports it to the configured backends.",
		Groups: []monitoring.Group{
			{
				Title:  "Receivers",
				Hidden: false,
				Rows: []monitoring.Row{
					{
						{
							Name:        "otel_span_receive_rate",
							Description: "spans received per receiver per minute",
							Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("receiver: {{receiver}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (receiver) (rate(otelcol_receiver_accepted_spans[1m]))",
							NoAlert:     true,
							Interpretation: `
Shows the rate of spans accepted by the configured receiver.

A Trace is a collection of spans and a span represents a unit of work or operation. Spans are the building blocks of Traces.
The spans have only been accepted by the receiver, which means they still have to move through the configured pipeline to be exported.
For more information on tracing and configuration of an OpenTelemetry receiver see https://opentelemetry.io/docs/collector/configuration/#receivers.

See the Exporters section to see spans that have made it through the pipeline and have been exported.

Depending on the configured processors, received spans might be dropped and not exported. For more information on configuring processors see
https://opentelemetry.io/docs/collector/configuration/#processors.`,
						},
						{
							Name:        "otel_span_refused",
							Description: "spans refused per receiver",
							Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("receiver: {{receiver}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (receiver) (rate(otelcol_receiver_refused_spans[1m]))",
							Warning:     monitoring.Alert().Greater(1).For(5 * time.Minute),
							NextSteps:   "Check logs of the collector and configuration of the receiver",
							Interpretation: `
Shows the number of spans that have been refused by a receiver.

A Trace is a collection of spans. A Span represents a unit of work or operation. Spans are the building blocks of Traces.

Spans can be rejected either due to a misconfigured receiver or due to receiving spans in the wrong format. The logs of the collector will have more information on why a span was rejected.
For more information on tracing and configuration of an OpenTelemetry receiver see https://opentelemetry.io/docs/collector/configuration/#receivers.`,
						},
					},
				},
			},
			{
				Title:  "Exporters",
				Hidden: false,
				Rows: []monitoring.Row{
					{
						{
							Name:        "otel_span_export_rate",
							Description: "spans exported per exporter per minute",
							Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("exporter: {{exporter}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (exporter) (rate(otelcol_exporter_sent_spans[1m]))",
							NoAlert:     true,
							Interpretation: `
Shows the rate of spans being sent by the exporter.

A Trace is a collection of spans. A Span represents a unit of work or operation. Spans are the building blocks of Traces.
The rate of spans here indicates spans that have made it through the configured pipeline and have been sent to the configured export destination.

For more information on configuring an exporter for the OpenTelemetry collector see https://opentelemetry.io/docs/collector/configuration/#exporters.`,
						},
						{
							Name:        "otel_span_export_failures",
							Description: "span export failures by exporter",
							Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("exporter: {{exporter}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (exporter) (rate(otelcol_exporter_send_failed_spans[1m]))",
							Warning:     monitoring.Alert().Greater(1).For(5 * time.Minute),
							NextSteps:   "Check the configuration of the exporter and if the service being exported is up",
							Interpretation: `
Shows the rate of spans that failed to be sent by the configured exporter. A number higher than 0 for a long period can indicate a problem with the exporter configuration or with the service that is being exported to.

For more information on configuring an exporter for the OpenTelemetry collector see https://opentelemetry.io/docs/collector/configuration/#exporters.`,
						},
					},
				},
			},
			{
				Title:  "Collector resource usage",
				Hidden: false,
				Rows: []monitoring.Row{
					{
						{
							Name:        "otel_cpu_usage",
							Description: "cpu usage of the collector",
							Panel:       monitoring.Panel().Unit(monitoring.Seconds).LegendFormat("{{job}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (job) (rate(otelcol_process_cpu_seconds{job=~\"^.*\"}[1m]))",
							NoAlert:     true,
							Interpretation: `
Shows the CPU usage of the OpenTelemetry collector.`,
						},
						{
							Name:        "otel_memory_resident_set_size",
							Description: "memory allocated to the otel collector",
							Panel:       monitoring.Panel().Unit(monitoring.Bytes).LegendFormat("{{job}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (job) (rate(otelcol_process_memory_rss{job=~\"^.*\"}[1m]))",
							NoAlert:     true,
							Interpretation: `
Shows the memory Resident Set Size (RSS) allocated to the OpenTelemetry collector.`,
						},
						{
							Name:        "otel_memory_usage",
							Description: "memory used by the collector",
							Panel:       monitoring.Panel().Unit(monitoring.Bytes).LegendFormat("{{job}}"),
							Owner:       monitoring.ObservableOwnerDevOps,
							Query:       "sum by (job) (rate(otelcol_process_runtime_total_alloc_bytes{job=~\"^.*\"}[1m]))",
							NoAlert:     true,
							Interpretation: `
Shows how much memory is being used by the otel collector.

* High memory usage might indicate that the configured pipeline is keeping a lot of spans in memory for processing
* Spans failing to be sent while the exporter is configured to retry
* A high batch count when using a batch processor

For more information on configuring processors for the OpenTelemetry collector see https://opentelemetry.io/docs/collector/configuration/#processors.`,
						},
					},
				},
			},
			shared.NewContainerMonitoringGroup("otel-collector", monitoring.ObservableOwnerDevOps, nil),
			shared.NewKubernetesMonitoringGroup("otel-collector", monitoring.ObservableOwnerDevOps, nil),
		},
	}
}
```
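As an illustration of how this dashboard could be extended (hypothetical, not part of this commit), an additional observable could be appended to the Exporters row using the same builders. The metric name `otelcol_exporter_queue_size`, the `monitoring.Observable` type name, and the field values below are assumptions made for the sake of the sketch:

```go
// Hypothetical sketch only, not part of this commit: the shape of an extra
// observable that could be appended to the "Exporters" row above. It assumes
// monitoring.Observable is the row element type and that the collector
// exposes an otelcol_exporter_queue_size metric.
var exporterQueueSize = monitoring.Observable{
	Name:        "otel_exporter_queue_size",
	Description: "exporter sending queue size",
	Panel:       monitoring.Panel().Unit(monitoring.Number).LegendFormat("exporter: {{exporter}}"),
	Owner:       monitoring.ObservableOwnerDevOps,
	Query:       "sum by (exporter) (otelcol_exporter_queue_size)",
	NoAlert:     true,
	Interpretation: `
Size of the in-memory sending queue per exporter; a steadily growing queue
suggests the export destination cannot keep up with the span volume.`,
}
```
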
In the local development run command for the collector, the metrics port is published as well:

@@ -752,6 +752,7 @@ commands:
```
      docker container rm otel-collector
      docker run --rm --name=otel-collector $DOCKER_NET $DOCKER_ARGS \
        -p 4317:4317 -p 4318:4318 -p 55679:55679 -p 55670:55670 \
        -p 8888:8888 \
        -e JAEGER_HOST=$JAEGER_HOST \
        -e HONEYCOMB_API_KEY=$HONEYCOMB_API_KEY \
        -e HONEYCOMB_DATASET=$HONEYCOMB_DATASET \
```