# monitoring: add telemetrygatewayexporter panels, improve metrics (#57171)

Part of https://github.com/sourcegraph/sourcegraph/issues/56970: this adds dashboards for the export side of things and improves the existing metrics. Only warning-level alerts are included.

## Test plan

Had to test locally only because I ended up changing the metrics a bit, but validated that the queue size metric works in S2.

Testing locally:

```yaml
# sg.config.overwrite.yaml
env:
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "30s"
  TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
  TELEMETRY_GATEWAY_EXPORTER_QUEUE_CLEANUP_INTERVAL: "10m"
```

```
sg start
sg start monitoring
```

Do lots of searches to generate events. Note that the `telemetry-export` feature flag must be enabled.

The data is not realistic because of the aggressive intervals configured for testing, but it shows that things work.
Parent: 876f67c040
Commit: 96f2d595e0
@@ -7480,6 +7480,100 @@ Generated query for warning alert: `max((rate(src_telemetry_job_total{op="SendEv

<br />

## telemetry: telemetrygatewayexporter_exporter_errors_total

<p class="subtitle">events exporter operation errors every 30m</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> telemetry: 0+ events exporter operation errors every 30m

**Next steps**

- See worker logs in the `worker.telemetrygateway-exporter` log scope for more details.
If logs only indicate that exports failed, reach out to Sourcegraph with relevant log entries, as this may be an issue in Sourcegraph's Telemetry Gateway service.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#telemetry-telemetrygatewayexporter-exporter-errors-total).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_exporter_errors_total"
]
```

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum(increase(src_telemetrygatewayexporter_exporter_errors_total{job=~"^worker.*"}[30m]))) > 0)`

</details>

<br />

## telemetry: telemetrygatewayexporter_queue_cleanup_errors_total

<p class="subtitle">export queue cleanup operation errors every 30m</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> telemetry: 0+ export queue cleanup operation errors every 30m

**Next steps**

- See worker logs in the `worker.telemetrygateway-exporter` log scope for more details.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#telemetry-telemetrygatewayexporter-queue-cleanup-errors-total).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_queue_cleanup_errors_total"
]
```

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum(increase(src_telemetrygatewayexporter_queue_cleanup_errors_total{job=~"^worker.*"}[30m]))) > 0)`

</details>

<br />

## telemetry: telemetrygatewayexporter_queue_metrics_reporter_errors_total

<p class="subtitle">export backlog metrics reporting operation errors every 30m</p>

**Descriptions**

- <span class="badge badge-warning">warning</span> telemetry: 0+ export backlog metrics reporting operation errors every 30m

**Next steps**

- See worker logs in the `worker.telemetrygateway-exporter` log scope for more details.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#telemetry-telemetrygatewayexporter-queue-metrics-reporter-errors-total).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_telemetry_telemetrygatewayexporter_queue_metrics_reporter_errors_total"
]
```

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Generated query for warning alert: `max((sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_errors_total{job=~"^worker.*"}[30m]))) > 0)`

</details>

<br />

## otel-collector: otel_span_refused

<p class="subtitle">spans refused per receiver</p>
**doc/admin/observability/dashboards.md** (generated, 299 lines)

@@ -30806,6 +30806,305 @@ Query: `rate(src_telemetry_job_total{op="SendEvents"}[1h]) / on() group_right()

<br />

### Telemetry: Telemetry Gateway Exporter: Export and queue metrics

#### telemetry: telemetry_gateway_exporter_queue_size

<p class="subtitle">Telemetry event payloads pending export</p>

The number of events queued to be exported.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100300` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(src_telemetrygatewayexporter_queue_size)`

</details>

<br />

#### telemetry: src_telemetrygatewayexporter_exported_events

<p class="subtitle">Events exported from queue per hour</p>

The number of events being exported.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100301` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `max(increase(src_telemetrygatewayexporter_exported_events[1h]))`

</details>

<br />

#### telemetry: telemetry_gateway_exporter_batch_size

<p class="subtitle">Number of events exported per batch over 30m</p>

The number of events exported in each batch.

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100302` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (le) (rate(src_telemetrygatewayexporter_batch_size_bucket[30m]))`

</details>

<br />

### Telemetry: Telemetry Gateway Exporter: Export job operations

#### telemetry: telemetrygatewayexporter_exporter_total

<p class="subtitle">Events exporter operations every 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100400` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_exporter_total{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_exporter_99th_percentile_duration

<p class="subtitle">Aggregate successful events exporter operation duration distribution over 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100401` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (le)(rate(src_telemetrygatewayexporter_exporter_duration_seconds_bucket{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_exporter_errors_total

<p class="subtitle">Events exporter operation errors every 30m</p>

Refer to the [alerts reference](./alerts.md#telemetry-telemetrygatewayexporter-exporter-errors-total) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100402` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_exporter_errors_total{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_exporter_error_rate

<p class="subtitle">Events exporter operation error rate over 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100403` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_exporter_errors_total{job=~"^worker.*"}[30m])) / (sum(increase(src_telemetrygatewayexporter_exporter_total{job=~"^worker.*"}[30m])) + sum(increase(src_telemetrygatewayexporter_exporter_errors_total{job=~"^worker.*"}[30m]))) * 100`

</details>

<br />

### Telemetry: Telemetry Gateway Exporter: Export queue cleanup job operations

#### telemetry: telemetrygatewayexporter_queue_cleanup_total

<p class="subtitle">Export queue cleanup operations every 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100500` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_queue_cleanup_total{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_queue_cleanup_99th_percentile_duration

<p class="subtitle">Aggregate successful export queue cleanup operation duration distribution over 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100501` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (le)(rate(src_telemetrygatewayexporter_queue_cleanup_duration_seconds_bucket{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_queue_cleanup_errors_total

<p class="subtitle">Export queue cleanup operation errors every 30m</p>

Refer to the [alerts reference](./alerts.md#telemetry-telemetrygatewayexporter-queue-cleanup-errors-total) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100502` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_queue_cleanup_errors_total{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_queue_cleanup_error_rate

<p class="subtitle">Export queue cleanup operation error rate over 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100503` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_queue_cleanup_errors_total{job=~"^worker.*"}[30m])) / (sum(increase(src_telemetrygatewayexporter_queue_cleanup_total{job=~"^worker.*"}[30m])) + sum(increase(src_telemetrygatewayexporter_queue_cleanup_errors_total{job=~"^worker.*"}[30m]))) * 100`

</details>

<br />

### Telemetry: Telemetry Gateway Exporter: Export queue metrics reporting job operations

#### telemetry: telemetrygatewayexporter_queue_metrics_reporter_total

<p class="subtitle">Export backlog metrics reporting operations every 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100600` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_total{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_queue_metrics_reporter_99th_percentile_duration

<p class="subtitle">Aggregate successful export backlog metrics reporting operation duration distribution over 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100601` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum by (le)(rate(src_telemetrygatewayexporter_queue_metrics_reporter_duration_seconds_bucket{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_queue_metrics_reporter_errors_total

<p class="subtitle">Export backlog metrics reporting operation errors every 30m</p>

Refer to the [alerts reference](./alerts.md#telemetry-telemetrygatewayexporter-queue-metrics-reporter-errors-total) for 1 alert related to this panel.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100602` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_errors_total{job=~"^worker.*"}[30m]))`

</details>

<br />

#### telemetry: telemetrygatewayexporter_queue_metrics_reporter_error_rate

<p class="subtitle">Export backlog metrics reporting operation error rate over 30m</p>

This panel has no related alerts.

To see this panel, visit `/-/debug/grafana/d/telemetry/telemetry?viewPanel=100603` on your Sourcegraph instance.

<sub>*Managed by the [Sourcegraph Data & Analytics team](https://handbook.sourcegraph.com/departments/engineering/teams/data-analytics).*</sub>

<details>
<summary>Technical details</summary>

Query: `sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_errors_total{job=~"^worker.*"}[30m])) / (sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_total{job=~"^worker.*"}[30m])) + sum(increase(src_telemetrygatewayexporter_queue_metrics_reporter_errors_total{job=~"^worker.*"}[30m]))) * 100`

</details>

<br />

## OpenTelemetry Collector

<p class="subtitle">The OpenTelemetry collector ingests OpenTelemetry data from Sourcegraph and exports it to the configured backends.</p>
```diff
@@ -3,9 +3,9 @@ load("@io_bazel_rules_go//go:def.bzl", "go_library")
go_library(
    name = "telemetrygatewayexporter",
    srcs = [
        "backlog_metrics.go",
        "exporter.go",
        "queue_cleanup.go",
        "queue_metrics.go",
        "telemetrygatewayexporter.go",
    ],
    importpath = "github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/telemetrygatewayexporter",
```
```diff
@@ -43,13 +43,14 @@ func newExporterJob(

        batchSizeHistogram: promauto.NewHistogram(prometheus.HistogramOpts{
            Namespace: "src",
            Subsystem: "telemetrygatewayexport",
            Subsystem: "telemetrygatewayexporter",
            Name: "batch_size",
            Help: "Size of event batches exported from the queue.",
            Buckets: prometheus.ExponentialBucketsRange(1, float64(cfg.MaxExportBatchSize), 10),
        }),
        exportedEventsCounter: promauto.NewCounter(prometheus.CounterOpts{
            Namespace: "src",
            Subsystem: "telemetrygatewayexport",
            Subsystem: "telemetrygatewayexporter",
            Name: "exported_events",
            Help: "Number of events exported from the queue.",
        }),
@@ -61,7 +62,7 @@ func newExporterJob(
        goroutine.WithDescription("telemetrygatewayexporter events export job"),
        goroutine.WithInterval(cfg.ExportInterval),
        goroutine.WithOperation(obctx.Operation(observation.Op{
            Name: "TelemetryGateway.Export",
            Name: "TelemetryGatewayExporter.Export",
            Metrics: metrics.NewREDMetrics(prometheus.DefaultRegisterer, "telemetrygatewayexporter_exporter"),
        })),
    )
```
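The `Namespace`/`Subsystem`/`Name` fields above are what produce the fully qualified metric names used by the new dashboards and alerts, e.g. `src_telemetrygatewayexporter_batch_size`. The standalone sketch below is not part of this commit; it only illustrates how client_golang combines those fields and what bucket boundaries `ExponentialBucketsRange` yields, using 5000 as a stand-in for `cfg.MaxExportBatchSize` (the documented default).

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

func main() {
	reg := prometheus.NewRegistry()

	// Mirrors the renamed exporter histogram: the full metric name becomes
	// src_telemetrygatewayexporter_batch_size (namespace_subsystem_name).
	batchSize := promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
		Namespace: "src",
		Subsystem: "telemetrygatewayexporter",
		Name:      "batch_size",
		Help:      "Size of event batches exported from the queue.",
		// 5000 stands in for cfg.MaxExportBatchSize; assumption for illustration only.
		Buckets: prometheus.ExponentialBucketsRange(1, 5000, 10),
	})
	batchSize.Observe(250)

	// Print the 10 exponentially spaced bucket upper bounds between 1 and 5000.
	fmt.Println(prometheus.ExponentialBucketsRange(1, 5000, 10))
}
```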
```diff
@@ -9,6 +9,8 @@ import (

    "github.com/sourcegraph/sourcegraph/internal/database"
    "github.com/sourcegraph/sourcegraph/internal/goroutine"
    "github.com/sourcegraph/sourcegraph/internal/metrics"
    "github.com/sourcegraph/sourcegraph/internal/observation"
    "github.com/sourcegraph/sourcegraph/lib/errors"
)

@@ -17,17 +19,17 @@ type queueCleanupJob struct {
    retentionWindow time.Duration

    prunedHistogram prometheus.Histogram
    prunedCounter prometheus.Counter
}

func newQueueCleanupJob(store database.TelemetryEventsExportQueueStore, cfg config) goroutine.BackgroundRoutine {
func newQueueCleanupJob(obctx *observation.Context, store database.TelemetryEventsExportQueueStore, cfg config) goroutine.BackgroundRoutine {
    job := &queueCleanupJob{
        store: store,
        prunedHistogram: promauto.NewHistogram(prometheus.HistogramOpts{
        prunedCounter: promauto.NewCounter(prometheus.CounterOpts{
            Namespace: "src",
            Subsystem: "telemetrygatewayexport",
            Name: "pruned",
            Help: "Size of exported events pruned from the queue table.",
            Subsystem: "telemetrygatewayexporter",
            Name: "events_pruned",
            Help: "Events pruned from the queue table.",
        }),
    }
    return goroutine.NewPeriodicGoroutine(
@@ -36,6 +38,10 @@ func newQueueCleanupJob(store database.TelemetryEventsExportQueueStore, cfg conf
        goroutine.WithName("telemetrygatewayexporter.queue_cleanup"),
        goroutine.WithDescription("telemetrygatewayexporter queue cleanup"),
        goroutine.WithInterval(cfg.QueueCleanupInterval),
        goroutine.WithOperation(obctx.Operation(observation.Op{
            Name: "TelemetryGatewayExporter.QueueCleanup",
            Metrics: metrics.NewREDMetrics(prometheus.DefaultRegisterer, "telemetrygatewayexporter_queue_cleanup"),
        })),
    )
}

@@ -44,7 +50,7 @@ func (j *queueCleanupJob) Handle(ctx context.Context) error {
    if err != nil {
        return errors.Wrap(err, "store.DeletedExported")
    }
    j.prunedHistogram.Observe(float64(count))
    j.prunedCounter.Add(float64(count))

    return nil
}
```
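With the pruned-events metric now a plain counter rather than a histogram, each cleanup pass simply adds the number of deleted rows, which pairs naturally with `increase()` in the generated queries. A minimal sketch (assumed names, not the repository's test code) of how that behaviour can be checked with client_golang's testutil helpers:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	reg := prometheus.NewRegistry()

	// Same shape as the renamed cleanup counter:
	// src_telemetrygatewayexporter_events_pruned.
	prunedCounter := promauto.With(reg).NewCounter(prometheus.CounterOpts{
		Namespace: "src",
		Subsystem: "telemetrygatewayexporter",
		Name:      "events_pruned",
		Help:      "Events pruned from the queue table.",
	})

	// Simulate two cleanup passes that prune 42 and 8 exported events.
	prunedCounter.Add(42)
	prunedCounter.Add(8)

	// testutil.ToFloat64 reads the current value straight from the collector.
	fmt.Println(testutil.ToFloat64(prunedCounter)) // 50
}
```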
```diff
@@ -9,35 +9,41 @@ import (

    "github.com/sourcegraph/sourcegraph/internal/database"
    "github.com/sourcegraph/sourcegraph/internal/goroutine"
    "github.com/sourcegraph/sourcegraph/internal/metrics"
    "github.com/sourcegraph/sourcegraph/internal/observation"
    "github.com/sourcegraph/sourcegraph/lib/errors"
)

type backlogMetricsJob struct {
type queueMetricsJob struct {
    store database.TelemetryEventsExportQueueStore

    sizeGauge prometheus.Gauge
}

func newBacklogMetricsJob(store database.TelemetryEventsExportQueueStore) goroutine.BackgroundRoutine {
    job := &backlogMetricsJob{
func newQueueMetricsJob(obctx *observation.Context, store database.TelemetryEventsExportQueueStore) goroutine.BackgroundRoutine {
    job := &queueMetricsJob{
        store: store,
        sizeGauge: promauto.NewGauge(prometheus.GaugeOpts{
            Namespace: "src",
            Subsystem: "telemetrygatewayexport",
            Name: "backlog_size",
            Subsystem: "telemetrygatewayexporter",
            Name: "queue_size",
            Help: "Current number of events waiting to be exported.",
        }),
    }
    return goroutine.NewPeriodicGoroutine(
        context.Background(),
        job,
        goroutine.WithName("telemetrygatewayexporter.backlog_metrics"),
        goroutine.WithDescription("telemetrygatewayexporter backlog metrics"),
        goroutine.WithName("telemetrygatewayexporter.queue_metrics_reporter"),
        goroutine.WithDescription("telemetrygatewayexporter backlog metrics reporting"),
        goroutine.WithInterval(time.Minute*5),
        goroutine.WithOperation(obctx.Operation(observation.Op{
            Name: "TelemetryGatewayExporter.ReportQueueMetrics",
            Metrics: metrics.NewREDMetrics(prometheus.DefaultRegisterer, "telemetrygatewayexporter_queue_metrics_reporter"),
        })),
    )
}

func (j *backlogMetricsJob) Handle(ctx context.Context) error {
func (j *queueMetricsJob) Handle(ctx context.Context) error {
    count, err := j.store.CountUnexported(ctx)
    if err != nil {
        return errors.Wrap(err, "store.CountUnexported")
```
```diff
@@ -38,14 +38,21 @@ func (c *config) Load() {

    c.ExportInterval = env.MustGetDuration("TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL", 10*time.Minute,
        "Interval at which to export telemetry")
    if c.ExportInterval > 1*time.Hour {
        c.AddError(errors.New("TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL cannot be more than 1 hour"))
    }

    c.MaxExportBatchSize = env.MustGetInt("TELEMETRY_GATEWAY_EXPORTER_EXPORT_BATCH_SIZE", 5000,
        "Maximum number of events to export in each batch")
    if c.MaxExportBatchSize < 100 {
        c.AddError(errors.New("TELEMETRY_GATEWAY_EXPORTER_EXPORT_BATCH_SIZE must be no less than 100"))
    }

    c.ExportedEventsRetentionWindow = env.MustGetDuration("TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION",
        2*24*time.Hour, "Duration to retain already-exported telemetry events before deleting")

    c.QueueCleanupInterval = env.MustGetDuration("TELEMETRY_GATEWAY_EXPORTER_QUEUE_CLEANUP_INTERVAL",
        1*time.Hour, "Interval at which to clean up telemetry export queue")
        30*time.Minute, "Interval at which to clean up telemetry export queue")
}

type telemetryGatewayExporter struct{}
@@ -95,7 +102,7 @@ func (t *telemetryGatewayExporter) Routines(initCtx context.Context, observation
            exporter,
            *ConfigInst,
        ),
        newQueueCleanupJob(db.TelemetryEventsExportQueue(), *ConfigInst),
        newBacklogMetricsJob(db.TelemetryEventsExportQueue()),
        newQueueCleanupJob(observationCtx, db.TelemetryEventsExportQueue(), *ConfigInst),
        newQueueMetricsJob(observationCtx, db.TelemetryEventsExportQueue()),
    }, nil
}
```
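The exporter's tunables are plain environment variables with defaults and guard rails: the export interval is capped at one hour, the batch size has a floor of 100, and the queue-cleanup interval default drops to 30 minutes. The following is a self-contained sketch of those rules using only the standard library, as a stand-in for the repository's internal `env` helpers (the helper names here are assumptions for illustration, not real APIs):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// durationFromEnv reads a duration from the environment, falling back to def.
func durationFromEnv(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}

// intFromEnv reads an integer from the environment, falling back to def.
func intFromEnv(name string, def int) int {
	if v := os.Getenv(name); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

func main() {
	exportInterval := durationFromEnv("TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL", 10*time.Minute)
	batchSize := intFromEnv("TELEMETRY_GATEWAY_EXPORTER_EXPORT_BATCH_SIZE", 5000)

	// The same validation the config change enforces.
	if exportInterval > time.Hour {
		fmt.Println("invalid: TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL cannot be more than 1 hour")
	}
	if batchSize < 100 {
		fmt.Println("invalid: TELEMETRY_GATEWAY_EXPORTER_EXPORT_BATCH_SIZE must be no less than 100")
	}
	fmt.Printf("export interval=%s batch size=%d\n", exportInterval, batchSize)
}
```

With the `sg.config.overwrite.yaml` values from the test plan above, this would report a 30s export interval.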
```diff
@@ -31,8 +31,10 @@ go_library(
    importpath = "github.com/sourcegraph/sourcegraph/monitoring/definitions",
    visibility = ["//visibility:public"],
    deps = [
        "//lib/pointers",
        "//monitoring/definitions/shared",
        "//monitoring/monitoring",
        "@com_github_grafana_tools_sdk//:sdk",
        "@com_github_prometheus_common//model",
    ],
)
```
```diff
@@ -48,7 +48,7 @@ func (dataAnalytics) NewTelemetryJobOperationsGroup(containerName string) monito
            },
            Namespace: usageDataExporterNamespace,
            DescriptionRoot: "Job operations",
            Hidden: false,
            Hidden: true,
        },
        SharedObservationGroupOptions: SharedObservationGroupOptions{
            Total: NoAlertsOption("none"),
@@ -68,7 +68,7 @@ func (dataAnalytics) NewTelemetryJobOperationsGroup(containerName string) monito
func (dataAnalytics) TelemetryJobThroughputGroup(containerName string) monitoring.Group {
    return monitoring.Group{
        Title: "Usage data exporter: Utilization",
        Hidden: false,
        Hidden: true,
        Rows: []monitoring.Row{
            {
                {
```
```diff
@@ -1,6 +1,12 @@
package definitions

import (
    "time"

    "github.com/grafana-tools/sdk"
    "github.com/prometheus/common/model"

    "github.com/sourcegraph/sourcegraph/lib/pointers"
    "github.com/sourcegraph/sourcegraph/monitoring/definitions/shared"
    "github.com/sourcegraph/sourcegraph/monitoring/monitoring"
)
@@ -12,9 +18,113 @@ func Telemetry() *monitoring.Dashboard {
        Title: "Telemetry",
        Description: "Monitoring telemetry services in Sourcegraph.",
        Groups: []monitoring.Group{
            // Legacy dashboards - TODO(@bobheadxi): remove after 5.2.2
            shared.DataAnalytics.NewTelemetryJobOperationsGroup(containerName),
            shared.DataAnalytics.NewTelemetryJobQueueGroup(containerName),
            shared.DataAnalytics.TelemetryJobThroughputGroup(containerName),

            // The new stuff - https://docs.sourcegraph.com/dev/background-information/telemetry
            {
                Title: "Telemetry Gateway Exporter: Export and queue metrics",
                Hidden: true, // TODO(@bobheadxi): not yet enabled by default, un-hide in 5.2.1
                Rows: []monitoring.Row{
                    {
                        {
                            Name: "telemetry_gateway_exporter_queue_size",
                            Description: "telemetry event payloads pending export",
                            Owner: monitoring.ObservableOwnerDataAnalytics,
                            Query: `sum(src_telemetrygatewayexporter_queue_size)`,
                            Panel: monitoring.Panel().Min(0).LegendFormat("events"),
                            NoAlert: true,
                            Interpretation: "The number of events queued to be exported.",
                        },
                        {
                            Name: "src_telemetrygatewayexporter_exported_events",
                            Description: "events exported from queue per hour",
                            Owner: monitoring.ObservableOwnerDataAnalytics,
                            Query: `max(increase(src_telemetrygatewayexporter_exported_events[1h]))`,
                            Panel: monitoring.Panel().Min(0).LegendFormat("events"),
                            NoAlert: true,
                            Interpretation: "The number of events being exported.",
                        },
                        {
                            Name: "telemetry_gateway_exporter_batch_size",
                            Description: "number of events exported per batch over 30m",
                            Owner: monitoring.ObservableOwnerDataAnalytics,
                            Query: "sum by (le) (rate(src_telemetrygatewayexporter_batch_size_bucket[30m]))",
                            Panel: monitoring.PanelHeatmap().
                                With(func(o monitoring.Observable, p *sdk.Panel) {
                                    p.HeatmapPanel.YAxis.Format = "short"
                                    p.HeatmapPanel.YAxis.Decimals = pointers.Ptr(0)
                                    p.HeatmapPanel.DataFormat = "tsbuckets"
                                    p.HeatmapPanel.Targets[0].Format = "heatmap"
                                    p.HeatmapPanel.Targets[0].LegendFormat = "{{le}}"
                                }),
                            NoAlert: true,
                            Interpretation: "The number of events exported in each batch.",
                        },
                    },
                },
            },
            shared.Observation.NewGroup(containerName, monitoring.ObservableOwnerDataAnalytics, shared.ObservationGroupOptions{
                GroupConstructorOptions: shared.GroupConstructorOptions{
                    ObservableConstructorOptions: shared.ObservableConstructorOptions{
                        MetricNameRoot: "telemetrygatewayexporter_exporter",
                        MetricDescriptionRoot: "events exporter",
                        RangeWindow: model.Duration(30 * time.Minute),
                    },
                    Namespace: "Telemetry Gateway Exporter",
                    DescriptionRoot: "Export job operations",
                    Hidden: true, // TODO(@bobheadxi): not yet enabled by default, un-hide in 5.2.1
                },
                SharedObservationGroupOptions: shared.SharedObservationGroupOptions{
                    Total: shared.NoAlertsOption("none"),
                    Duration: shared.NoAlertsOption("none"),
                    ErrorRate: shared.NoAlertsOption("none"),
                    Errors: shared.WarningOption(monitoring.Alert().Greater(0), `
                        See worker logs in the 'worker.telemetrygateway-exporter' log scope for more details.
                        If logs only indicate that exports failed, reach out to Sourcegraph with relevant log entries, as this may be an issue in Sourcegraph's Telemetry Gateway service.
                    `),
                },
            }),
            shared.Observation.NewGroup(containerName, monitoring.ObservableOwnerDataAnalytics, shared.ObservationGroupOptions{
                GroupConstructorOptions: shared.GroupConstructorOptions{
                    ObservableConstructorOptions: shared.ObservableConstructorOptions{
                        MetricNameRoot: "telemetrygatewayexporter_queue_cleanup",
                        MetricDescriptionRoot: "export queue cleanup",
                        RangeWindow: model.Duration(30 * time.Minute),
                    },
                    Namespace: "Telemetry Gateway Exporter",
                    DescriptionRoot: "Export queue cleanup job operations",
                    Hidden: true, // TODO(@bobheadxi): not yet enabled by default, un-hide in 5.2.1
                },
                SharedObservationGroupOptions: shared.SharedObservationGroupOptions{
                    Total: shared.NoAlertsOption("none"),
                    Duration: shared.NoAlertsOption("none"),
                    ErrorRate: shared.NoAlertsOption("none"),
                    Errors: shared.WarningOption(monitoring.Alert().Greater(0),
                        "See worker logs in the `worker.telemetrygateway-exporter` log scope for more details."),
                },
            }),
            shared.Observation.NewGroup(containerName, monitoring.ObservableOwnerDataAnalytics, shared.ObservationGroupOptions{
                GroupConstructorOptions: shared.GroupConstructorOptions{
                    ObservableConstructorOptions: shared.ObservableConstructorOptions{
                        MetricNameRoot: "telemetrygatewayexporter_queue_metrics_reporter",
                        MetricDescriptionRoot: "export backlog metrics reporting",
                        RangeWindow: model.Duration(30 * time.Minute),
                    },
                    Namespace: "Telemetry Gateway Exporter",
                    DescriptionRoot: "Export queue metrics reporting job operations",
                    Hidden: true,
                },
                SharedObservationGroupOptions: shared.SharedObservationGroupOptions{
                    Total: shared.NoAlertsOption("none"),
                    Duration: shared.NoAlertsOption("none"),
                    ErrorRate: shared.NoAlertsOption("none"),
                    Errors: shared.WarningOption(monitoring.Alert().Greater(0),
                        "See worker logs in the `worker.telemetrygateway-exporter` log scope for more details."),
                },
            }),
        },
    }
}
```
```diff
@@ -720,10 +720,8 @@ commands:
        -v "$(pwd)"/dev/grafana/all:/sg_config_grafana/provisioning/datasources \
        grafana:candidate >"${GRAFANA_LOG_FILE}" 2>&1
    install: |
      echo foobar
      mkdir -p "${GRAFANA_DISK}"
      mkdir -p "$(dirname ${GRAFANA_LOG_FILE})"
      export CACHE=true
      docker inspect $CONTAINER >/dev/null 2>&1 && docker rm -f $CONTAINER
      bazel build //docker-images/grafana:image_tarball
      docker load --input $(bazel cquery //docker-images/grafana:image_tarball --output=files)
```