monitoring(prometheus): add new panels (#17246)
parent 57098f2cb3
commit d0b5dc5ba1
@@ -4892,21 +4892,23 @@ with your code hosts connections or networking issues affecting communication wi

<br />

## prometheus: prometheus_metrics_bloat
## prometheus: prometheus_rule_group_evaluation

<p class="subtitle">prometheus metrics payload size (distribution)</p>
<p class="subtitle">average prometheus rule group evaluation duration over 10m (distribution)</p>

**Descriptions:**

- <span class="badge badge-warning">warning</span> prometheus: 20000B+ prometheus metrics payload size
- <span class="badge badge-warning">warning</span> prometheus: 30s+ average prometheus rule group evaluation duration over 10m

**Possible solutions:**

- Try increasing resources for Prometheus.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-prometheus-rule-group-evaluation)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_metrics_bloat"
  "warning_prometheus_prometheus_rule_group_evaluation"
]
```
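
Aside (not part of this commit): the new panel and alert above are driven by the PromQL expression defined later in this diff, `sum by(rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))`. A minimal, hypothetical Go sketch along these lines can run that same expression against the standard Prometheus HTTP query API to see which rule groups are slow before deciding how much to increase resources; the `http://localhost:9090` address is an assumption and should be replaced with the address of your Sourcegraph Prometheus.

```go
// slow_rule_groups.go: hypothetical helper, not part of this commit.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
)

// queryResponse models only the parts of the /api/v1/query response we need.
type queryResponse struct {
    Data struct {
        Result []struct {
            Metric map[string]string `json:"metric"`
            Value  []interface{}     `json:"value"` // [unixTimestamp, "value"]
        } `json:"result"`
    } `json:"data"`
}

func main() {
    const promAddr = "http://localhost:9090" // assumption: adjust to your deployment
    expr := `sum by(rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))`

    resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(expr))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var qr queryResponse
    if err := json.NewDecoder(resp.Body).Decode(&qr); err != nil {
        log.Fatal(err)
    }

    // Values approaching 30s are what trip the warning alert above.
    for _, r := range qr.Data.Result {
        if len(r.Value) == 2 {
            fmt.Printf("%-50s %vs\n", r.Metric["rule_group"], r.Value[1])
        }
    }
}
```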

@@ -4933,6 +4935,114 @@ with your code hosts connections or networking issues affecting communication wi

<br />

## prometheus: alertmanager_config_status

<p class="subtitle">alertmanager configuration reload status (distribution)</p>

**Descriptions:**

- <span class="badge badge-warning">warning</span> prometheus: less than 1 alertmanager configuration reload status

**Possible solutions:**

- Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-alertmanager-config-status)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_prometheus_alertmanager_config_status"
]
```

<br />

## prometheus: prometheus_tsdb_op_failure

<p class="subtitle">prometheus tsdb failures by operation over 1m (distribution)</p>

**Descriptions:**

- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus tsdb failures by operation over 1m

**Possible solutions:**

- Check Prometheus logs for messages related to the failing operation.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_tsdb_op_failure"
]
```

<br />

## prometheus: prometheus_config_status

<p class="subtitle">prometheus configuration reload status (distribution)</p>

**Descriptions:**

- <span class="badge badge-warning">warning</span> prometheus: less than 1 prometheus configuration reload status

**Possible solutions:**

- Check Prometheus logs for messages related to configuration loading.
- Ensure any custom configuration you have provided Prometheus is valid.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-prometheus-config-status)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_config_status"
]
```

<br />
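
Aside (not part of this commit): the two configuration-status alerts above watch the standard reload metrics `alertmanager_config_last_reload_successful` and `prometheus_config_last_reload_successful` (see the Go definitions later in this diff), where a value of `1` means the last reload succeeded. A small, hypothetical Go sketch for spot-checking both through the Prometheus query API; the address is an assumption.

```go
// config_reload_status.go: hypothetical helper, not part of this commit.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "net/url"
)

// instantValue runs an instant query and returns the first sample value as a string.
func instantValue(promAddr, expr string) (string, error) {
    resp, err := http.Get(promAddr + "/api/v1/query?query=" + url.QueryEscape(expr))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()

    var body struct {
        Data struct {
            Result []struct {
                Value []interface{} `json:"value"` // [unixTimestamp, "value"]
            } `json:"result"`
        } `json:"data"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
        return "", err
    }
    if len(body.Data.Result) == 0 || len(body.Data.Result[0].Value) < 2 {
        return "", fmt.Errorf("no samples for %q", expr)
    }
    return fmt.Sprintf("%v", body.Data.Result[0].Value[1]), nil
}

func main() {
    const promAddr = "http://localhost:9090" // assumption: adjust to your deployment
    for _, metric := range []string{
        "prometheus_config_last_reload_successful",
        "alertmanager_config_last_reload_successful",
    } {
        v, err := instantValue(promAddr, metric)
        if err != nil {
            log.Printf("%s: %v", metric, err)
            continue
        }
        fmt.Printf("%s = %s (1 = last reload succeeded)\n", metric, v)
    }
}
```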

## prometheus: prometheus_target_sample_exceeded

<p class="subtitle">prometheus scrapes that exceed the sample limit over 10m (distribution)</p>

**Descriptions:**

- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus scrapes that exceed the sample limit over 10m

**Possible solutions:**

- Check Prometheus logs for messages related to target scrape failures.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_target_sample_exceeded"
]
```

<br />

## prometheus: prometheus_target_sample_duplicate

<p class="subtitle">prometheus scrapes rejected due to duplicate timestamps over 10m (distribution)</p>

**Descriptions:**

- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus scrapes rejected due to duplicate timestamps over 10m

**Possible solutions:**

- Check Prometheus logs for messages related to target scrape failures.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

```json
"observability.silenceAlerts": [
  "warning_prometheus_prometheus_target_sample_duplicate"
]
```

<br />
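
Aside (not part of this commit): besides reading Prometheus logs, the standard `/api/v1/targets` endpoint can show which scrape targets are currently failing and their last scrape error, which is a quick first step when investigating the two scrape alerts above. A hypothetical Go sketch, assuming Prometheus is reachable at `http://localhost:9090`:

```go
// unhealthy_targets.go: hypothetical helper, not part of this commit.
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// targetsResponse models only the fields of /api/v1/targets we need.
type targetsResponse struct {
    Data struct {
        ActiveTargets []struct {
            Labels    map[string]string `json:"labels"`
            ScrapeURL string            `json:"scrapeUrl"`
            LastError string            `json:"lastError"`
            Health    string            `json:"health"`
        } `json:"activeTargets"`
    } `json:"data"`
}

func main() {
    resp, err := http.Get("http://localhost:9090/api/v1/targets") // assumption: adjust address
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var tr targetsResponse
    if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
        log.Fatal(err)
    }

    // Print every target that is not healthy, with its last scrape error.
    for _, t := range tr.Data.ActiveTargets {
        if t.Health != "up" {
            fmt.Printf("%s (%s): %s\n", t.Labels["job"], t.ScrapeURL, t.LastError)
        }
    }
}
```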

## prometheus: container_cpu_usage

<p class="subtitle">container cpu usage total (1m average) across all cores by instance (distribution)</p>

@@ -2185,11 +2185,14 @@ Refer to the [alert solutions reference](./alert_solutions.md#zoekt-webserver-pr

### Prometheus: Metrics

#### prometheus: prometheus_metrics_bloat
#### prometheus: prometheus_rule_group_evaluation

This distribution panel indicates prometheus metrics payload size.
This distribution panel indicates average prometheus rule group evaluation duration over 10m.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-metrics-bloat) for relevant alerts.
A high value here indicates Prometheus rule evaluation is taking longer than expected.
It might indicate that certain rule groups are taking too long to evaluate, or Prometheus is underprovisioned.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-rule-group-evaluation) for relevant alerts.

<br />

@@ -2203,6 +2206,52 @@ Refer to the [alert solutions reference](./alert_solutions.md#prometheus-alertma

<br />

#### prometheus: alertmanager_config_status

This distribution panel indicates alertmanager configuration reload status.

A `1` indicates Alertmanager reloaded its configuration successfully.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-alertmanager-config-status) for relevant alerts.

<br />

### Prometheus: Prometheus internals

#### prometheus: prometheus_tsdb_op_failure

This distribution panel indicates prometheus tsdb failures by operation over 1m.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-tsdb-op-failure) for relevant alerts.

<br />

#### prometheus: prometheus_config_status

This distribution panel indicates prometheus configuration reload status.

A `1` indicates Prometheus reloaded its configuration successfully.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-config-status) for relevant alerts.

<br />

#### prometheus: prometheus_target_sample_exceeded

This distribution panel indicates prometheus scrapes that exceed the sample limit over 10m.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-target-sample-exceeded) for relevant alerts.

<br />

#### prometheus: prometheus_target_sample_duplicate

This distribution panel indicates prometheus scrapes rejected due to duplicate timestamps over 10m.

Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-target-sample-duplicate) for relevant alerts.

<br />

### Prometheus: Container monitoring (not available on server)

#### prometheus: container_cpu_usage

@@ -16,13 +16,19 @@ func Prometheus() *monitoring.Container {
    Rows: []monitoring.Row{
        {
            {
                Name: "prometheus_metrics_bloat",
                Description: "prometheus metrics payload size",
                Query: `http_response_size_bytes{handler="prometheus",job!="kubernetes-apiservers",job!="kubernetes-nodes",quantile="0.5"}`,
                Warning: monitoring.Alert().GreaterOrEqual(20000),
                Panel: monitoring.Panel().Unit(monitoring.Bytes).LegendFormat("{{instance}}"),
                Owner: monitoring.ObservableOwnerDistribution,
                PossibleSolutions: "none",
                Name: "prometheus_rule_group_evaluation",
                Description: "average prometheus rule group evaluation duration over 10m",
                Query: `sum by(rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))`,
                Warning: monitoring.Alert().GreaterOrEqual(30), // standard prometheus_rule_group_interval_seconds
                Panel: monitoring.Panel().Unit(monitoring.Seconds).MinAuto().LegendFormat("{{rule_group}}"),
                Owner: monitoring.ObservableOwnerDistribution,
                Interpretation: `
                    A high value here indicates Prometheus rule evaluation is taking longer than expected.
                    It might indicate that certain rule groups are taking too long to evaluate, or Prometheus is underprovisioned.
                `,
                PossibleSolutions: `
                    - Try increasing resources for Prometheus.
                `,
            },
        },
    },
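
Aside (not part of this commit): the `observability.silenceAlerts` identifiers shown in the documentation hunks above appear to follow a `<level>_<service>_<observable name>` pattern derived from these definitions (for example `warning` + `prometheus` + `prometheus_rule_group_evaluation`). A trivial, hypothetical sketch of that composition, based purely on the examples in this diff:

```go
// silence_id.go: hypothetical illustration, not part of this commit.
package main

import "fmt"

// silenceID composes an identifier the way the generated docs above appear to:
// <level>_<service>_<observable name>.
func silenceID(level, service, observable string) string {
    return fmt.Sprintf("%s_%s_%s", level, service, observable)
}

func main() {
    fmt.Println(silenceID("warning", "prometheus", "prometheus_rule_group_evaluation"))
    // Output: warning_prometheus_prometheus_rule_group_evaluation
}
```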

@@ -40,6 +46,66 @@ func Prometheus() *monitoring.Container {
                Owner: monitoring.ObservableOwnerDistribution,
                PossibleSolutions: "Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.",
            },
            {
                Name: "alertmanager_config_status",
                Description: "alertmanager configuration reload status",
                Query: `alertmanager_config_last_reload_successful`,
                Warning: monitoring.Alert().Less(1),
                Panel: monitoring.Panel().LegendFormat("reload success").Max(1),
                Owner: monitoring.ObservableOwnerDistribution,
                Interpretation: "A '1' indicates Alertmanager reloaded its configuration successfully.",
                PossibleSolutions: "Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.",
            },
        },
    },
},
{
    Title: "Prometheus internals",
    Hidden: true,
    Rows: []monitoring.Row{
        {
            {
                Name: "prometheus_tsdb_op_failure",
                Description: "prometheus tsdb failures by operation over 1m",
                Query: `increase(label_replace({__name__=~"prometheus_tsdb_(.*)_failed_total"}, "operation", "$1", "__name__", "(.+)s_failed_total")[5m:1m])`,
                Warning: monitoring.Alert().Greater(0),
                Panel: monitoring.Panel().LegendFormat("{{operation}}"),
                Owner: monitoring.ObservableOwnerDistribution,
                PossibleSolutions: "Check Prometheus logs for messages related to the failing operation.",
            },
            {
                Name: "prometheus_config_status",
                Description: "prometheus configuration reload status",
                Query: `prometheus_config_last_reload_successful`,
                Warning: monitoring.Alert().Less(1),
                Panel: monitoring.Panel().LegendFormat("reload success").Max(1),
                Owner: monitoring.ObservableOwnerDistribution,
                Interpretation: "A '1' indicates Prometheus reloaded its configuration successfully.",
                PossibleSolutions: `
                    - Check Prometheus logs for messages related to configuration loading.
                    - Ensure any custom configuration you have provided Prometheus is valid.
                `,
            },
        },
        {
            {
                Name: "prometheus_target_sample_exceeded",
                Description: "prometheus scrapes that exceed the sample limit over 10m",
                Query: "increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m])",
                Warning: monitoring.Alert().Greater(0),
                Panel: monitoring.Panel().LegendFormat("rejected scrapes"),
                Owner: monitoring.ObservableOwnerDistribution,
                PossibleSolutions: "Check Prometheus logs for messages related to target scrape failures.",
            },
            {
                Name: "prometheus_target_sample_duplicate",
                Description: "prometheus scrapes rejected due to duplicate timestamps over 10m",
                Query: "increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[10m])",
                Warning: monitoring.Alert().Greater(0),
                Panel: monitoring.Panel().LegendFormat("rejected scrapes"),
                Owner: monitoring.ObservableOwnerDistribution,
                PossibleSolutions: "Check Prometheus logs for messages related to target scrape failures.",
            },
        },
    },
},
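
Aside (not part of this commit): the `prometheus_tsdb_op_failure` query above uses `label_replace` to turn each `prometheus_tsdb_*_failed_total` metric name into an `operation` label for the panel legend. A small Go sketch of the same rewrite; the sample metric names are illustrative inputs matching the selector, and the regex is anchored with `^...$` to mirror PromQL's fully anchored matching.

```go
// operation_label.go: hypothetical illustration, not part of this commit.
package main

import (
    "fmt"
    "regexp"
)

func main() {
    // label_replace anchors its regex to the whole string, so mirror that here.
    re := regexp.MustCompile(`^(.+)s_failed_total$`)

    for _, name := range []string{
        "prometheus_tsdb_compactions_failed_total",
        "prometheus_tsdb_head_truncations_failed_total",
    } {
        if m := re.FindStringSubmatch(name); m != nil {
            // m[1] is what becomes the "operation" label value in the panel.
            fmt.Printf("%-50s operation=%q\n", name, m[1])
        }
    }
}
```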