monitoring(prometheus): add new panels (#17246)

Robert Lin 2021-01-15 12:12:37 +08:00 committed by GitHub
parent 57098f2cb3
commit d0b5dc5ba1
3 changed files with 239 additions and 14 deletions

@@ -4892,21 +4892,23 @@ with your code hosts connections or networking issues affecting communication wi
<br />
## prometheus: prometheus_metrics_bloat
## prometheus: prometheus_rule_group_evaluation
<p class="subtitle">prometheus metrics payload size (distribution)</p>
<p class="subtitle">average prometheus rule group evaluation duration over 10m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 20000B+ prometheus metrics payload size
- <span class="badge badge-warning">warning</span> prometheus: 30s+ average prometheus rule group evaluation duration over 10m
**Possible solutions:**
- Try increasing resources for Prometheus.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-prometheus-rule-group-evaluation)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_metrics_bloat"
"warning_prometheus_prometheus_rule_group_evaluation"
]
```
@@ -4933,6 +4935,114 @@ with your code hosts connections or networking issues affecting communication wi
<br />
## prometheus: alertmanager_config_status
<p class="subtitle">alertmanager configuration reload status (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: less than 1 alertmanager configuration reload status
**Possible solutions:**
- Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid; a rough example is sketched after this list.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-alertmanager-config-status)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_alertmanager_config_status"
]
```
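For reference, a minimal `observability.alerts` entry in site configuration looks roughly like the following sketch (the notifier type and webhook URL are illustrative placeholders; see the linked alerting guide for the full schema):
```json
"observability.alerts": [
  {
    "level": "critical",
    "notifier": {
      "type": "slack",
      "url": "https://hooks.slack.com/services/..."
    }
  }
]
```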
<br />
## prometheus: prometheus_tsdb_op_failure
<p class="subtitle">prometheus tsdb failures by operation over 1m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus tsdb failures by operation over 1m
**Possible solutions:**
- Check Prometheus logs for messages related to the failing operation.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_tsdb_op_failure"
]
```
<br />
## prometheus: prometheus_config_status
<p class="subtitle">prometheus configuration reload status (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: less than 1 prometheus configuration reload status
**Possible solutions:**
- Check Prometheus logs for messages related to configuration loading.
- Ensure any custom configuration you have provided Prometheus is valid.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-prometheus-config-status)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_config_status"
]
```
<br />
## prometheus: prometheus_target_sample_exceeded
<p class="subtitle">prometheus scrapes that exceed the sample limit over 10m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus scrapes that exceed the sample limit over 10m
**Possible solutions:**
- Check Prometheus logs for messages related to target scrape failures.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_target_sample_exceeded"
]
```
<br />
## prometheus: prometheus_target_sample_duplicate
<p class="subtitle">prometheus scrapes rejected due to duplicate timestamps over 10m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus scrapes rejected due to duplicate timestamps over 10m
**Possible solutions:**
- Check Prometheus logs for messages related to target scrape failures.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_target_sample_duplicate"
]
```
<br />
## prometheus: container_cpu_usage
<p class="subtitle">container cpu usage total (1m average) across all cores by instance (distribution)</p>

@@ -2185,11 +2185,14 @@ Refer to the [alert solutions reference](./alert_solutions.md#zoekt-webserver-pr
### Prometheus: Metrics
#### prometheus: prometheus_metrics_bloat
#### prometheus: prometheus_rule_group_evaluation
This distribution panel indicates prometheus metrics payload size.
This distribution panel indicates average prometheus rule group evaluation duration over 10m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-metrics-bloat) for relevant alerts.
A high value here indicates Prometheus rule evaluation is taking longer than expected.
This may mean that certain rule groups are expensive to evaluate, or that Prometheus is under-provisioned.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-rule-group-evaluation) for relevant alerts.
<br />
@@ -2203,6 +2206,52 @@ Refer to the [alert solutions reference](./alert_solutions.md#prometheus-alertma
<br />
#### prometheus: alertmanager_config_status
This distribution panel indicates alertmanager configuration reload status.
A `1` indicates Alertmanager reloaded its configuration successfully.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-alertmanager-config-status) for relevant alerts.
<br />
### Prometheus: Prometheus internals
#### prometheus: prometheus_tsdb_op_failure
This distribution panel indicates prometheus tsdb failures by operation over 1m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-tsdb-op-failure) for relevant alerts.
<br />
#### prometheus: prometheus_config_status
This distribution panel indicates prometheus configuration reload status.
A `1` indicates Prometheus reloaded its configuration successfully.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-config-status) for relevant alerts.
<br />
#### prometheus: prometheus_target_sample_exceeded
This distribution panel indicates prometheus scrapes that exceed the sample limit over 10m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-target-sample-exceeded) for relevant alerts.
<br />
#### prometheus: prometheus_target_sample_duplicate
This distribution panel indicates prometheus scrapes rejected due to duplicate timestamps over 10m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-target-sample-duplicate) for relevant alerts.
<br />
### Prometheus: Container monitoring (not available on server)
#### prometheus: container_cpu_usage

@@ -16,13 +16,19 @@ func Prometheus() *monitoring.Container {
Rows: []monitoring.Row{
{
{
Name: "prometheus_metrics_bloat",
Description: "prometheus metrics payload size",
Query: `http_response_size_bytes{handler="prometheus",job!="kubernetes-apiservers",job!="kubernetes-nodes",quantile="0.5"}`,
Warning: monitoring.Alert().GreaterOrEqual(20000),
Panel: monitoring.Panel().Unit(monitoring.Bytes).LegendFormat("{{instance}}"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "none",
Name: "prometheus_rule_group_evaluation",
Description: "average prometheus rule group evaluation duration over 10m",
Query: `sum by(rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))`,
Warning: monitoring.Alert().GreaterOrEqual(30), // standard prometheus_rule_group_interval_seconds
Panel: monitoring.Panel().Unit(monitoring.Seconds).MinAuto().LegendFormat("{{rule_group}}"),
Owner: monitoring.ObservableOwnerDistribution,
Interpretation: `
A high value here indicates Prometheus rule evaluation is taking longer than expected.
This may mean that certain rule groups are expensive to evaluate, or that Prometheus is under-provisioned.
`,
PossibleSolutions: `
- Try increasing resources for Prometheus.
`,
},
},
},
@@ -40,6 +46,66 @@ func Prometheus() *monitoring.Container {
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.",
},
{
Name: "alertmanager_config_status",
Description: "alertmanager configuration reload status",
Query: `alertmanager_config_last_reload_successful`,
Warning: monitoring.Alert().Less(1),
Panel: monitoring.Panel().LegendFormat("reload success").Max(1),
Owner: monitoring.ObservableOwnerDistribution,
Interpretation: "A '1' indicates Alertmanager reloaded its configuration successfully.",
PossibleSolutions: "Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.",
},
},
},
},
{
Title: "Prometheus internals",
Hidden: true,
Rows: []monitoring.Row{
{
{
Name: "prometheus_tsdb_op_failure",
Description: "prometheus tsdb failures by operation over 1m",
Query: `increase(label_replace({__name__=~"prometheus_tsdb_(.*)_failed_total"}, "operation", "$1", "__name__", "(.+)s_failed_total")[5m:1m])`,
Warning: monitoring.Alert().Greater(0),
Panel: monitoring.Panel().LegendFormat("{{operation}}"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Check Prometheus logs for messages related to the failing operation.",
},
{
Name: "prometheus_config_status",
Description: "prometheus configuration reload status",
Query: `prometheus_config_last_reload_successful`,
Warning: monitoring.Alert().Less(1),
Panel: monitoring.Panel().LegendFormat("reload success").Max(1),
Owner: monitoring.ObservableOwnerDistribution,
Interpretation: "A '1' indicates Prometheus reloaded its configuration successfully.",
PossibleSolutions: `
- Check Prometheus logs for messages related to configuration loading.
- Ensure any custom configuration you have provided Prometheus is valid.
`,
},
},
{
{
Name: "prometheus_target_sample_exceeded",
Description: "prometheus scrapes that exceed the sample limit over 10m",
Query: "increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m])",
Warning: monitoring.Alert().Greater(0),
Panel: monitoring.Panel().LegendFormat("rejected scrapes"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Check Prometheus logs for messages related to target scrape failures.",
},
{
Name: "prometheus_target_sample_duplicate",
Description: "prometheus scrapes rejected due to duplicate timestamps over 10m",
Query: "increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[10m])",
Warning: monitoring.Alert().Greater(0),
Panel: monitoring.Panel().LegendFormat("rejected scrapes"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Check Prometheus logs for messages related to target scrape failures.",
},
},
},
},