monitoring(prometheus): add new panels (#17246)

Robert Lin 2021-01-15 12:12:37 +08:00 committed by GitHub
parent 57098f2cb3
commit d0b5dc5ba1
3 changed files with 239 additions and 14 deletions

@@ -4892,21 +4892,23 @@ with your code hosts connections or networking issues affecting communication wi
<br />
## prometheus: prometheus_metrics_bloat
## prometheus: prometheus_rule_group_evaluation
<p class="subtitle">prometheus metrics payload size (distribution)</p>
<p class="subtitle">average prometheus rule group evaluation duration over 10m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 20000B+ prometheus metrics payload size
- <span class="badge badge-warning">warning</span> prometheus: 30s+ average prometheus rule group evaluation duration over 10m
**Possible solutions:**
- Try increasing resources for Prometheus.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-prometheus-rule-group-evaluation)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_metrics_bloat"
"warning_prometheus_prometheus_rule_group_evaluation"
]
```
@@ -4933,6 +4935,114 @@ with your code hosts connections or networking issues affecting communication wi
<br />
## prometheus: alertmanager_config_status
<p class="subtitle">alertmanager configuration reload status (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: less than 1 alertmanager configuration reload status
**Possible solutions:**
- Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid; a rough example is sketched after this list.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-alertmanager-config-status)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_alertmanager_config_status"
]
```
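For reference, a minimal `observability.alerts` entry in site configuration looks roughly like the following sketch (the notifier type and webhook URL are illustrative placeholders; see the linked alerting guide for the full schema):
```json
"observability.alerts": [
  {
    "level": "critical",
    "notifier": {
      "type": "slack",
      "url": "https://hooks.slack.com/services/..."
    }
  }
]
```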
<br />
## prometheus: prometheus_tsdb_op_failure
<p class="subtitle">prometheus tsdb failures by operation over 1m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus tsdb failures by operation over 1m
**Possible solutions:**
- Check Prometheus logs for messages related to the failing operation.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_tsdb_op_failure"
]
```
<br />
## prometheus: prometheus_config_status
<p class="subtitle">prometheus configuration reload status (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: less than 1 prometheus configuration reload status
**Possible solutions:**
- Check Prometheus logs for messages related to configuration loading.
- Ensure any custom configuration you have provided Prometheus is valid.
- **Refer to the [dashboards reference](./dashboards.md#prometheus-prometheus-config-status)** for more help interpreting this alert and metric.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_config_status"
]
```
<br />
## prometheus: prometheus_target_sample_exceeded
<p class="subtitle">prometheus scrapes that exceed the sample limit over 10m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus scrapes that exceed the sample limit over 10m
**Possible solutions:**
- Check Prometheus logs for messages related to target scrape failures.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_target_sample_exceeded"
]
```
<br />
## prometheus: prometheus_target_sample_duplicate
<p class="subtitle">prometheus scrapes rejected due to duplicate timestamps over 10m (distribution)</p>
**Descriptions:**
- <span class="badge badge-warning">warning</span> prometheus: 0+ prometheus scrapes rejected due to duplicate timestamps over 10m
**Possible solutions:**
- Check Prometheus logs for messages related to target scrape failures.
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
"observability.silenceAlerts": [
"warning_prometheus_prometheus_target_sample_duplicate"
]
```
<br />
## prometheus: container_cpu_usage
<p class="subtitle">container cpu usage total (1m average) across all cores by instance (distribution)</p>

@@ -2185,11 +2185,14 @@ Refer to the [alert solutions reference](./alert_solutions.md#zoekt-webserver-pr
### Prometheus: Metrics
#### prometheus: prometheus_metrics_bloat
#### prometheus: prometheus_rule_group_evaluation
This distribution panel indicates prometheus metrics payload size.
This distribution panel indicates average prometheus rule group evaluation duration over 10m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-metrics-bloat) for relevant alerts.
A high value here indicates Prometheus rule evaluation is taking longer than expected.
This may mean that certain rule groups are expensive to evaluate, or that Prometheus is under-provisioned.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-rule-group-evaluation) for relevant alerts.
<br />
@@ -2203,6 +2206,52 @@ Refer to the [alert solutions reference](./alert_solutions.md#prometheus-alertma
<br />
#### prometheus: alertmanager_config_status
This distribution panel indicates alertmanager configuration reload status.
A `1` indicates Alertmanager reloaded its configuration successfully.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-alertmanager-config-status) for relevant alerts.
<br />
### Prometheus: Prometheus internals
#### prometheus: prometheus_tsdb_op_failure
This distribution panel indicates prometheus tsdb failures by operation over 1m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-tsdb-op-failure) for relevant alerts.
<br />
#### prometheus: prometheus_config_status
This distribution panel indicates prometheus configuration reload status.
A `1` indicates Prometheus reloaded its configuration successfully.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-config-status) for relevant alerts.
<br />
#### prometheus: prometheus_target_sample_exceeded
This distribution panel indicates prometheus scrapes that exceed the sample limit over 10m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-target-sample-exceeded) for relevant alerts.
<br />
#### prometheus: prometheus_target_sample_duplicate
This distribution panel indicates prometheus scrapes rejected due to duplicate timestamps over 10m.
Refer to the [alert solutions reference](./alert_solutions.md#prometheus-prometheus-target-sample-duplicate) for relevant alerts.
<br />
### Prometheus: Container monitoring (not available on server)
#### prometheus: container_cpu_usage

@@ -16,13 +16,19 @@ func Prometheus() *monitoring.Container {
Rows: []monitoring.Row{
{
{
Name: "prometheus_metrics_bloat",
Description: "prometheus metrics payload size",
Query: `http_response_size_bytes{handler="prometheus",job!="kubernetes-apiservers",job!="kubernetes-nodes",quantile="0.5"}`,
Warning: monitoring.Alert().GreaterOrEqual(20000),
Panel: monitoring.Panel().Unit(monitoring.Bytes).LegendFormat("{{instance}}"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "none",
Name: "prometheus_rule_group_evaluation",
Description: "average prometheus rule group evaluation duration over 10m",
Query: `sum by(rule_group) (avg_over_time(prometheus_rule_group_last_duration_seconds[10m]))`,
Warning: monitoring.Alert().GreaterOrEqual(30), // standard prometheus_rule_group_interval_seconds
Panel: monitoring.Panel().Unit(monitoring.Seconds).MinAuto().LegendFormat("{{rule_group}}"),
Owner: monitoring.ObservableOwnerDistribution,
Interpretation: `
A high value here indicates Prometheus rule evaluation is taking longer than expected.
This may mean that certain rule groups are expensive to evaluate, or that Prometheus is under-provisioned.
`,
PossibleSolutions: `
- Try increasing resources for Prometheus.
`,
},
},
},
@@ -40,6 +46,66 @@ func Prometheus() *monitoring.Container {
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.",
},
{
Name: "alertmanager_config_status",
Description: "alertmanager configuration reload status",
Query: `alertmanager_config_last_reload_successful`,
Warning: monitoring.Alert().Less(1),
Panel: monitoring.Panel().LegendFormat("reload success").Max(1),
Owner: monitoring.ObservableOwnerDistribution,
Interpretation: "A '1' indicates Alertmanager reloaded its configuration successfully.",
PossibleSolutions: "Ensure that your [`observability.alerts` configuration](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) (in site configuration) is valid.",
},
},
},
},
{
Title: "Prometheus internals",
Hidden: true,
Rows: []monitoring.Row{
{
{
Name: "prometheus_tsdb_op_failure",
Description: "prometheus tsdb failures by operation over 1m",
Query: `increase(label_replace({__name__=~"prometheus_tsdb_(.*)_failed_total"}, "operation", "$1", "__name__", "(.+)s_failed_total")[5m:1m])`,
Warning: monitoring.Alert().Greater(0),
Panel: monitoring.Panel().LegendFormat("{{operation}}"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Check Prometheus logs for messages related to the failing operation.",
},
{
Name: "prometheus_config_status",
Description: "prometheus configuration reload status",
Query: `prometheus_config_last_reload_successful`,
Warning: monitoring.Alert().Less(1),
Panel: monitoring.Panel().LegendFormat("reload success").Max(1),
Owner: monitoring.ObservableOwnerDistribution,
Interpretation: "A '1' indicates Prometheus reloaded its configuration successfully.",
PossibleSolutions: `
- Check Prometheus logs for messages related to configuration loading.
- Ensure any custom configuration you have provided Prometheus is valid.
`,
},
},
{
{
Name: "prometheus_target_sample_exceeded",
Description: "prometheus scrapes that exceed the sample limit over 10m",
Query: "increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m])",
Warning: monitoring.Alert().Greater(0),
Panel: monitoring.Panel().LegendFormat("rejected scrapes"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Check Prometheus logs for messages related to target scrape failures.",
},
{
Name: "prometheus_target_sample_duplicate",
Description: "prometheus scrapes rejected due to duplicate timestamps over 10m",
Query: "increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[10m])",
Warning: monitoring.Alert().Greater(0),
Panel: monitoring.Panel().LegendFormat("rejected scrapes"),
Owner: monitoring.ObservableOwnerDistribution,
PossibleSolutions: "Check Prometheus logs for messages related to target scrape failures.",
},
},
},
},