mirror of
https://github.com/sourcegraph/sourcegraph.git
synced 2026-02-06 20:51:43 +00:00
monitoring: dashboards reference, refresh observability docs (#16939)
- New monitoring definition field, 'Interpretation'
- New generated dashboards reference
- Grafana panels now link to alert solutions and dashboards reference
- Refreshed observability docs

Co-authored-by: uwedeportivo <534011+uwedeportivo@users.noreply.github.com>
parent
610782fd3b
commit
ef7f19a756
@ -15,7 +15,7 @@ All notable changes to Sourcegraph are documented in this file.
|
||||
|
||||
### Added
|
||||
|
||||
-
|
||||
- Panels in the Sourcegraph monitoring dashboards now have links to relevant alerts documentation and the new [monitoring dashboards reference](https://docs.sourcegraph.com/admin/observability/dashboards). [#16939](https://github.com/sourcegraph/sourcegraph/pull/16939)
|
||||
|
||||
### Changed
|
||||
|
||||
|
||||
2
cmd/frontend/graphqlbackend/schema.go
generated
@ -7655,7 +7655,7 @@ type MonitoringStatistics {
|
||||
}
|
||||
|
||||
"""
|
||||
A high-level monitoring alert, for details see https://docs.sourcegraph.com/admin/observability/metrics_guide#high-level-alerting-metrics
|
||||
A high-level monitoring alert, for details see https://docs.sourcegraph.com/admin/observability/metrics#high-level-alerting-metrics
|
||||
"""
|
||||
type MonitoringAlert {
|
||||
"""
|
||||
|
||||
@ -7648,7 +7648,7 @@ type MonitoringStatistics {
|
||||
}
|
||||
|
||||
"""
|
||||
A high-level monitoring alert, for details see https://docs.sourcegraph.com/admin/observability/metrics_guide#high-level-alerting-metrics
|
||||
A high-level monitoring alert, for details see https://docs.sourcegraph.com/admin/observability/metrics#high-level-alerting-metrics
|
||||
"""
|
||||
type MonitoringAlert {
|
||||
"""
|
||||
|
||||
@ -25,6 +25,8 @@ The document content can set class names and IDs on elements (for example, Markd
|
||||
|
||||
--note-color: #bce8f1;
|
||||
--warning-color: #faebcc;
|
||||
--warning-badge-color: #f59f00;
|
||||
--critical-badge-color: #f03e3e;
|
||||
--experimental-color: #b200f8;
|
||||
|
||||
--table-row-bg-1: var(--body-bg);
|
||||
@ -430,6 +432,16 @@ body > #page > main > #content {
|
||||
background-color: var(--experimental-color);
|
||||
}
|
||||
|
||||
.badge-warning {
|
||||
color: #ffffff;
|
||||
background-color: var(--warning-badge-color);
|
||||
}
|
||||
|
||||
.badge-critical {
|
||||
color: #ffffff;
|
||||
background-color: var(--critical-badge-color);
|
||||
}
|
||||
|
||||
/* MARKDOWN */
|
||||
.markdown-body {
|
||||
text-size-adjust: 100%;
|
||||
|
||||
@ -108,6 +108,14 @@ If you are running Sourcegraph as a Kubernetes cluster, you have two additional
|
||||
modify
|
||||
[`nginx.ConfigMap.yaml`](https://github.com/sourcegraph/deploy-sourcegraph/blob/master/configure/nginx-svc/nginx.ConfigMap.yaml).
|
||||
|
||||
## Can I consume Sourcegraph's metrics in my own monitoring system (Datadog, New Relic, etc.)?
|
||||
|
||||
Sourcegraph provides [high-level alerting metrics](./observability/metrics.md#high-level-alerting-metrics) which you can integrate into your own monitoring system - see the [alerting custom consumption guide](./observability/alerting_custom_consumption.md) for more details.
|
||||
|
||||
While it is technically possible to consume all of Sourcegraph's metrics in an external system, our recommendation is to utilize the builtin monitoring tools and configure Sourcegraph to [send alerts to your own PagerDuty, Slack, email, etc.](./observability/alerting.md). Metrics and thresholds can change with each release, therefore manually defining the alerts required to monitor Sourcegraph's health is not recommended. Sourcegraph automatically updates the dashboards and alerts on each release to ensure the displayed information is up-to-date.
|
||||
|
||||
Other monitoring systems that support Prometheus scraping (for example, Datadog and New Relic) or [Prometheus federation](https://prometheus.io/docs/prometheus/latest/federation/) can be configured to federate Sourcegraph's [high-level alerting metrics](./observability/metrics.md#high-level-alerting-metrics). For information on how to configure those systems, please check your provider's documentation.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
Content moved to a [dedicated troubleshooting page](troubleshooting.md).
|
||||
|
||||
File diff suppressed because it is too large
@ -1,14 +1,14 @@
|
||||
# Alerts
|
||||
# Alerting
|
||||
|
||||
Alerts can be configured to notify site admins when there is something wrong or noteworthy on the Sourcegraph instance.
|
||||
|
||||
## Understanding alerts
|
||||
|
||||
See [alert solutions](alert_solutions.md) for possible solutions when alerts are firing, and learn more about alert labels and metrics in our [metrics guide](metrics_guide.md).
|
||||
See [alert solutions](alert_solutions.md) for possible solutions when alerts are firing, and learn more about alert labels, metrics, and dashboards in our [metrics guide](metrics.md).
|
||||
|
||||
## Setting up alerting
|
||||
|
||||
Visit your site configuration (e.g. `https://sourcegraph.example.com/site-admin/configuration`) to configure alerts using the `observability.alerts` field. As always, you can use `Ctrl+Space` at any time to get hints about allowed fields as well as relevant documentation inside the configuration editor.
|
||||
Visit your site configuration (e.g. `https://sourcegraph.example.com/site-admin/configuration`) to configure alerts using the [`observability.alerts`](../config/site_config.md#observability-alerts) field. As always, you can use `Ctrl+Space` at any time to get hints about allowed fields as well as relevant documentation inside the configuration editor.
|
||||
|
||||
Once configured, Sourcegraph alerts will automatically be routed to the appropriate notification channels by severity level.
|
||||
|
||||
@ -113,7 +113,7 @@ The test alert may take up to a minute to fire. The triggered alert will automat
|
||||
|
||||
### Silencing alerts
|
||||
|
||||
If there is an alert you are aware of and you wish to silence notifications for it, add an entry to the `observability.silenceAlerts` field. For example:
|
||||
If there is an alert you are aware of and you wish to silence notifications for it, add an entry to the [`observability.silenceAlerts`](../config/site_config.md#observability-silenceAlerts) field. For example:
|
||||
|
||||
```json
|
||||
{
|
||||
|
||||
@ -2,7 +2,7 @@
|
||||
|
||||
If Sourcegraph's builtin [alerting](alerting.md) (which can notify you via email, Slack, PagerDuty, webhook, and more) is not sufficient for you, or if you just prefer to consume the alerts programmatically for some reason, then this page is for you.
|
||||
|
||||
For more information about Sourcegraph alerts, see: [high level alerting metrics](metrics_guide.md#high-level-alerting-metrics).
|
||||
For more information about Sourcegraph alerts, see [high level alerting metrics](metrics.md#high-level-alerting-metrics).
|
||||
|
||||
## Prometheus queries
|
||||
|
||||
|
||||
File diff suppressed because it is too large
5
doc/admin/observability/health_checks.md
Normal file
@ -0,0 +1,5 @@
|
||||
# Health checks
|
||||
|
||||
An application health check status endpoint is available at the URL path `/healthz`. It returns HTTP 200 if and only if the main frontend server and databases (PostgreSQL and Redis) are available.
|
||||
|
||||
The [Kubernetes cluster deployment option](../install/kubernetes/index.md) ships with comprehensive health checks for each Kubernetes deployment.
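
As a minimal sketch of how an external probe might use this endpoint, the Python below treats the instance as healthy only when `/healthz` answers with HTTP 200. The function names, the default timeout, and the example base URL are assumptions made for illustration and are not part of Sourcegraph.

```python
import urllib.error
import urllib.request


def healthz_url(base_url: str) -> str:
    """Build the health check URL for a Sourcegraph instance (hypothetical helper)."""
    return base_url.rstrip("/") + "/healthz"


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True only if /healthz responds with HTTP 200.

    Any other status, a connection failure, or a timeout counts as unhealthy.
    """
    try:
        with urllib.request.urlopen(healthz_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Non-2xx responses raise HTTPError, a subclass of URLError.
        return False
```

For example, `is_healthy("https://sourcegraph.example.com")` could be called from a cron job or an external uptime check.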
|
||||
@ -1,60 +1,44 @@
|
||||
# Observability
|
||||
|
||||
Sourcegraph is designed to meet enterprise production readiness criteria. A key pillar of production
|
||||
readiness is the ability to observe, monitor, and analyze the health and state of the
|
||||
system.
|
||||
<p class="lead">
|
||||
A key pillar of production readiness is the ability to observe, monitor, and analyze the health and state of the system.
|
||||
Sourcegraph is designed to meet enterprise production readiness criteria and ships with a number of observability tools and capabilities.
|
||||
</p>
|
||||
|
||||
> NOTE: If you're using the [Kubernetes cluster deployment
|
||||
> option](https://github.com/sourcegraph/deploy-sourcegraph), see the [Kubernetes cluster
|
||||
> administrator
|
||||
> guide](https://github.com/sourcegraph/deploy-sourcegraph/blob/master/docs/admin-guide.md) for more
|
||||
> information.
|
||||
<div class="getting-started">
|
||||
<a href="metrics" class="btn" alt="Run through the Quickstart guide">
|
||||
<span>Metrics and dashboards</span>
|
||||
</br>
|
||||
Learn about the metrics and dashboards provided out-of-the-box.
|
||||
</a>
|
||||
|
||||
Sourcegraph ships with a number of observability tools and capabilities:
|
||||
<a href="alerting" class="btn" alt="Set up alerting">
|
||||
<span>Alerting</span>
|
||||
</br>
|
||||
Receive notifications about the health of your Sourcegraph deployment.
|
||||
</a>
|
||||
|
||||
* [Metrics and dashboards](metrics.md) via Prometheus and Grafana
|
||||
* [Tracing](tracing.md)
|
||||
* [Alerting](alerting.md)
|
||||
* [Alerting: custom consumption](alerting_custom_consumption.md)
|
||||
* [Logs](#logs)
|
||||
* [Health checks](#health-checks)
|
||||
* [Other tools](#other-tools)
|
||||
<a href="troubleshooting" class="btn" alt="Set up alerting">
|
||||
<span>Troubleshooting</span>
|
||||
</br>
|
||||
Investigate a specific production issue with our troubleshooting guide.
|
||||
</a>
|
||||
</div>
|
||||
|
||||
If you are investigating a specific production issue, consult the [troubleshooting guide](troubleshooting.md).
|
||||
## Guides
|
||||
|
||||
## Logs
|
||||
* [Metrics and dashboards](metrics.md)
|
||||
* [Alerting](./alerting.md)
|
||||
* [Tracing](./tracing.md)
|
||||
* [Logs](./logs.md)
|
||||
* [Health checks](./health_checks.md)
|
||||
* [Troubleshooting guide](troubleshooting.md)
|
||||
|
||||
### Log levels
|
||||
## Reference
|
||||
|
||||
A Sourcegraph service's log level is configured via the environment variable `SRC_LOG_LEVEL`. The valid values (from most to least verbose) are:
|
||||
|
||||
* `dbug`: Debug. Output all logs. Default in cluster deployments.
|
||||
* `info`: Informational.
|
||||
* `warn`: Warning. Default in Docker deployments.
|
||||
* `eror`: Error.
|
||||
* `crit`: Critical.
|
||||
|
||||
### Log format
|
||||
|
||||
A Sourcegraph service's log output format is configured via the environment variable `SRC_LOG_FORMAT`. The valid values are:
|
||||
|
||||
* `condensed`: Optimized for human readability.
|
||||
* `json`: Machine-readable JSON format.
|
||||
* `logfmt`: The [logfmt](https://github.com/kr/logfmt) format.
|
||||
|
||||
## Health checks
|
||||
|
||||
An application health check status endpoint is available at the URL path `/healthz`. It returns HTTP 200 if and only if the main frontend server and databases (PostgreSQL and Redis) are available.
|
||||
|
||||
The [Kubernetes cluster deployment option](https://github.com/sourcegraph/deploy-sourcegraph) ships with comprehensive health checks for each Kubernetes deployment.
|
||||
|
||||
## Other tools
|
||||
|
||||
- [Sentry](https://sentry.io) error reporting, configured via the `sentry` property in [site configuration](../config/site_config.md)
|
||||
- [Go net/trace](#viewing-go-net-trace-information)
|
||||
- [Honeycomb](https://honeycomb.io/)
|
||||
* [Dashboards reference](./dashboards.md)
|
||||
* [Alert solutions](./alert_solutions.md)
|
||||
|
||||
## Support
|
||||
|
||||
For help configuring monitoring and tracing on your Sourcegraph instance, use our [public issue
|
||||
tracker](https://github.com/sourcegraph/issues/issues).
|
||||
For help configuring observability on your Sourcegraph instance, use our [public issue tracker](https://github.com/sourcegraph/issues/issues).
|
||||
|
||||
19
doc/admin/observability/logs.md
Normal file
@ -0,0 +1,19 @@
|
||||
# Logs
|
||||
|
||||
## Log levels
|
||||
|
||||
A Sourcegraph service's log level is configured via the environment variable `SRC_LOG_LEVEL`. The valid values (from most to least verbose) are:
|
||||
|
||||
* `dbug`: Debug. Output all logs. Default in cluster deployments.
|
||||
* `info`: Informational.
|
||||
* `warn`: Warning. Default in Docker deployments.
|
||||
* `eror`: Error.
|
||||
* `crit`: Critical.
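
The ordering above can be sketched in Python as follows. This is an illustration only: it assumes that a configured level emits messages at that level and at every less verbose level, which is the usual convention but is not spelled out on this page. The `is_emitted` helper is hypothetical.

```python
# Valid SRC_LOG_LEVEL values, ordered from most to least verbose.
LOG_LEVELS = ["dbug", "info", "warn", "eror", "crit"]


def is_emitted(message_level: str, configured_level: str) -> bool:
    """Return True if a message at message_level would be logged.

    Assumption: a message is logged when its level is at least as severe
    (that is, no more verbose) than the configured level.
    """
    return LOG_LEVELS.index(message_level) >= LOG_LEVELS.index(configured_level)
```

Under this assumption, `SRC_LOG_LEVEL=warn` would log `warn`, `eror`, and `crit` messages and drop `dbug` and `info` messages.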
|
||||
|
||||
## Log format
|
||||
|
||||
A Sourcegraph service's log output format is configured via the environment variable `SRC_LOG_FORMAT`. The valid values are:
|
||||
|
||||
* `condensed`: Optimized for human readability.
|
||||
* `json`: Machine-readable JSON format.
|
||||
* `logfmt`: The [logfmt](https://github.com/kr/logfmt) format.
|
||||
@ -1,113 +1,130 @@
|
||||
# Metrics
|
||||
# Metrics and dashboards
|
||||
|
||||
Sourcegraph uses [Prometheus](https://prometheus.io/) for metrics and [Grafana](https://grafana.com) for metrics dashboards.
|
||||
|
||||
If you're using the [Kubernetes cluster deployment
|
||||
option](https://github.com/sourcegraph/deploy-sourcegraph), see the [Prometheus
|
||||
README](https://github.com/sourcegraph/deploy-sourcegraph/blob/master/configure/prometheus/README.md)
|
||||
for more information.
|
||||
|
||||
## Prometheus
|
||||
|
||||
Prometheus is a monitoring tool that collects application- and system-level metrics over time and
|
||||
makes these accessible through a query language and simple UI.
|
||||
|
||||
### Accessing Prometheus
|
||||
|
||||
Most of the time, Sourcegraph site admins will monitor key metrics through the Grafana UI, rather
|
||||
than through Prometheus directly. Grafana provides the dashboards that monitor the standard metrics
|
||||
that indicate the health of the instance. Only if an admin wants to write a novel metrics formula or
|
||||
query do they need to access the Prometheus UI.
|
||||
|
||||
If you are using single-container Sourcegraph, you will need to restart the Sourcegraph container
|
||||
with a flag `--publish 9090:9090` in the `docker run` command. Subsequently, you can access
|
||||
Prometheus at http://localhost:9090.
|
||||
|
||||
If you are using the Sourcegraph Kubernetes Cluster, port-forward the Prometheus service:
|
||||
|
||||
```
|
||||
kubectl port-forward svc/prometheus 9090:30090
|
||||
```
|
||||
|
||||
#### Configuration
|
||||
|
||||
Sourcegraph runs a slightly customized image of Prometheus, which packages a standard Prometheus
|
||||
installation together with rules files and target files tailored to Sourcegraph.
|
||||
|
||||
A directory can be mounted at `/sg_prometheus_add_ons`. It can contain additional config files of two types:
|
||||
- rule files which must have the suffix `_rules.yml` in their filename (ie `gitserver_rules.yml`)
|
||||
- target files which must have the suffix `_targets.yml` in their filename (ie `local_targets.yml`)
|
||||
|
||||
[Rule files](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
|
||||
and [target files](https://prometheus.io/docs/guides/file-sd/) must use the latest Prometheus 2.x syntax.
|
||||
|
||||
The environment variable `PROMETHEUS_ADDITIONAL_FLAGS` can be used to pass on additional flags to the `prometheus` executable running in the container.
|
||||
Sourcegraph ships with [Grafana](https://grafana.com) for dashboards and [Prometheus](https://prometheus.io/) for metrics and alerting. We also provide [built-in alerting](./alerting.md) for these metrics.
|
||||
|
||||
## Grafana
|
||||
|
||||
Site admins can view the monitoring dashboards on a Sourcegraph instance:
|
||||
Site admins can view the Grafana monitoring dashboards on a Sourcegraph instance:
|
||||
|
||||
1. Go to **User menu > Site admin**.
|
||||
1. Open the **Monitoring** page (left sidebar). The URL is
|
||||
`https://sourcegraph.example.com/-/debug/grafana/?orgId=1`.
|
||||
1. Read the [Sourcegraph Grafana dashboard descriptions](dashboards.md) before exploring
|
||||
the dashboards.
|
||||
1. Open the **Monitoring** page from the link in the left sidebar. The URL is `https://sourcegraph.example.com/-/debug/grafana/`.
|
||||
|
||||
> NOTE: There is a [known issue](https://github.com/sourcegraph/sourcegraph/issues/6075) where
|
||||
> attempting to edit a dashboard will result in a 403 response with "invalid CSRF token". As a
|
||||
> workaround, site admins can connect to Grafana directly (described below) to edit the dashboards.
|
||||
<img src="https://user-images.githubusercontent.com/3173176/82078081-65c62780-9695-11ea-954a-84e8e9686970.png" class="screenshot" alt="Sourcegraph dashboard">
|
||||
|
||||
### Available dashboards
|
||||
|
||||
A complete [dashboard reference](dashboards.md) is available for more context on our available dashboards.
|
||||
|
||||
### Grafana configuration
|
||||
|
||||
Sourcegraph deploys a customized image of Grafana, which ships with Sourcegraph-specific dashboard definitions.
|
||||
|
||||
To provide custom dashboards, a directory containing dashboard JSON specifications can be mounted in the Docker container at `/sg_grafana_additional_dashboards`.
|
||||
Changes to files in that directory will be detected automatically while Grafana is running.
|
||||
|
||||
More behavior can be controlled with [environment variables](https://grafana.com/docs/grafana/latest/administration/configuration/#configure-with-environment-variables).
|
||||
|
||||
> NOTE: There is a [known issue](https://github.com/sourcegraph/sourcegraph/issues/6075) where attempting to edit anything using the Grafana UI will result in a 403 response with "invalid CSRF token".
|
||||
> As a workaround, site admins can [connect to Grafana directly](#accessing-grafana-directly) to make changes using the Grafana UI.
|
||||
|
||||
### Accessing Grafana directly
|
||||
|
||||
Follow the instructions below to access Grafana directly, and add, modify and delete your own dashboards and panels.
|
||||
For most use cases, you can access Grafana [through your Sourcegraph instance](#grafana).
|
||||
Follow the instructions below to access Grafana directly, for example to edit its configuration.
|
||||
|
||||
#### Kubernetes
|
||||
> NOTE: Most of the dashboards that Sourcegraph ships with are not configurable through the Grafana UI.
|
||||
> In general, we recommend [these configuration methods instead](#grafana-configuration).
|
||||
|
||||
If you're using the [Kubernetes cluster deployment
|
||||
option](https://github.com/sourcegraph/deploy-sourcegraph), you can access Grafana directly using
|
||||
Kubernetes port forwarding to your local machine:
|
||||
If you are using the [Kubernetes deployment option](../install/kubernetes/index.md), you can access Grafana directly using Kubernetes port forwarding to your local machine:
|
||||
|
||||
|
||||
```
|
||||
```sh
|
||||
kubectl port-forward svc/grafana 3370:30070
|
||||
```
|
||||
|
||||
Now visit http://localhost:3370/-/debug/grafana.
|
||||
Grafana will be available at http://localhost:3370/-/debug/grafana.
|
||||
|
||||
#### Single-container server deployments
|
||||
|
||||
For simplicity, Grafana does not require authentication, as the port binding of 3370 is restricted to connections from localhost only.
|
||||
|
||||
Therefore, if accessing Grafana locally, the URL will be http://localhost:3370/-/debug/grafana. If Sourcegraph is deployed to a remote server, then access via an SSH tunnel using a tool
|
||||
such as [sshuttle](https://github.com/sshuttle/sshuttle) is required to establish a secure connection to Grafana.
|
||||
If you are using [Docker](../install/docker/index.md) or the [docker-compose deployment option](../install/index.md), Grafana is available locally at http://localhost:3370/-/debug/grafana without any additional setup.
|
||||
If Sourcegraph is deployed to a remote server, then access via an SSH tunnel using a tool such as [sshuttle](https://github.com/sshuttle/sshuttle) is required to establish a secure connection to Grafana.
|
||||
To access the remote server using `sshuttle` from your local machine:
|
||||
|
||||
```bash
|
||||
sshuttle -r user@host 0/0
|
||||
```
|
||||
|
||||
Then simply visit http://host:3370 in your browser.
|
||||
Grafana will be available at http://host:3370/-/debug/grafana.
|
||||
|
||||
#### Configuration
|
||||
> WARNING: Our Grafana instance runs in anonymous mode with all authentication turned off, since we rely on Sourcegraph's built-in authentication.
|
||||
> Please be careful when exposing it directly to external traffic.
|
||||
|
||||
Sourcegraph runs a slightly customized image of Grafana, which includes a standard Grafana
|
||||
installation initialized with Sourcegraph-specific dashboard definitions.
|
||||
## Prometheus
|
||||
|
||||
> NOTE: Our Grafana instance runs in anonymous mode with all authentication turned off. Please be careful when exposing it to external traffic.
|
||||
Prometheus is a monitoring tool that collects application- and system-level metrics over time and makes these accessible through a robust query language.
|
||||
|
||||
A directory containing dashboard JSON specifications can be mounted in the Docker container at
|
||||
`/sg_grafana_additional_dashboards`. Changes to files in that directory will be detected
|
||||
automatically while Grafana is running.
|
||||
For most use cases, you can query Prometheus through [Grafana](#grafana) using Grafana's Explore panel, available at `/-/debug/grafana/explore` on your Sourcegraph instance, or simply rely on the dashboards we ship.
|
||||
|
||||
More behavior can be controlled with
|
||||
[environmental variables](https://grafana.com/docs/installation/configuration/).
|
||||
### Available metrics
|
||||
|
||||
### FAQ
|
||||
#### High-level alerting metrics
|
||||
|
||||
#### Can I consume Sourcegraph's metrics in my own monitoring system (Datadog, New Relic, etc.)?
|
||||
Sourcegraph's metrics include a single high-level metric `alert_count` which indicates the number of `level=critical` and `level=warning` alerts each Sourcegraph service has fired over time. This is the same metric presented on the **Overview** Grafana dashboard.
|
||||
|
||||
Sourcegraph provides an [HTTP API](alerting_custom_consumption.md) and [high-level alerting metrics](metrics_guide.md) which you can integrate into your own monitoring system.
|
||||
We provide [built-in alerting](./alerting.md) for these metrics. Refer to our [alert solutions reference](./alert_solutions.md) for details on specific alerts.
|
||||
|
||||
While it is technically possible to consume all of Sourcegraph's metrics in an external system, our recommendation is to utilize the builtin monitoring tools and configure Sourcegraph to [send alerts to your own PagerDuty, Slack, email, etc.](alerting.md). Metrics and thresholds can change with each release, therefore manually defining the alerts required to monitor Sourcegraph's health is not recommended. Sourcegraph automatically updates the dashboards and alerts on each release to ensure the displayed information is up-to-date.
|
||||
**Description:** The number of alerts each service has fired and their severity level. The severity levels are defined as follows:
|
||||
|
||||
Other monitoring systems that support Prometheus scraping (for example, Datadog and New Relic) or [Prometheus federation](https://prometheus.io/docs/prometheus/latest/federation/) can be configured to federate Sourcegraph's [high-level alerting metric](metrics_guide.md). For information on how to configure those systems, please check your provider's documentation.
|
||||
- `critical`: something is _definitively_ wrong with Sourcegraph. We suggest using a high-visibility notification channel for these alerts.
|
||||
- **Examples:** Database inaccessible, running out of disk space, running out of memory.
|
||||
- **Suggested action:** Page a site administrator to investigate.
|
||||
- `warning`: something _could_ be wrong with Sourcegraph. We suggest checking in on these periodically, or using a notification channel that will not bother anyone if it is spammed. Over time, as warning alerts become stable and reliable across many Sourcegraph deployments, they will also be promoted to critical alerts in an update by Sourcegraph.
|
||||
- **Examples:** High latency, high search timeouts.
|
||||
- **Suggested action:** Email a site administrator to investigate and monitor when convenient, and please let us know so that we can improve them.
|
||||
|
||||
**Values:**
|
||||
|
||||
- Although the values of `alert_count` are floating-point numbers, only their whole numbers have meaning. For example: `0.5` and `0.7` indicate no alerts are firing, while `1.2` indicates exactly one alert is firing and `3.0` indicates exactly three alerts firing.
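
A minimal sketch of this rule in Python, for anyone consuming `alert_count` samples in their own tooling. The function name is hypothetical.

```python
import math


def firing_alert_count(value: float) -> int:
    """Return the number of firing alerts encoded in an alert_count sample.

    Only the whole-number part of the value is meaningful, so round down.
    """
    return math.floor(value)
```

With the values from the example above, `0.5` and `0.7` map to 0 alerts, `1.2` maps to 1, and `3.0` maps to 3.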
|
||||
|
||||
**Labels:**
|
||||
|
||||
- `level`: either `critical` or `warning`, as defined above.
|
||||
- `service_name`: the name of the service that fired the alert.
|
||||
- `name`: the name of the alert that the service fired.
|
||||
- `description`: a human-readable description of the alert.
|
||||
|
||||
#### Complete reference
|
||||
|
||||
A complete reference of Sourcegraph's vast set of Prometheus metrics is not yet available. If you are interested in this, please reach out by filing an issue or contacting us at support@sourcegraph.com.
|
||||
|
||||
### Prometheus configuration
|
||||
|
||||
Sourcegraph runs a customized image of Prometheus, which packages a standard Prometheus installation together with rules files and target files tailored to Sourcegraph and quality-of-life integrations such as [the ability to configure alerting from the Sourcegraph web application](./alerting/index.md).
|
||||
|
||||
A directory can be mounted at `/sg_prometheus_add_ons`. It can contain additional config files of two types:
|
||||
|
||||
- rule files, which must have the suffix `_rules.yml` in their filename (e.g. `gitserver_rules.yml`)
|
||||
- target files, which must have the suffix `_targets.yml` in their filename (e.g. `local_targets.yml`)
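
The naming convention can be checked before mounting a directory with a short Python sketch like the one below. The `classify_add_on` helper is hypothetical, and returning `None` for other filenames is an assumption: this page does not say how files with other names are treated.

```python
from typing import Optional


def classify_add_on(filename: str) -> Optional[str]:
    """Classify a file in /sg_prometheus_add_ons by its filename suffix."""
    if filename.endswith("_rules.yml"):
        return "rule"
    if filename.endswith("_targets.yml"):
        return "target"
    # Assumption: anything else matches neither documented type.
    return None
```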
|
||||
|
||||
[Rule files](https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/)
|
||||
and [target files](https://prometheus.io/docs/guides/file-sd/) must use the latest Prometheus 2.x syntax.
|
||||
|
||||
The environment variable `PROMETHEUS_ADDITIONAL_FLAGS` can be used to pass on additional flags to the `prometheus` executable running in the container.
|
||||
|
||||
### Accessing Prometheus directly
|
||||
|
||||
Most of the time, Sourcegraph site admins will monitor and query key metrics through [Grafana](#grafana), rather than through Prometheus directly.
|
||||
Grafana also provides the dashboards that monitor the standard metrics that indicate the health of the instance.
|
||||
Follow the instructions below to access Prometheus directly instead.
|
||||
|
||||
If you are using the [Kubernetes deployment option](../install/kubernetes/index.md), port-forward the Prometheus service:
|
||||
|
||||
```sh
|
||||
kubectl port-forward svc/prometheus 9090:30090
|
||||
```
|
||||
|
||||
If you are using [Docker](../install/docker/index.md) or the [docker-compose deployment option](../install/index.md), you will need to restart the Sourcegraph container
|
||||
with a flag `--publish 9090:9090` in the `docker run` command.
|
||||
|
||||
Prometheus will be available at http://localhost:9090.
|
||||
|
||||
## Using a custom monitoring system
|
||||
|
||||
Please refer to our FAQ item, ["Can I consume Sourcegraph's metrics in my own monitoring system (Datadog, New Relic, etc.)"](../faq.md#can-i-consume-sourcegraph-s-metrics-in-my-own-monitoring-system-datadog-new-relic-etc).
|
||||
|
||||
@ -1,50 +0,0 @@
|
||||
# Metrics guide
|
||||
|
||||
## High-level alerting metrics
|
||||
|
||||
Sourcegraph's metrics include a single high-level metric `alert_count` which indicates the number of `level=critical` and `level=warning` alerts each service has fired over time for each Sourcegraph service. This is the same metric presented on the **Overview** Grafana dashboard:
|
||||
|
||||

|
||||
|
||||
To set up notifications for these alerts, see: [alerting](alerting.md).
|
||||
|
||||
### `alert_count`
|
||||
|
||||
**Description:** The number of alerts each service has fired and their severity level. The severity levels are defined as follows:
|
||||
|
||||
- `critical`: something is _definitively_ wrong with Sourcegraph. We suggest using a high-visibility notification channel for these alerts.
|
||||
- **Examples:** Database inaccessible, running out of disk space, running out of memory.
|
||||
- **Suggested action:** Page a site administrator to investigate.
|
||||
- `warning`: something _could_ be wrong with Sourcegraph. We suggest checking in on these periodically, or using a notification channel that will not bother anyone if it is spammed. Over time, as warning alerts become stable and reliable across many Sourcegraph deployments, they will also be promoted to critical alerts in an update by Sourcegraph.
|
||||
- **Examples:** High latency, high search timeouts.
|
||||
- **Suggested action:** Email a site administrator to investigate and monitor when convenient, and please let us know so that we can improve them.
|
||||
|
||||
**Values:**
|
||||
|
||||
- Although the values of `alert_count` are floating-point numbers, only their whole numbers have meaning. For example: `0.5` and `0.7` indicate no alerts are firing, while `1.2` indicates exactly one alert is firing and `3.0` indicates exactly three alerts firing.
|
||||
|
||||
**Labels:**
|
||||
|
||||
- `level`: either `critical` or `warning`, as defined above.
|
||||
- `service_name`: the name of the service that fired the alert, one of the following constants:
|
||||
- `"frontend"`
|
||||
- `"github-proxy"`
|
||||
- `"gitserver"`
|
||||
- `"precise-code-intel"`
|
||||
- `"query-runner"`
|
||||
- `"repo-updater"`
|
||||
- `"searcher"`
|
||||
- `"symbols"`
|
||||
- `"zoekt-indexserver"`
|
||||
- `"zoekt-webserver"`
|
||||
- `"syntect-server"`
|
||||
- `name`: the name of the alert that the service fired (chosen by the service)
|
||||
- `description`: a human-readable description of the alert
|
||||
|
||||
**Examples:**
|
||||
|
||||
To get examples of how you might consume this metric in your own alerting system, see: [Custom consumption of Sourcegraph alerts](alerting_custom_consumption.md).
|
||||
|
||||
## Complete reference
|
||||
|
||||
A complete reference of Sourcegraph's vast set of Prometheus metrics is not yet available. If you are interested in this, please reach out by filing an issue or contacting us at support@sourcegraph.com.
|
||||
@ -8,6 +8,7 @@ import (
|
||||
|
||||
amconfig "github.com/prometheus/alertmanager/config"
|
||||
commoncfg "github.com/prometheus/common/config"
|
||||
|
||||
"github.com/sourcegraph/sourcegraph/schema"
|
||||
)
|
||||
|
||||
@ -23,6 +24,8 @@ const (
|
||||
colorGood = "#00FF00" // green
|
||||
)
|
||||
|
||||
const alertSolutionsURL = "https://docs.sourcegraph.com/admin/observability/alert_solutions"
|
||||
|
||||
// commonLabels defines the set of labels we group alerts by, such that each alert falls in a unique group.
|
||||
// These labels are available in Alertmanager templates as fields of `.CommonLabels`.
|
||||
//
|
||||
@ -32,11 +35,13 @@ const (
|
||||
// When changing this, make sure to update the webhook body documentation in /doc/admin/observability/alerting.md
|
||||
var commonLabels = []string{"alertname", "level", "service_name", "name", "owner", "description"}
|
||||
|
||||
// Static alertmanager templates
|
||||
// Static alertmanager templates. Templating reference: https://prometheus.io/docs/alerting/latest/notifications
|
||||
//
|
||||
// All `.CommonLabels` labels used in these templates should be included in `route.GroupByStr` in order for them to be available.
|
||||
var (
|
||||
// Alertmanager notification template reference: https://prometheus.io/docs/alerting/latest/notifications
|
||||
// All labels used in these templates should be included in route.GroupByStr
|
||||
alertSolutionsURLTemplate = `https://docs.sourcegraph.com/admin/observability/alert_solutions#{{ .CommonLabels.service_name }}-{{ .CommonLabels.name | reReplaceAll "(_low|_high)$" "" | reReplaceAll "_" "-" }}`
|
||||
// observableDocAnchorTemplate must match anchors generated in `monitoring/monitoring/documentation.go`.
|
||||
observableDocAnchorTemplate = `{{ .CommonLabels.service_name }}-{{ .CommonLabels.name | reReplaceAll "(_low|_high)$" "" | reReplaceAll "_" "-" }}`
|
||||
alertSolutionsURLTemplate = fmt.Sprintf(`%s#%s`, alertSolutionsURL, observableDocAnchorTemplate)
|
||||
|
||||
// Title templates
|
||||
firingTitleTemplate = "[{{ .CommonLabels.level | toUpper }}] {{ .CommonLabels.description }}"
|
||||
|
||||
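The `observableDocAnchorTemplate` above is an Alertmanager template, so it can be hard to see what link it produces for a given alert. The following is a minimal Go sketch of the same transformation, not the generator's own code: the helper names `observableDocAnchor` and `alertSolutionsLink` are hypothetical, and the sketch assumes the `service_name` label is used unchanged while only the `name` label has its `_low`/`_high` suffix stripped and underscores replaced.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Base URL copied from the alertSolutionsURL constant in the hunk above.
const alertSolutionsURL = "https://docs.sourcegraph.com/admin/observability/alert_solutions"

// levelSuffix matches the trailing "_low" or "_high" that the template removes.
var levelSuffix = regexp.MustCompile(`(_low|_high)$`)

// observableDocAnchor is a hypothetical helper that mirrors the template:
// strip the level suffix from the alert name, turn underscores into hyphens,
// and prefix the result with the service name.
func observableDocAnchor(serviceName, name string) string {
	name = levelSuffix.ReplaceAllString(name, "")
	name = strings.ReplaceAll(name, "_", "-")
	return serviceName + "-" + name
}

// alertSolutionsLink (also hypothetical) joins the base URL and the anchor,
// as the fmt.Sprintf call in the hunk does for the template strings.
func alertSolutionsLink(serviceName, name string) string {
	return fmt.Sprintf("%s#%s", alertSolutionsURL, observableDocAnchor(serviceName, name))
}

func main() {
	fmt.Println(alertSolutionsLink("gitserver", "running_git_commands"))
}
```

Under these assumptions, an alert named `running_git_commands` on `gitserver` links to the `#gitserver-running-git-commands` anchor of the alert solutions page.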
@ -7,6 +7,14 @@ This page primarily documents the [generator's current capabilities](#features)
|
||||
|
||||
To learn about how to find, add, and use monitoring, see the [Sourcegraph monitoring developer guide](https://about.sourcegraph.com/handbook/engineering/observability/monitoring).
|
||||
|
||||
- [Usage](#usage)
|
||||
- [Features](#features)
|
||||
- [Documentation generation](#documentation-generation)
|
||||
- [Grafana integration](#grafana-integration)
|
||||
- [Prometheus integration](#prometheus-integration)
|
||||
- [Alertmanager integration](#alertmanager-integration)
|
||||
- [Development](#development)
|
||||
|
||||
## Usage
|
||||
|
||||
From this directory:
|
||||
@ -22,9 +30,13 @@ Other configuration options can be customized via flags declared in [`main.go`](
|
||||
|
||||
### Documentation generation
|
||||
|
||||
The generator automatically creates documentation from monitoring definitions, such as [alert solutions references](https://docs.sourcegraph.com/admin/observability/alert_solutions), that customers and engineers can reference.
|
||||
The generator automatically creates documentation from monitoring definitions that customers and engineers can reference.
|
||||
These include:
|
||||
|
||||
Links to generated documentation can be provided in our other generated integrations - for example, [Slack alerts](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) will provide a link to the appropriate alert solutions entry.
|
||||
- [Alert solutions reference](https://docs.sourcegraph.com/admin/observability/alert_solutions)
|
||||
- [Dashboards reference](https://docs.sourcegraph.com/admin/observability/dashboards)
|
||||
|
||||
Links to generated documentation can be provided in our other generated integrations - for example, [Slack alerts](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) will provide a link to the appropriate alert solutions entry, and [Grafana panels](#grafana-integration) will link to the appropriate dashboards reference entry.
|
||||
|
||||
### Grafana integration
|
||||
|
||||
@ -38,20 +50,27 @@ It also takes care of the following:
|
||||
- Threshold lines for alerts of all levels are rendered in graphs
|
||||
- Formatting of units, labels, and more (using either the defaults, or the [`ObservablePanelOptions` API](./monitoring/README.md#type-observablepaneloptions))
|
||||
- Maintaining a uniform look and feel across all dashboards
|
||||
- Providing links to [generated documentation](#documentation-generation)
|
||||
|
||||
Links to generated documentation can be provided in our other generated integrations - for example, [Slack alerts](https://docs.sourcegraph.com/admin/observability/alerting#setting-up-alerting) will provide a link to the appropriate service's dashboard.
|
||||
|
||||
### Prometheus integration
|
||||
|
||||
The generator automatically generates and ships Prometheus recording rules and alerts within the [Sourcegraph Prometheus distribution](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_architecture#sourcegraph-prometheus). This includes the [`alert_count` recording rules](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_architecture#alert-count-metrics) and native Prometheus alerts, all with appropriate and consistent labels.
|
||||
The generator automatically generates and ships Prometheus recording rules and alerts within the [Sourcegraph Prometheus distribution](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_architecture#sourcegraph-prometheus).
|
||||
This includes the following, all with appropriate and consistent labels:
|
||||
|
||||
- [`alert_count` recording rules](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_architecture#alert-count-metrics)
|
||||
- Native Prometheus alerts, leveraged by our [Alertmanager integration](#alertmanager-integration)
|
||||
|
||||
Generated Prometheus recording rules are leveraged by the [Grafana integration](#grafana-integration).
|
||||
|
||||
### Alertmanager integration
|
||||
|
||||
The generator's [Prometheus integration](#prometheus-integration) is a critical part of the [Sourcegraph's alerting capabilities](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_architecture#alert-notifications), which handles alert routing by level and formatting of alert messages to include links to [documentation](#documentation-generation) and [dashboards](#grafana-integration). Learn more about using Sourcegraph alerting in the [alerting documentation](https://docs.sourcegraph.com/admin/observability/alerting).
|
||||
The generator's [Prometheus integration](#prometheus-integration) is a critical part of [Sourcegraph's alerting capabilities](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_architecture#alert-notifications), which handle alert routing by level and formatting of alert messages to include links to [documentation](#documentation-generation) and [dashboards](#grafana-integration).
|
||||
Learn more about using Sourcegraph alerting in the [alerting documentation](https://docs.sourcegraph.com/admin/observability/alerting).
|
||||
This is possible due to the labels generated by the [Prometheus integration](#prometheus-integration).
|
||||
|
||||
At Sourcegraph, routing based on team ownership (as defined by [`ObservableOwner`](./monitoring/README.md#type-observableowner)) is used to route customer support requests and [on-call events through OpsGenie](https://about.sourcegraph.com/handbook/engineering/incidents/on_call).
|
||||
At Sourcegraph, extended routing based on team ownership (as defined by [`ObservableOwner`](./monitoring/README.md#type-observableowner)) is also used to route customer support requests and [on-call events through OpsGenie](https://about.sourcegraph.com/handbook/engineering/incidents/on_call).
|
||||
|
||||
## Development
|
||||
|
||||
@ -62,6 +81,6 @@ The Sourcegraph monitoring generator consists of three components:
|
||||
This is where all the service monitoring definitions live.
|
||||
If you are editing monitoring, this is probably where you want to look - see the [Sourcegraph monitoring developer guide](https://about.sourcegraph.com/handbook/engineering/observability/monitoring).
|
||||
- _Generator_, defined in the nested [`monitoring/monitoring` package](./monitoring/README.md).
|
||||
This is where the API for service monitoring definitions is defined, as well as the generator code.
|
||||
This is where the API for service monitoring definitions is defined, as well as the generator code that provides the [above features](#features).
|
||||
|
||||
All features and capabilities developed for the generator should align with the [Sourcegraph monitoring pillars](https://about.sourcegraph.com/handbook/engineering/observability/monitoring_pillars).
|
||||
|
||||
@ -48,24 +48,24 @@ func ExecutorQueue() *monitoring.Container {
|
||||
},
|
||||
{
|
||||
{
|
||||
Name: "codeintel_active_executors",
|
||||
Description: "active executors processing codeintel jobs",
|
||||
Query: `max(src_apiworker_apiserver_executors_total{queue="codeintel"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("executors"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_active_executors",
|
||||
Description: "active executors processing codeintel jobs",
|
||||
Query: `max(src_apiworker_apiserver_executors_total{queue="codeintel"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("executors"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_active_jobs",
|
||||
Description: "active jobs",
|
||||
Query: `sum(src_apiworker_apiserver_jobs_total{queue="codeintel"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_active_jobs",
|
||||
Description: "active jobs",
|
||||
Query: `sum(src_apiworker_apiserver_jobs_total{queue="codeintel"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
},
|
||||
},
|
||||
|
||||
@ -565,34 +565,34 @@ func Frontend() *monitoring.Container {
|
||||
PossibleSolutions: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_upload_records_removed",
|
||||
Description: "upload records expired or deleted every 5m",
|
||||
Query: `sum(increase(src_codeintel_background_upload_records_removed_total{job=~"(sourcegraph-)?frontend"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("uploads removed"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_upload_records_removed",
|
||||
Description: "upload records expired or deleted every 5m",
|
||||
Query: `sum(increase(src_codeintel_background_upload_records_removed_total{job=~"(sourcegraph-)?frontend"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("uploads removed"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_index_records_removed",
|
||||
Description: "index records expired or deleted every 5m",
|
||||
Query: `sum(increase(src_codeintel_background_index_records_removed_total{job=~"(sourcegraph-)?frontend"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("indexes removed"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_index_records_removed",
|
||||
Description: "index records expired or deleted every 5m",
|
||||
Query: `sum(increase(src_codeintel_background_index_records_removed_total{job=~"(sourcegraph-)?frontend"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("indexes removed"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_lsif_data_removed",
|
||||
Description: "data for unreferenced upload records removed every 5m",
|
||||
Query: `sum(increase(src_codeintel_background_uploads_purged_total{job=~"(sourcegraph-)?frontend"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("uploads purged"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_lsif_data_removed",
|
||||
Description: "data for unreferenced upload records removed every 5m",
|
||||
Query: `sum(increase(src_codeintel_background_uploads_purged_total{job=~"(sourcegraph-)?frontend"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("uploads purged"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
},
|
||||
{
|
||||
@ -645,14 +645,14 @@ func Frontend() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
{
|
||||
Name: "codeintel_indexing_99th_percentile_duration",
|
||||
Description: "99th percentile successful indexing operation duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_codeintel_indexing_duration_seconds_bucket{job=~"(sourcegraph-)?frontend"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("operations").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_indexing_99th_percentile_duration",
|
||||
Description: "99th percentile successful indexing operation duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_codeintel_indexing_duration_seconds_bucket{job=~"(sourcegraph-)?frontend"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("operations").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_indexing_errors",
|
||||
|
||||
@ -33,13 +33,16 @@ func GitServer() *monitoring.Container {
|
||||
},
|
||||
{
|
||||
Name: "running_git_commands",
|
||||
Description: "running git commands (signals load)",
|
||||
Description: "running git commands",
|
||||
Query: "max(src_gitserver_exec_running)",
|
||||
DataMayNotExist: true,
|
||||
Warning: monitoring.Alert().GreaterOrEqual(50).For(2 * time.Minute),
|
||||
Critical: monitoring.Alert().GreaterOrEqual(100).For(5 * time.Minute),
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("running commands"),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: `
|
||||
A high value signals load.
|
||||
`,
|
||||
PossibleSolutions: `
|
||||
- **Check if the problem may be an intermittent and temporary peak** using the "Container monitoring" section at the bottom of the Git Server dashboard.
|
||||
- **Single container deployments:** Consider upgrading to a [Docker Compose deployment](../install/docker-compose/migrate.md) which offers better scalability and resource isolation.
|
||||
@ -77,16 +80,19 @@ func GitServer() *monitoring.Container {
|
||||
}, {
|
||||
{
|
||||
Name: "echo_command_duration_test",
|
||||
Description: "echo command duration test",
|
||||
Description: "echo test command duration",
|
||||
Query: "max(src_gitserver_echo_duration_seconds)",
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("running commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: `
|
||||
- **Query a graph for individual commands** using 'sum by (cmd)(src_gitserver_exec_running)' in Grafana ('/-/debug/grafana') to see if a command might be spiking in frequency.
|
||||
- **Check if the problem may be an intermittent and temporary peak** using the "Container monitoring" section at the bottom of the Git Server dashboard.
|
||||
- **Single container deployments:** Consider upgrading to a [Docker Compose deployment](../install/docker-compose/migrate.md) which offers better scalability and resource isolation.
|
||||
Interpretation: `
|
||||
A high value here likely indicates a problem, especially if consistently high.
|
||||
You can query for individual commands using 'sum by (cmd)(src_gitserver_exec_running)' in Grafana ('/-/debug/grafana') to see if a specific Git Server command might be spiking in frequency.
|
||||
|
||||
If this value is consistently high, consider the following:
|
||||
|
||||
- **Single container deployments:** Upgrade to a [Docker Compose deployment](../install/docker-compose/migrate.md) which offers better scalability and resource isolation.
|
||||
- **Kubernetes and Docker Compose:** Check that you are running a similar number of git server replicas and that their CPU/memory limits are allocated according to what is shown in the [Sourcegraph resource estimator](../install/resource_estimator.md).
|
||||
`,
|
||||
},
|
||||
@ -126,15 +132,15 @@ func GitServer() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
shared.ProvisioningCPUUsageLongTerm("gitserver", monitoring.ObservableOwnerCloud),
|
||||
// gitserver generally uses up all the memory it gets, so
|
||||
// alerting on high memory usage is not very useful
|
||||
shared.ProvisioningMemoryUsageLongTerm("gitserver", monitoring.ObservableOwnerCloud).WithNoAlerts(),
|
||||
shared.ProvisioningMemoryUsageLongTerm("gitserver", monitoring.ObservableOwnerCloud).WithNoAlerts(`
|
||||
Git Server is expected to use up all the memory it is provided.
|
||||
`),
|
||||
},
|
||||
{
|
||||
shared.ProvisioningCPUUsageShortTerm("gitserver", monitoring.ObservableOwnerCloud),
|
||||
// gitserver generally uses up all the memory it gets, so
|
||||
// alerting on high memory usage is not very useful
|
||||
shared.ProvisioningMemoryUsageShortTerm("gitserver", monitoring.ObservableOwnerCloud).WithNoAlerts(),
|
||||
shared.ProvisioningMemoryUsageShortTerm("gitserver", monitoring.ObservableOwnerCloud).WithNoAlerts(`
|
||||
Git Server is expected to use up all the memory it is provided.
|
||||
`),
|
||||
},
|
||||
},
|
||||
},
|
||||
|
||||
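The hunk above replaces code comments with an interpretation string passed to `WithNoAlerts`, so the explanation can appear in generated documentation. The sketch below shows one way such a method could behave. It is an illustration under stated assumptions, not the actual `monitoring` package: the `Observable` and `alertDefinition` types are trimmed-down, hypothetical stand-ins, and clearing the alerts and `PossibleSolutions` while setting `Interpretation` is assumed behavior.

```go
package main

import "fmt"

// alertDefinition is a hypothetical placeholder for the package's alert type.
type alertDefinition struct {
	GreaterOrEqual float64
}

// Observable is a reduced, hypothetical stand-in that keeps only the fields
// visible in the diff and relevant to this sketch.
type Observable struct {
	Name              string
	Warning           *alertDefinition
	Critical          *alertDefinition
	NoAlert           bool
	PossibleSolutions string
	Interpretation    string
}

// WithNoAlerts returns a copy of the observable with alerting disabled.
// Assumption: an observable without alerts has nothing to "solve", so it
// carries an Interpretation instead of PossibleSolutions.
func (o Observable) WithNoAlerts(interpretation string) Observable {
	o.Warning = nil
	o.Critical = nil
	o.NoAlert = true
	o.PossibleSolutions = ""
	o.Interpretation = interpretation
	return o
}

func main() {
	// Hypothetical shared observable with a warning alert attached.
	shared := Observable{
		Name:              "provisioning_container_memory_usage_long_term",
		Warning:           &alertDefinition{GreaterOrEqual: 80},
		PossibleSolutions: "none",
	}
	quiet := shared.WithNoAlerts("Git Server is expected to use up all the memory it is provided.")
	fmt.Printf("%s: NoAlert=%v, Interpretation=%q\n", quiet.Name, quiet.NoAlert, quiet.Interpretation)
}
```

Because the method uses a value receiver, the shared definition is left untouched, which would let one base observable be reused across services with and without alerts.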
@ -16,24 +16,24 @@ func PreciseCodeIntelIndexer() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
{
|
||||
Name: "codeintel_job_99th_percentile_duration",
|
||||
Description: "99th percentile successful job duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_executor_queue_processor_duration_seconds_bucket{queue="codeintel"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_job_99th_percentile_duration",
|
||||
Description: "99th percentile successful job duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_executor_queue_processor_duration_seconds_bucket{queue="codeintel"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_active_handlers",
|
||||
Description: "active handlers processing jobs",
|
||||
Query: `sum(src_executor_queue_processor_handlers{queue="codeintel"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("handlers"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "codeintel_active_handlers",
|
||||
Description: "active handlers processing jobs",
|
||||
Query: `sum(src_executor_queue_processor_handlers{queue="codeintel"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("handlers"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "codeintel_job_errors",
|
||||
@ -80,14 +80,14 @@ func PreciseCodeIntelIndexer() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
{
|
||||
Name: "executor_setup_command_99th_percentile_duration",
|
||||
Description: "99th percentile successful setup command duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_apiworker_command_duration_seconds_bucket{job="sourcegraph-code-intel-indexers", op=~"setup.*"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "executor_setup_command_99th_percentile_duration",
|
||||
Description: "99th percentile successful setup command duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_apiworker_command_duration_seconds_bucket{job="sourcegraph-code-intel-indexers", op=~"setup.*"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "executor_setup_command_errors",
|
||||
@ -102,14 +102,14 @@ func PreciseCodeIntelIndexer() *monitoring.Container {
|
||||
},
|
||||
{
|
||||
{
|
||||
Name: "executor_exec_command_99th_percentile_duration",
|
||||
Description: "99th percentile successful exec command duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_apiworker_command_duration_seconds_bucket{job="sourcegraph-code-intel-indexers", op=~"exec.*"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "executor_exec_command_99th_percentile_duration",
|
||||
Description: "99th percentile successful exec command duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_apiworker_command_duration_seconds_bucket{job="sourcegraph-code-intel-indexers", op=~"exec.*"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "executor_exec_command_errors",
|
||||
@ -124,14 +124,14 @@ func PreciseCodeIntelIndexer() *monitoring.Container {
|
||||
},
|
||||
{
|
||||
{
|
||||
Name: "executor_teardown_command_99th_percentile_duration",
|
||||
Description: "99th percentile successful teardown command duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_apiworker_teardown_command_duration_seconds_bucket{job="sourcegraph-code-intel-indexers", op=~"teardown.*"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "executor_teardown_command_99th_percentile_duration",
|
||||
Description: "99th percentile successful teardown command duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_apiworker_teardown_command_duration_seconds_bucket{job="sourcegraph-code-intel-indexers", op=~"teardown.*"}[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("commands").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "executor_teardown_command_errors",
|
||||
|
||||
@ -48,24 +48,24 @@ func PreciseCodeIntelWorker() *monitoring.Container {
|
||||
},
|
||||
{
|
||||
{
|
||||
Name: "active_workers",
|
||||
Description: "active workers processing uploads",
|
||||
Query: `max(up{job="precise-code-intel-worker"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("workers"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "active_workers",
|
||||
Description: "active workers processing uploads",
|
||||
Query: `max(up{job="precise-code-intel-worker"})`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("workers"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "active_jobs",
|
||||
Description: "active jobs",
|
||||
Query: `sum(src_codeintel_upload_queue_processor_handlers)`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "active_jobs",
|
||||
Description: "active jobs",
|
||||
Query: `sum(src_codeintel_upload_queue_processor_handlers)`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs"),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
},
|
||||
},
|
||||
@ -75,14 +75,14 @@ func PreciseCodeIntelWorker() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
{
|
||||
Name: "job_99th_percentile_duration",
|
||||
Description: "99th percentile successful job duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_codeintel_upload_queue_processor_duration_seconds_bucket[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
PossibleSolutions: "none",
|
||||
Name: "job_99th_percentile_duration",
|
||||
Description: "99th percentile successful job duration over 5m",
|
||||
Query: `histogram_quantile(0.99, sum by (le)(rate(src_codeintel_upload_queue_processor_duration_seconds_bucket[5m])))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("jobs").Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCodeIntel,
|
||||
Interpretation: "none",
|
||||
},
|
||||
},
|
||||
},
|
||||
|
||||
@ -30,14 +30,17 @@ func RepoUpdater() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
{
|
||||
Name: "syncer_sync_last_time",
|
||||
Description: "time since last sync",
|
||||
Query: `max(timestamp(vector(time()))) - max(src_repoupdater_syncer_sync_last_time)`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: "Make sure there are external services added with valid tokens",
|
||||
Name: "syncer_sync_last_time",
|
||||
Description: "time since last sync",
|
||||
Query: `max(timestamp(vector(time()))) - max(src_repoupdater_syncer_sync_last_time)`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().Unit(monitoring.Seconds),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: `
|
||||
A high value here indicates issues synchronizing repository permissions.
|
||||
If the value is persistently high, make sure all external services have valid tokens.
|
||||
`,
|
||||
},
|
||||
{
|
||||
Name: "src_repoupdater_max_sync_backoff",
|
||||
@ -165,14 +168,17 @@ func RepoUpdater() *monitoring.Container {
|
||||
PossibleSolutions: "Check repo-updater logs. This is expected to fire if there are no user added code hosts",
|
||||
},
|
||||
{
|
||||
Name: "sched_manual_fetch",
|
||||
Description: "repositories scheduled due to user traffic",
|
||||
Query: `sum(rate(src_repoupdater_sched_manual_fetch[1m]))`,
|
||||
NoAlert: true,
|
||||
DataMayNotExist: true,
|
||||
PanelOptions: monitoring.PanelOptions().Unit(monitoring.Number),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: "Check repo-updater logs. This is expected to fire if there are no user added code hosts",
|
||||
Name: "sched_manual_fetch",
|
||||
Description: "repositories scheduled due to user traffic",
|
||||
Query: `sum(rate(src_repoupdater_sched_manual_fetch[1m]))`,
|
||||
NoAlert: true,
|
||||
DataMayNotExist: true,
|
||||
PanelOptions: monitoring.PanelOptions().Unit(monitoring.Number),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: `
|
||||
Check repo-updater logs if this value is persistently high.
|
||||
This does not indicate anything if there are no user added code hosts.
|
||||
`,
|
||||
},
|
||||
},
|
||||
{
|
||||
|
||||
@ -73,8 +73,9 @@ var (
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("{{name}}"),
|
||||
Owner: owner,
|
||||
PossibleSolutions: `
|
||||
- Refer to your OS or cloud provider's documentation for how to increase inodes.
|
||||
- **Kubernetes:** consider provisioning more machines with less resources.`,
|
||||
- Refer to your OS or cloud provider's documentation for how to increase inodes.
|
||||
- **Kubernetes:** consider provisioning more machines with less resources.
|
||||
`,
|
||||
}
|
||||
}
|
||||
)
|
||||
|
||||
@ -16,46 +16,46 @@ func SyntectServer() *monitoring.Container {
|
||||
Rows: []monitoring.Row{
|
||||
{
|
||||
{
|
||||
Name: "syntax_highlighting_errors",
|
||||
Description: "syntax highlighting errors every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="error"}[5m])) / sum(increase(src_syntax_highlighting_requests[5m])) * 100`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("error").Unit(monitoring.Percentage),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: "none",
|
||||
Name: "syntax_highlighting_errors",
|
||||
Description: "syntax highlighting errors every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="error"}[5m])) / sum(increase(src_syntax_highlighting_requests[5m])) * 100`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("error").Unit(monitoring.Percentage),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "syntax_highlighting_timeouts",
|
||||
Description: "syntax highlighting timeouts every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="timeout"}[5m])) / sum(increase(src_syntax_highlighting_requests[5m])) * 100`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("timeout").Unit(monitoring.Percentage),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: "none",
|
||||
Name: "syntax_highlighting_timeouts",
|
||||
Description: "syntax highlighting timeouts every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="timeout"}[5m])) / sum(increase(src_syntax_highlighting_requests[5m])) * 100`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("timeout").Unit(monitoring.Percentage),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: "none",
|
||||
},
|
||||
},
|
||||
{
|
||||
{
|
||||
Name: "syntax_highlighting_panics",
|
||||
Description: "syntax highlighting panics every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="panic"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("panic"),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: "none",
|
||||
Name: "syntax_highlighting_panics",
|
||||
Description: "syntax highlighting panics every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="panic"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("panic"),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: "none",
|
||||
},
|
||||
{
|
||||
Name: "syntax_highlighting_worker_deaths",
|
||||
Description: "syntax highlighter worker deaths every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="hss_worker_timeout"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("worker death"),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
PossibleSolutions: "none",
|
||||
Name: "syntax_highlighting_worker_deaths",
|
||||
Description: "syntax highlighter worker deaths every 5m",
|
||||
Query: `sum(increase(src_syntax_highlighting_requests{status="hss_worker_timeout"}[5m]))`,
|
||||
DataMayNotExist: true,
|
||||
NoAlert: true,
|
||||
PanelOptions: monitoring.PanelOptions().LegendFormat("worker death"),
|
||||
Owner: monitoring.ObservableOwnerCloud,
|
||||
Interpretation: "none",
|
||||
},
|
||||
},
|
||||
},
|
||||
|
||||
@ -20,7 +20,7 @@ To learn more about the generator\, see the top\-level program: https://github.c
|
||||
- [type Group](<#type-group>)
|
||||
- [type Observable](<#type-observable>)
|
||||
- [func (o Observable) WithCritical(a *ObservableAlertDefinition) Observable](<#func-observable-withcritical>)
|
||||
- [func (o Observable) WithNoAlerts() Observable](<#func-observable-withnoalerts>)
|
||||
- [func (o Observable) WithNoAlerts(interpretation string) Observable](<#func-observable-withnoalerts>)
|
||||
- [func (o Observable) WithWarning(a *ObservableAlertDefinition) Observable](<#func-observable-withwarning>)
|
||||
- [type ObservableAlertDefinition](<#type-observablealertdefinition>)
|
||||
- [func Alert() *ObservableAlertDefinition](<#func-alert>)
|
||||
@ -40,7 +40,7 @@ To learn more about the generator\, see the top\-level program: https://github.c
|
||||
- [type UnitType](<#type-unittype>)
|
||||
|
||||
|
||||
## func [Generate](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/generator.go#L45>)
|
||||
## func [Generate](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/generator.go#L40>)
|
||||
|
||||
```go
|
||||
func Generate(logger log15.Logger, opts GenerateOptions, containers ...*Container) error
|
||||
@ -71,7 +71,7 @@ type Container struct {
|
||||
}
|
||||
```
|
||||
|
||||
## type [GenerateOptions](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/generator.go#L30-L42>)
|
||||
## type [GenerateOptions](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/generator.go#L25-L37>)
|
||||
|
||||
GenerateOptions declares options for the monitoring generator\.
|
||||
|
||||
@ -91,7 +91,7 @@ type GenerateOptions struct {
|
||||
}
|
||||
```
|
||||
|
||||
## type [Group](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L447-L462>)
|
||||
## type [Group](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L461-L476>)
|
||||
|
||||
Group describes a group of observable information about a container\.
|
||||
|
||||
@ -116,7 +116,7 @@ type Group struct {
|
||||
}
|
||||
```
|
||||
|
||||
## type [Observable](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L510-L608>)
|
||||
## type [Observable](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L524-L642>)
|
||||
|
||||
Observable describes a metric about a container that can be observed\. For example\, memory usage\.
|
||||
|
||||
@ -156,7 +156,7 @@ type Observable struct {
|
||||
//
|
||||
Description string
|
||||
|
||||
// Owner indicates the team that owns any alerts associated with this Observable.
|
||||
// Owner indicates the team that owns this Observable (including its alerts and maintenance).
|
||||
Owner ObservableOwner
|
||||
|
||||
// Query is the actual Prometheus query that should be observed.
|
||||
@ -183,17 +183,23 @@ type Observable struct {
|
||||
DataMayNotBeNaN bool
|
||||
|
||||
// Warning and Critical alert definitions.
|
||||
// Consider adding at least a Warning or Critical alert to each Observable to make it easy to
|
||||
// identify when the target of this metric is missbehaving.
|
||||
// Consider adding at least a Warning or Critical alert to each Observable to make it
|
||||
// easy to identify when the target of this metric is misbehaving. If no alerts are
|
||||
// provided, NoAlert must be set and Interpretation must be provided.
|
||||
Warning, Critical *ObservableAlertDefinition
|
||||
|
||||
// NoAlerts is used by Observables that don't need any alerts.
|
||||
// We want to be explicit about this to ensure alerting is considered and if we choose not to Alert,
|
||||
// its easy to identify it is an intentional behavior.
|
||||
// NoAlerts must be set by Observables that do not have any alerts.
|
||||
// This ensures the omission of alerts is intentional. If set to true, an Interpretation
|
||||
// must be provided in place of PossibleSolutions.
|
||||
NoAlert bool
|
||||
|
||||
// PossibleSolutions is Markdown describing possible solutions in the event that the alert is
|
||||
// firing. If there is no clear potential resolution, "none" must be explicitly stated.
|
||||
// PossibleSolutions is Markdown describing possible solutions in the event that the
|
||||
// alert is firing. This field is not required if no alerts are attached to this Observable.
|
||||
// If there is no clear potential resolution or there is no alert configured, "none"
|
||||
// must be explicitly stated.
|
||||
//
|
||||
// Use the Interpretation field for additional guidance on understanding this Observable that isn't directly related to solving it.
|
||||
// it, the Interpretation field can be provided as well.
|
||||
//
|
||||
// Contacting support should not be mentioned as part of a possible solution, as it is
|
||||
// communicated elsewhere.
|
||||
@ -216,33 +222,49 @@ type Observable struct {
|
||||
// 2. The indentation in the string literal is removed (based on the last line).
|
||||
// 3. Single quotes become backticks.
|
||||
// 4. The last line (which is all indention) is removed.
|
||||
// 5. Non-list items are converted to a list.
|
||||
//
|
||||
PossibleSolutions string
|
||||
|
||||
// Interpretation is Markdown that can serve as a reference for interpreting this
|
||||
// observable. For example, Interpretation could provide guidance on what sort of
|
||||
// patterns to look for in the observable's graph and document why this observable is
|
||||
// useful.
|
||||
//
|
||||
// If no alerts are configured for an observable, this field is required. If the
|
||||
// Description is sufficient to capture what this Observable describes, "none" must be
|
||||
// explicitly stated.
|
||||
//
|
||||
// To make writing the Markdown more friendly in Go, string literal processing as
|
||||
// PossibleSolutions is provided, though the output is not converted to a list.
|
||||
Interpretation string
|
||||
|
||||
// PanelOptions describes some options for how to render the metric in the Grafana panel.
|
||||
PanelOptions ObservablePanelOptions
|
||||
}
|
||||
```
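
The field comments above describe how `Warning`, `Critical`, `NoAlert`, `PossibleSolutions`, and `Interpretation` depend on each other. The following is a minimal sketch of those rules, not the package's actual validation code: `alertDef`, `observable`, and `checkAlertDocs` are hypothetical, simplified stand-ins, and rejecting an Observable that sets `NoAlert` while also defining alerts is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// alertDef and observable are hypothetical, simplified stand-ins for
// ObservableAlertDefinition and Observable; only the relevant fields are kept.
type alertDef struct{}

type observable struct {
	Name              string
	Warning, Critical *alertDef
	NoAlert           bool
	PossibleSolutions string
	Interpretation    string
}

// checkAlertDocs is a hypothetical helper that applies the documented rules.
func checkAlertDocs(o observable) error {
	hasAlerts := o.Warning != nil || o.Critical != nil
	switch {
	case hasAlerts && o.NoAlert:
		// Assumption: setting NoAlert alongside alerts is a contradiction.
		return errors.New("NoAlert is set but alerts are defined")
	case hasAlerts && o.PossibleSolutions == "":
		return errors.New(`PossibleSolutions must be provided, or set to "none"`)
	case !hasAlerts && !o.NoAlert:
		return errors.New("NoAlert must be set when no alerts are defined")
	case !hasAlerts && o.Interpretation == "":
		return errors.New(`Interpretation must be provided, or set to "none"`)
	}
	return nil
}

func main() {
	o := observable{
		Name:           "syntax_highlighting_panics",
		NoAlert:        true,
		Interpretation: "none",
	}
	// A nil error means the Observable satisfies the documented rules.
	fmt.Println(checkAlertDocs(o))
}
```

In this sketch, an Observable with no alerts passes only when `NoAlert` is set and `Interpretation` is non-empty, which matches the `NoAlert: true` plus `Interpretation: "none"` pattern used by the syntax highlighting panels.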
|
||||
|
||||
### func \(Observable\) [WithCritical](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L615>)
|
||||
### func \(Observable\) [WithCritical](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L649>)
|
||||
|
||||
```go
|
||||
func (o Observable) WithCritical(a *ObservableAlertDefinition) Observable
|
||||
```
|
||||
|
||||
### func \(Observable\) [WithNoAlerts](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L620>)
|
||||
### func \(Observable\) [WithNoAlerts](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L655>)
|
||||
|
||||
```go
|
||||
func (o Observable) WithNoAlerts() Observable
|
||||
func (o Observable) WithNoAlerts(interpretation string) Observable
|
||||
```
|
||||
|
||||
### func \(Observable\) [WithWarning](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L610>)
|
||||
WithNoAlerts disables alerting on this Observable and sets the given interpretation instead\.
|
||||
|
||||
### func \(Observable\) [WithWarning](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L644>)
|
||||
|
||||
```go
|
||||
func (o Observable) WithWarning(a *ObservableAlertDefinition) Observable
|
||||
```
|
||||
|
||||
## type [ObservableAlertDefinition](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L661-L673>)
|
||||
## type [ObservableAlertDefinition](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L717-L729>)
|
||||
|
||||
ObservableAlertDefinition defines when an alert would be considered firing\.
|
||||
|
||||
@ -252,7 +274,7 @@ type ObservableAlertDefinition struct {
|
||||
}
|
||||
```
|
||||
|
||||
### func [Alert](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L656>)
|
||||
### func [Alert](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L712>)
|
||||
|
||||
```go
|
||||
func Alert() *ObservableAlertDefinition
|
||||
@ -260,25 +282,25 @@ func Alert() *ObservableAlertDefinition
|
||||
|
||||
Alert provides a builder for defining alerting on an Observable\.
|
||||
|
||||
### func \(\*ObservableAlertDefinition\) [For](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L685>)
|
||||
### func \(\*ObservableAlertDefinition\) [For](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L741>)
|
||||
|
||||
```go
|
||||
func (a *ObservableAlertDefinition) For(d time.Duration) *ObservableAlertDefinition
|
||||
```
|
||||
|
||||
### func \(\*ObservableAlertDefinition\) [GreaterOrEqual](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L675>)
|
||||
### func \(\*ObservableAlertDefinition\) [GreaterOrEqual](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L731>)
|
||||
|
||||
```go
|
||||
func (a *ObservableAlertDefinition) GreaterOrEqual(f float64) *ObservableAlertDefinition
|
||||
```
|
||||
|
||||
### func \(\*ObservableAlertDefinition\) [LessOrEqual](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L680>)
|
||||
### func \(\*ObservableAlertDefinition\) [LessOrEqual](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L736>)
|
||||
|
||||
```go
|
||||
func (a *ObservableAlertDefinition) LessOrEqual(f float64) *ObservableAlertDefinition
|
||||
```
|
||||
|
||||
## type [ObservableOwner](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L495>)
|
||||
## type [ObservableOwner](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L509>)
|
||||
|
||||
ObservableOwner denotes a team that owns an Observable\. The current teams are described in the handbook: https://about.sourcegraph.com/company/team/org_chart#engineering
|
||||
|
||||
@ -298,7 +320,7 @@ const (
|
||||
)
|
||||
```
|
||||
|
||||
## type [ObservablePanelOptions](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L740-L746>)
|
||||
## type [ObservablePanelOptions](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L796-L802>)
|
||||
|
||||
ObservablePanelOptions declares options for visualizing an Observable\.
|
||||
|
||||
@ -308,7 +330,7 @@ type ObservablePanelOptions struct {
|
||||
}
|
||||
```
|
||||
|
||||
### func [PanelOptions](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L749>)
|
||||
### func [PanelOptions](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L805>)
|
||||
|
||||
```go
|
||||
func PanelOptions() ObservablePanelOptions
|
||||
@ -316,7 +338,7 @@ func PanelOptions() ObservablePanelOptions
|
||||
|
||||
PanelOptions provides a builder for customizing an Observable visualization\.
|
||||
|
||||
### func \(ObservablePanelOptions\) [Interval](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L786>)
|
||||
### func \(ObservablePanelOptions\) [Interval](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L842>)
|
||||
|
||||
```go
|
||||
func (p ObservablePanelOptions) Interval(ms int) ObservablePanelOptions
|
||||
@ -324,7 +346,7 @@ func (p ObservablePanelOptions) Interval(ms int) ObservablePanelOptions
|
||||
|
||||
Interval declares the panel's interval in milliseconds\.
|
||||
|
||||
### func \(ObservablePanelOptions\) [LegendFormat](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L774>)
|
||||
### func \(ObservablePanelOptions\) [LegendFormat](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L830>)
|
||||
|
||||
```go
|
||||
func (p ObservablePanelOptions) LegendFormat(format string) ObservablePanelOptions
|
||||
@ -332,7 +354,7 @@ func (p ObservablePanelOptions) LegendFormat(format string) ObservablePanelOptio
|
||||
|
||||
LegendFormat sets the panel's legend format\, which may use Go template strings to select labels from the Prometheus query\.
|
||||
|
||||
### func \(ObservablePanelOptions\) [Max](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L767>)
|
||||
### func \(ObservablePanelOptions\) [Max](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L823>)
|
||||
|
||||
```go
|
||||
func (p ObservablePanelOptions) Max(max float64) ObservablePanelOptions
|
||||
@ -340,7 +362,7 @@ func (p ObservablePanelOptions) Max(max float64) ObservablePanelOptions
|
||||
|
||||
Max sets the maximum value of the Y axis on the panel\. The default is auto\.
|
||||
|
||||
### func \(ObservablePanelOptions\) [Min](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L752>)
|
||||
### func \(ObservablePanelOptions\) [Min](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L808>)
|
||||
|
||||
```go
|
||||
func (p ObservablePanelOptions) Min(min float64) ObservablePanelOptions
|
||||
@ -348,7 +370,7 @@ func (p ObservablePanelOptions) Min(min float64) ObservablePanelOptions
|
||||
|
||||
Min sets the minimum value of the Y axis on the panel\. The default is zero\.
|
||||
|
||||
### func \(ObservablePanelOptions\) [MinAuto](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L761>)
|
||||
### func \(ObservablePanelOptions\) [MinAuto](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L817>)
|
||||
|
||||
```go
|
||||
func (p ObservablePanelOptions) MinAuto() ObservablePanelOptions
|
||||
@ -358,7 +380,7 @@ Min sets the minimum value of the Y axis on the panel to auto\, instead of the d
|
||||
|
||||
This is generally only useful if trying to show negative numbers\.
|
||||
|
||||
### func \(ObservablePanelOptions\) [Unit](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L780>)
|
||||
### func \(ObservablePanelOptions\) [Unit](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L836>)
|
||||
|
||||
```go
|
||||
func (p ObservablePanelOptions) Unit(t UnitType) ObservablePanelOptions
|
||||
@ -366,7 +388,7 @@ func (p ObservablePanelOptions) Unit(t UnitType) ObservablePanelOptions
|
||||
|
||||
Unit sets the panel's Y axis unit type\.
|
||||
|
||||
## type [Row](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L479>)
|
||||
## type [Row](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L493>)
|
||||
|
||||
Row of observable metrics\.
|
||||
|
||||
@ -376,7 +398,7 @@ These correspond to a row of Grafana graphs\.
|
||||
type Row []Observable
|
||||
```
|
||||
|
||||
## type [UnitType](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L695>)
|
||||
## type [UnitType](<https://github.com/sourcegraph/sourcegraph/blob/main/monitoring/monitoring/monitoring.go#L751>)
|
||||
|
||||
UnitType for controlling the unit type display on graphs\.
|
||||
|
||||
|
||||
@ -3,73 +3,167 @@ package monitoring
|
||||
import (
|
||||
"bytes"
|
||||
"fmt"
|
||||
"io"
|
||||
"regexp"
|
||||
"strings"
|
||||
)
|
||||
|
||||
func renderDocumentation(containers []*Container) ([]byte, error) {
|
||||
var b bytes.Buffer
|
||||
fmt.Fprintf(&b, `# Alert solutions
|
||||
const (
|
||||
canonicalAlertSolutionsURL = "https://docs.sourcegraph.com/admin/observability/alert_solutions"
|
||||
canonicalDashboardsDocsURL = "https://docs.sourcegraph.com/admin/observability/dashboards"
|
||||
|
||||
This document contains possible solutions for when you find alerts are firing in Sourcegraph's monitoring.
|
||||
If your alert isn't mentioned here, or if the solution doesn't help, [contact us](mailto:support@sourcegraph.com)
|
||||
for assistance.
|
||||
alertSolutionsFile = "alert_solutions.md"
|
||||
dashboardsDocsFile = "dashboards.md"
|
||||
)
|
||||
|
||||
To learn more about Sourcegraph's alerting, see [our alerting documentation](https://docs.sourcegraph.com/admin/observability/alerting).
|
||||
const alertSolutionsHeader = `# Alert solutions
|
||||
|
||||
<!-- DO NOT EDIT: generated via: go generate ./monitoring -->
|
||||
|
||||
`)
|
||||
This document contains possible solutions for when you find alerts are firing in Sourcegraph's monitoring.
|
||||
If your alert isn't mentioned here, or if the solution doesn't help, [contact us](mailto:support@sourcegraph.com) for assistance.
|
||||
|
||||
To learn more about Sourcegraph's alerting and how to set up alerts, see [our alerting guide](https://docs.sourcegraph.com/admin/observability/alerting).
|
||||
|
||||
`
|
||||
|
||||
const dashboardsHeader = `# Dashboards reference
|
||||
|
||||
<!-- DO NOT EDIT: generated via: go generate ./monitoring -->
|
||||
|
||||
This document contains a complete reference on Sourcegraph's available dashboards, as well as details on how to interpret the panels and metrics.
|
||||
|
||||
To learn more about Sourcegraph's metrics and how to view these dashboards, see [our metrics guide](https://docs.sourcegraph.com/admin/observability/metrics).
|
||||
|
||||
`
|
||||
|
||||
func fprintSubtitle(w io.Writer, text string) {
|
||||
fmt.Fprintf(w, "<p class=\"subtitle\">%s</p>\n\n", text)
|
||||
}
|
||||
|
||||
// Write a standardized Observable header that one can reliably generate an anchor link for.
|
||||
//
|
||||
// See `observableAnchor`.
|
||||
func fprintObservableHeader(w io.Writer, c *Container, o *Observable, headerLevel int) {
|
||||
fmt.Fprint(w, strings.Repeat("#", headerLevel))
|
||||
fmt.Fprintf(w, " %s: %s\n\n", c.Name, o.Name)
|
||||
}
|
||||
|
||||
var observableDocAnchorRemoveRegexp = regexp.MustCompile("(_low|_high)$")
|
||||
|
||||
// Create an anchor link that matches `fprintObservableHeader`
|
||||
//
|
||||
// Must match Prometheus template in `docker-images/prometheus/cmd/prom-wrapper/receivers.go`
|
||||
func observableDocAnchor(c *Container, o Observable) string {
|
||||
observableAnchor := strings.ReplaceAll(observableDocAnchorRemoveRegexp.ReplaceAllString(o.Name, ""), "_", "-")
|
||||
return fmt.Sprintf("%s-%s", c.Name, observableAnchor)
|
||||
}
|
||||
|
||||
type documentation struct {
|
||||
alertSolutions bytes.Buffer
|
||||
dashboards bytes.Buffer
|
||||
}
|
||||
|
||||
func renderDocumentation(containers []*Container) (*documentation, error) {
|
||||
var docs documentation
|
||||
|
||||
fmt.Fprint(&docs.alertSolutions, alertSolutionsHeader)
|
||||
fmt.Fprint(&docs.dashboards, dashboardsHeader)
|
||||
|
||||
for _, c := range containers {
|
||||
fmt.Fprintf(&docs.dashboards, "## %s\n\n", c.Title)
|
||||
fprintSubtitle(&docs.dashboards, c.Description)
|
||||
|
||||
for _, g := range c.Groups {
|
||||
// the "General" group is top-level
|
||||
if g.Title != "General" {
|
||||
fmt.Fprintf(&docs.dashboards, "### %s: %s\n\n", c.Title, g.Title)
|
||||
}
|
||||
|
||||
for _, r := range g.Rows {
|
||||
for _, o := range r {
|
||||
if o.Warning == nil && o.Critical == nil {
|
||||
continue
|
||||
if err := docs.renderAlertSolutionEntry(c, o); err != nil {
|
||||
return nil, fmt.Errorf("error rendering alert solution entry %q %q: %w",
|
||||
c.Name, o.Name, err)
|
||||
}
|
||||
|
||||
fmt.Fprintf(&b, "## %s: %s\n\n", c.Name, o.Name)
|
||||
fmt.Fprintf(&b, `<p class="subtitle">%s: %s</p>`, o.Owner, o.Description)
|
||||
|
||||
// Render descriptions of various levels of this alert
|
||||
fmt.Fprintf(&b, "\n\n**Descriptions:**\n\n")
|
||||
var prometheusAlertNames []string
|
||||
for _, alert := range []struct {
|
||||
level string
|
||||
threshold *ObservableAlertDefinition
|
||||
}{
|
||||
{level: "warning", threshold: o.Warning},
|
||||
{level: "critical", threshold: o.Critical},
|
||||
} {
|
||||
if alert.threshold.isEmpty() {
|
||||
continue
|
||||
}
|
||||
desc, err := c.alertDescription(o, alert.threshold)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
fmt.Fprintf(&b, "- _%s_\n", desc)
|
||||
prometheusAlertNames = append(prometheusAlertNames,
|
||||
fmt.Sprintf(" \"%s\"", prometheusAlertName(alert.level, c.Name, o.Name)))
|
||||
if err := docs.renderDashboardPanelEntry(c, o); err != nil {
|
||||
return nil, fmt.Errorf("error rendering dashboard panel entry %q %q: %w",
|
||||
c.Name, o.Name, err)
|
||||
}
|
||||
fmt.Fprint(&b, "\n")
|
||||
|
||||
// Render solutions for dealing with this alert
|
||||
fmt.Fprintf(&b, "**Possible solutions:**\n\n")
|
||||
if o.PossibleSolutions != "none" {
|
||||
possibleSolutions, _ := toMarkdownList(o.PossibleSolutions)
|
||||
fmt.Fprintf(&b, "%s\n", possibleSolutions)
|
||||
}
|
||||
// add silencing configuration as another solution
|
||||
fmt.Fprintf(&b, "- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:\n\n")
|
||||
fmt.Fprintf(&b, "```json\n%s\n```\n\n", fmt.Sprintf(`"observability.silenceAlerts": [
|
||||
%s
|
||||
]`, strings.Join(prometheusAlertNames, ",\n")))
|
||||
|
||||
// Render break for readability
|
||||
fmt.Fprint(&b, "<br />\n\n")
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
return b.Bytes(), nil
|
||||
|
||||
return &docs, nil
|
||||
}
|
||||
|
||||
func (d *documentation) renderAlertSolutionEntry(c *Container, o Observable) error {
|
||||
if o.Warning == nil && o.Critical == nil {
|
||||
return nil
|
||||
}
|
||||
|
||||
fprintObservableHeader(&d.alertSolutions, c, &o, 2)
|
||||
fprintSubtitle(&d.alertSolutions, fmt.Sprintf(`%s (%s)`, o.Description, o.Owner))
|
||||
|
||||
var prometheusAlertNames []string // collect names for silencing configuration
|
||||
// Render descriptions of various levels of this alert
|
||||
fmt.Fprintf(&d.alertSolutions, "**Descriptions:**\n\n")
|
||||
for _, alert := range []struct {
|
||||
level string
|
||||
threshold *ObservableAlertDefinition
|
||||
}{
|
||||
{level: "warning", threshold: o.Warning},
|
||||
{level: "critical", threshold: o.Critical},
|
||||
} {
|
||||
if alert.threshold.isEmpty() {
|
||||
continue
|
||||
}
|
||||
desc, err := c.alertDescription(o, alert.threshold)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
fmt.Fprintf(&d.alertSolutions, "- <span class=\"badge badge-%s\">%s</span> %s\n", alert.level, alert.level, desc)
|
||||
prometheusAlertNames = append(prometheusAlertNames,
|
||||
fmt.Sprintf(" \"%s\"", prometheusAlertName(alert.level, c.Name, o.Name)))
|
||||
}
|
||||
fmt.Fprint(&d.alertSolutions, "\n")
|
||||
|
||||
// Render solutions for dealing with this alert
|
||||
fmt.Fprintf(&d.alertSolutions, "**Possible solutions:**\n\n")
|
||||
if o.PossibleSolutions != "none" {
|
||||
possibleSolutions, _ := toMarkdown(o.PossibleSolutions, true)
|
||||
fmt.Fprintf(&d.alertSolutions, "%s\n", possibleSolutions)
|
||||
}
|
||||
// add link to panel information IF there are additional details available
|
||||
if o.Interpretation != "" && o.Interpretation != "none" {
|
||||
fmt.Fprintf(&d.alertSolutions, "- **Refer to the [dashboards reference](./%s#%s)** for more help interpreting this alert and metric.\n",
|
||||
dashboardsDocsFile, observableDocAnchor(c, o))
|
||||
}
|
||||
// add silencing configuration as another solution
|
||||
fmt.Fprintf(&d.alertSolutions, "- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:\n\n")
|
||||
fmt.Fprintf(&d.alertSolutions, "```json\n%s\n```\n\n", fmt.Sprintf(`"observability.silenceAlerts": [
|
||||
%s
|
||||
]`, strings.Join(prometheusAlertNames, ",\n")))
|
||||
// render break for readability
|
||||
fmt.Fprint(&d.alertSolutions, "<br />\n\n")
|
||||
return nil
|
||||
}
|
||||
|
||||
func (d *documentation) renderDashboardPanelEntry(c *Container, o Observable) error {
|
||||
fprintObservableHeader(&d.dashboards, c, &o, 4)
|
||||
fmt.Fprintf(&d.dashboards, "This %s panel indicates %s.\n\n", o.Owner, o.Description)
|
||||
// render interpretation reference if available
|
||||
if o.Interpretation != "" && o.Interpretation != "none" {
|
||||
interpretation, _ := toMarkdown(o.Interpretation, false)
|
||||
fmt.Fprintf(&d.dashboards, "%s\n\n", interpretation)
|
||||
}
|
||||
// add link to alert solutions IF there is an alert attached
|
||||
if !o.NoAlert {
|
||||
fmt.Fprintf(&d.dashboards, "Refer to the [alert solutions reference](./%s#%s) for relevant alerts.\n",
|
||||
alertSolutionsFile, observableDocAnchor(c, o))
|
||||
}
|
||||
// render break for readability
|
||||
fmt.Fprint(&d.dashboards, "\n<br />\n\n")
|
||||
return nil
|
||||
}
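
The anchor logic in `observableDocAnchor` can be tried in isolation. The following is a minimal standalone sketch that takes plain strings instead of `*Container` and `Observable`; `docAnchor` and `suffixRe` are hypothetical names, and the container and observable names passed in `main` are illustrative only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// suffixRe uses the same pattern as observableDocAnchorRemoveRegexp.
var suffixRe = regexp.MustCompile("(_low|_high)$")

// docAnchor is a hypothetical standalone version of observableDocAnchor:
// it strips a trailing _low or _high, replaces underscores with dashes,
// and prefixes the result with the container name.
func docAnchor(containerName, observableName string) string {
	trimmed := suffixRe.ReplaceAllString(observableName, "")
	return fmt.Sprintf("%s-%s", containerName, strings.ReplaceAll(trimmed, "_", "-"))
}

func main() {
	// Illustrative inputs; the container names are assumptions.
	fmt.Println(docAnchor("frontend", "syntax_highlighting_timeouts"))
	fmt.Println(docAnchor("gitserver", "disk_space_remaining_low"))
}
```

Because the `_low` and `_high` suffixes are stripped, two observables that differ only in that suffix resolve to the same anchor.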
|
||||
|
||||
@ -14,11 +14,6 @@ import (
|
||||
"gopkg.in/yaml.v2"
|
||||
)
|
||||
|
||||
const (
|
||||
alertSuffix = "_alert_rules.yml"
|
||||
alertSolutionsFile = "alert_solutions.md"
|
||||
)
|
||||
|
||||
const (
|
||||
localGrafanaURL = "http://127.0.0.1:3370"
|
||||
localGrafanaCredentials = "admin:admin"
|
||||
@ -54,7 +49,7 @@ func Generate(logger log15.Logger, opts GenerateOptions, containers ...*Containe
|
||||
|
||||
// Verify container configuration
|
||||
if err := container.validate(); err != nil {
|
||||
clog.Crit("Failed to validate container", "err", err)
|
||||
clog.Crit("Failed to validate Container", "err", err)
|
||||
return err
|
||||
}
|
||||
|
||||
@ -101,7 +96,7 @@ func Generate(logger log15.Logger, opts GenerateOptions, containers ...*Containe
|
||||
clog.Crit("Invalid rules", "err", err)
|
||||
return err
|
||||
}
|
||||
fileName := strings.Replace(container.Name, "-", "_", -1) + alertSuffix
|
||||
fileName := strings.Replace(container.Name, "-", "_", -1) + alertRulesFileSuffix
|
||||
generatedAssets = append(generatedAssets, fileName)
|
||||
err = ioutil.WriteFile(filepath.Join(opts.PrometheusDir, fileName), data, os.ModePerm)
|
||||
if err != nil {
|
||||
@ -130,17 +125,25 @@ func Generate(logger log15.Logger, opts GenerateOptions, containers ...*Containe
|
||||
// Generate documentation
|
||||
if opts.DocsDir != "" {
|
||||
logger.Debug("Rendering docs")
|
||||
solutions, err := renderDocumentation(containers)
|
||||
docs, err := renderDocumentation(containers)
|
||||
if err != nil {
|
||||
logger.Crit("Unable to generate docs", "error", err)
|
||||
logger.Crit("Failed to generate docs", "error", err)
|
||||
return err
|
||||
}
|
||||
err = ioutil.WriteFile(filepath.Join(opts.DocsDir, alertSolutionsFile), solutions, os.ModePerm)
|
||||
if err != nil {
|
||||
logger.Crit("Could not write alert solutions to output", "error", err)
|
||||
return err
|
||||
for _, docOut := range []struct {
|
||||
path string
|
||||
data []byte
|
||||
}{
|
||||
{path: filepath.Join(opts.DocsDir, alertSolutionsFile), data: docs.alertSolutions.Bytes()},
|
||||
{path: filepath.Join(opts.DocsDir, dashboardsDocsFile), data: docs.dashboards.Bytes()},
|
||||
} {
|
||||
err = ioutil.WriteFile(docOut.path, docOut.data, os.ModePerm)
|
||||
if err != nil {
|
||||
logger.Crit("Could not write docs to path", "path", docOut.path, "error", err)
|
||||
return err
|
||||
}
|
||||
generatedAssets = append(generatedAssets, docOut.path)
|
||||
}
|
||||
generatedAssets = append(generatedAssets, alertSolutionsFile)
|
||||
}
|
||||
|
||||
// Clean up dangling assets
|
||||
|
||||
@ -30,17 +30,17 @@ type Container struct {
|
||||
|
||||
func (c *Container) validate() error {
|
||||
if !isValidUID(c.Name) {
|
||||
return fmt.Errorf("Container.Name must be lowercase alphanumeric + dashes; found \"%s\"", c.Name)
|
||||
return fmt.Errorf("Name must be lowercase alphanumeric + dashes; found \"%s\"", c.Name)
|
||||
}
|
||||
if c.Title != strings.Title(c.Title) {
|
||||
return fmt.Errorf("Container.Title must be in Title Case; found \"%s\" want \"%s\"", c.Title, strings.Title(c.Title))
|
||||
return fmt.Errorf("Title must be in Title Case; found \"%s\" want \"%s\"", c.Title, strings.Title(c.Title))
|
||||
}
|
||||
if c.Description != withPeriod(c.Description) || c.Description != upperFirst(c.Description) {
|
||||
return fmt.Errorf("Container.Description must be sentence starting with an uppercas eletter and ending with period; found \"%s\"", c.Description)
|
||||
return fmt.Errorf("Description must be a sentence starting with an uppercase letter and ending with a period; found \"%s\"", c.Description)
|
||||
}
|
||||
for _, g := range c.Groups {
|
||||
for i, g := range c.Groups {
|
||||
if err := g.validate(); err != nil {
|
||||
return fmt.Errorf("group %q: %v", g.Title, err)
|
||||
return fmt.Errorf("Group %d %q: %v", i, g.Title, err)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
@ -195,6 +195,20 @@ func (c *Container) renderDashboard() *sdk.Board {
|
||||
Show: true,
|
||||
}
|
||||
|
||||
// Add reference links
|
||||
panel.Links = []sdk.Link{{
|
||||
Title: "Panel reference",
|
||||
URL: stringPtr(fmt.Sprintf("%s#%s", canonicalDashboardsDocsURL, observableDocAnchor(c, o))),
|
||||
TargetBlank: boolPtr(true),
|
||||
}}
|
||||
if !o.NoAlert {
|
||||
panel.Links = append(panel.Links, sdk.Link{
|
||||
Title: "Alerts reference",
|
||||
URL: stringPtr(fmt.Sprintf("%s#%s", canonicalAlertSolutionsURL, observableDocAnchor(c, o))),
|
||||
TargetBlank: boolPtr(true),
|
||||
})
|
||||
}
|
||||
|
||||
opt := o.PanelOptions.withDefaults()
|
||||
leftAxis := sdk.Axis{
|
||||
Decimals: 0,
|
||||
@ -463,11 +477,11 @@ type Group struct {
|
||||
|
||||
func (g Group) validate() error {
|
||||
if g.Title != upperFirst(g.Title) || g.Title == withPeriod(g.Title) {
|
||||
return fmt.Errorf("Group.Title must start with an uppercase letter and not end with a period; found \"%s\"", g.Title)
|
||||
return fmt.Errorf("Title must start with an uppercase letter and not end with a period; found \"%s\"", g.Title)
|
||||
}
|
||||
for i, r := range g.Rows {
|
||||
if err := r.validate(); err != nil {
|
||||
return fmt.Errorf("row %d: %v", i, err)
|
||||
return fmt.Errorf("Row %d: %v", i, err)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
@ -482,9 +496,9 @@ func (r Row) validate() error {
|
||||
if len(r) < 1 || len(r) > 4 {
|
||||
return fmt.Errorf("row must have 1 to 4 observables only, found %v", len(r))
|
||||
}
|
||||
for _, o := range r {
|
||||
for i, o := range r {
|
||||
if err := o.validate(); err != nil {
|
||||
return fmt.Errorf("observable %q: %v", o.Name, err)
|
||||
return fmt.Errorf("Observable %d %q: %v", i, o.Name, err)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
@ -540,7 +554,7 @@ type Observable struct {
|
||||
//
|
||||
Description string
|
||||
|
||||
// Owner indicates the team that owns any alerts associated with this Observable.
|
||||
// Owner indicates the team that owns this Observable (including its alerts and maintenance).
|
||||
Owner ObservableOwner
|
||||
|
||||
// Query is the actual Prometheus query that should be observed.
|
||||
@ -567,17 +581,23 @@ type Observable struct {
|
||||
DataMayNotBeNaN bool
|
||||
|
||||
// Warning and Critical alert definitions.
|
||||
// Consider adding at least a Warning or Critical alert to each Observable to make it easy to
|
||||
// identify when the target of this metric is missbehaving.
|
||||
// Consider adding at least a Warning or Critical alert to each Observable to make it
|
||||
// easy to identify when the target of this metric is misbehaving. If no alerts are
|
||||
// provided, NoAlert must be set and Interpretation must be provided.
|
||||
Warning, Critical *ObservableAlertDefinition
|
||||
|
||||
// NoAlerts is used by Observables that don't need any alerts.
|
||||
// We want to be explicit about this to ensure alerting is considered and if we choose not to Alert,
|
||||
// its easy to identify it is an intentional behavior.
|
||||
// NoAlert must be set by Observables that do not have any alerts.
|
||||
// This ensures the omission of alerts is intentional. If set to true, an Interpretation
|
||||
// must be provided in place of PossibleSolutions.
|
||||
NoAlert bool
|
||||
|
||||
// PossibleSolutions is Markdown describing possible solutions in the event that the alert is
|
||||
// firing. If there is no clear potential resolution, "none" must be explicitly stated.
|
||||
// PossibleSolutions is Markdown describing possible solutions in the event that the
|
||||
// alert is firing. This field must be left empty if no alerts are attached to this
|
||||
// Observable. If alerts are configured but there is no clear potential resolution,
|
||||
// "none" must be explicitly stated.
|
||||
//
|
||||
// Use the Interpretation field for additional guidance on understanding this Observable
|
||||
// that isn't directly related to solving it.
|
||||
//
|
||||
// Contacting support should not be mentioned as part of a possible solution, as it is
|
||||
// communicated elsewhere.
|
||||
@ -600,9 +620,23 @@ type Observable struct {
|
||||
// 2. The indentation in the string literal is removed (based on the last line).
|
||||
// 3. Single quotes become backticks.
|
||||
// 4. The last line (which is all indention) is removed.
|
||||
// 5. Non-list items are converted to a list.
|
||||
//
|
||||
PossibleSolutions string
|
||||
|
||||
// Interpretation is Markdown that can serve as a reference for interpreting this
|
||||
// observable. For example, Interpretation could provide guidance on what sort of
|
||||
// patterns to look for in the observable's graph and document why this observable is
|
||||
// useful.
|
||||
//
|
||||
// If no alerts are configured for an observable, this field is required. If the
|
||||
// Description is sufficient to capture what this Observable describes, "none" must be
|
||||
// explicitly stated.
|
||||
//
|
||||
// To make writing the Markdown more friendly in Go, the same string literal processing
|
||||
// as PossibleSolutions is applied, though the output is not converted to a list.
|
||||
Interpretation string
|
||||
|
||||
// PanelOptions describes some options for how to render the metric in the Grafana panel.
|
||||
PanelOptions ObservablePanelOptions
|
||||
}
|
||||
@ -617,38 +651,60 @@ func (o Observable) WithCritical(a *ObservableAlertDefinition) Observable {
|
||||
return o
|
||||
}
|
||||
|
||||
func (o Observable) WithNoAlerts() Observable {
|
||||
// WithNoAlerts disables alerting on this Observable and sets the given interpretation instead.
|
||||
func (o Observable) WithNoAlerts(interpretation string) Observable {
|
||||
o.Warning = nil
|
||||
o.Critical = nil
|
||||
o.NoAlert = true
|
||||
o.PossibleSolutions = ""
|
||||
o.Interpretation = interpretation
|
||||
return o
|
||||
}
|
||||
|
||||
func (o Observable) validate() error {
|
||||
if strings.Contains(o.Name, " ") || strings.ToLower(o.Name) != o.Name {
|
||||
return fmt.Errorf("Observable.Name must be in lower_snake_case; found \"%s\"", o.Name)
|
||||
return fmt.Errorf("Name must be in lower_snake_case; found \"%s\"", o.Name)
|
||||
}
|
||||
if v := string([]rune(o.Description)[0]); v != strings.ToLower(v) {
|
||||
return fmt.Errorf("Observable.Description must be lowercase; found \"%s\"", o.Description)
|
||||
}
|
||||
|
||||
if !o.NoAlert && o.Warning.isEmpty() && o.Critical.isEmpty() {
|
||||
return fmt.Errorf("Observable.Warning or Observable.Critical must be set or explicitly disable alerts with Observable.NoAlert")
|
||||
}
|
||||
|
||||
if l := strings.ToLower(o.PossibleSolutions); strings.Contains(l, "contact support") || strings.Contains(l, "contact us") {
|
||||
return fmt.Errorf("PossibleSolutions: should not include mentions of contacting support")
|
||||
}
|
||||
if o.PossibleSolutions == "" {
|
||||
return fmt.Errorf(`PossibleSolutions: must list solutions or "none"`)
|
||||
} else if o.PossibleSolutions != "none" {
|
||||
if _, err := toMarkdownList(o.PossibleSolutions); err != nil {
|
||||
return fmt.Errorf("PossibleSolutions: %v", err)
|
||||
}
|
||||
return fmt.Errorf("Description must be lowercase; found \"%s\"", o.Description)
|
||||
}
|
||||
if o.Owner == "" {
|
||||
return errors.New("Observable.Owner must be defined")
|
||||
return errors.New("Owner must be defined")
|
||||
}
|
||||
|
||||
allAlertsEmpty := (o.Warning.isEmpty() && o.Critical.isEmpty())
|
||||
if allAlertsEmpty || o.NoAlert {
|
||||
// Ensure lack of alerts is intentional
|
||||
if allAlertsEmpty && !o.NoAlert {
|
||||
return fmt.Errorf("Warning or Critical must be set or explicitly disable alerts with NoAlert")
|
||||
} else if !allAlertsEmpty && o.NoAlert {
|
||||
return fmt.Errorf("Warning or Critical alert is set, but NoAlert is also true")
|
||||
}
|
||||
// PossibleSolutions if there are no alerts is redundant and likely an error
|
||||
if o.PossibleSolutions != "" {
|
||||
return fmt.Errorf(`PossibleSolutions is not required if no alerts are configured - did you mean to provide an Interpretation instead?`)
|
||||
}
|
||||
// Interpretation must be provided and valid
|
||||
if o.Interpretation == "" {
|
||||
return fmt.Errorf("Interpretation must be provided if no alerts are set")
|
||||
} else if o.Interpretation != "none" {
|
||||
if _, err := toMarkdown(o.Interpretation, false); err != nil {
|
||||
return fmt.Errorf("Interpretation cannot be converted to Markdown: %w", err)
|
||||
}
|
||||
}
|
||||
} else {
|
||||
// PossibleSolutions must be provided and valid
|
||||
if o.PossibleSolutions == "" {
|
||||
return fmt.Errorf(`PossibleSolutions must list solutions or an explicit "none"`)
|
||||
} else if o.PossibleSolutions != "none" {
|
||||
if solutions, err := toMarkdown(o.PossibleSolutions, true); err != nil {
|
||||
return fmt.Errorf("PossibleSolutions cannot be converted to Markdown: %w", err)
|
||||
} else if l := strings.ToLower(solutions); strings.Contains(l, "contact support") || strings.Contains(l, "contact us") {
|
||||
return fmt.Errorf("PossibleSolutions should not include mentions of contacting support")
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
|
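The decision order of the new `validate` checks can be easier to follow in isolation. The following is a minimal sketch, not the code from this commit: `observableDocs`, `checkDocs`, and the `HasAlerts` field are hypothetical stand-ins that keep only the alert and documentation fields, and the Markdown conversion and "contact support" checks are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// observableDocs is a hypothetical, trimmed-down stand-in for Observable that
// keeps only the fields relevant to the documentation checks.
type observableDocs struct {
	HasAlerts         bool   // true if a Warning or Critical alert is defined
	NoAlert           bool   // explicit opt-out of alerting
	PossibleSolutions string // required when alerts are defined
	Interpretation    string // required when alerts are disabled
}

// checkDocs follows the same decision order as the validation in the diff:
// first confirm the alerting choice is explicit, then require the matching
// documentation field.
func checkDocs(o observableDocs) error {
	switch {
	case !o.HasAlerts && !o.NoAlert:
		return errors.New("alerts must be set or explicitly disabled with NoAlert")
	case o.HasAlerts && o.NoAlert:
		return errors.New("alerts are set, but NoAlert is also true")
	case o.NoAlert && o.PossibleSolutions != "":
		return errors.New("PossibleSolutions is redundant without alerts")
	case o.NoAlert && o.Interpretation == "":
		return errors.New("Interpretation is required without alerts")
	case o.HasAlerts && o.PossibleSolutions == "":
		return errors.New(`PossibleSolutions must list solutions or "none"`)
	}
	return nil
}

func main() {
	cases := []observableDocs{
		{HasAlerts: true, PossibleSolutions: "none"},
		{NoAlert: true, Interpretation: "none"},
		{NoAlert: true, PossibleSolutions: "restart the service"},
		{},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> %v\n", c, checkDocs(c))
	}
}
```

In this sketch the first two cases pass and the last two are rejected, which matches the intent stated in the field comments: an Observable either has alerts with PossibleSolutions, or has NoAlert set with an Interpretation.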
||||
@ -8,6 +8,10 @@ import (
|
||||
"github.com/prometheus/common/model"
|
||||
)
|
||||
|
||||
const (
|
||||
alertRulesFileSuffix = "_alert_rules.yml"
|
||||
)
|
||||
|
||||
// prometheusAlertName creates an alertname that is unique given the combination of parameters
|
||||
func prometheusAlertName(level, service, name string) string {
|
||||
return fmt.Sprintf("%s_%s_%s", level, service, name)
|
||||
|
||||
@ -47,7 +47,7 @@ func pruneAssets(logger log15.Logger, filelist []string, grafanaDir, promDir str
|
||||
plog.Debug("Unable to access file, ignoring")
|
||||
return nil
|
||||
}
|
||||
if !strings.Contains(filepath.Base(path), alertSuffix) || info.IsDir() {
|
||||
if !strings.Contains(filepath.Base(path), alertRulesFileSuffix) || info.IsDir() {
|
||||
return nil
|
||||
}
|
||||
|
||||
|
||||
@ -23,9 +23,10 @@ func withPeriod(s string) string {
|
||||
}
|
||||
|
||||
// stringPtr converts a string value to a pointer, useful for setting fields in some APIs.
|
||||
func stringPtr(s string) *string {
|
||||
return &s
|
||||
}
|
||||
func stringPtr(s string) *string { return &s }
|
||||
|
||||
// boolPtr converts a boolean value to a pointer, useful for setting fields in some APIs.
|
||||
func boolPtr(b bool) *bool { return &b }
|
||||
|
||||
// isValidUID checks if the given string is a valid UID for entry into a Grafana dashboard. This is
|
||||
// primarily used in the URL, e.g. /-/debug/grafana/d/syntect-server/<UID> and allows us to have
|
||||
@ -46,8 +47,8 @@ func isValidUID(s string) bool {
|
||||
return true
|
||||
}
|
||||
|
||||
// toMarkdownList converts a Go string into a Markdown list
|
||||
func toMarkdownList(m string) (string, error) {
|
||||
// toMarkdown converts a Go string to Markdown, and optionally converts it to a list item if requested by forceList.
|
||||
func toMarkdown(m string, forceList bool) (string, error) {
|
||||
m = strings.TrimPrefix(m, "\n")
|
||||
|
||||
// Replace single quotes with backticks.
|
||||
@ -66,18 +67,20 @@ func toMarkdownList(m string) (string, error) {
|
||||
indentionLevel := strings.Count(baseIndention, "\t")
|
||||
removeIndention := strings.Repeat("\t", indentionLevel+1)
|
||||
for i, l := range lines[:len(lines)-1] {
|
||||
newLine := strings.TrimPrefix(l, removeIndention)
|
||||
if l == newLine {
|
||||
trimmedLine := strings.TrimPrefix(l, removeIndention)
|
||||
if l != "" && l == trimmedLine {
|
||||
return "", fmt.Errorf("inconsistent indention (line %d %q expected to start with %q)", i, l, removeIndention)
|
||||
}
|
||||
lines[i] = newLine
|
||||
lines[i] = trimmedLine
|
||||
}
|
||||
m = strings.Join(lines[:len(lines)-1], "\n")
|
||||
}
|
||||
|
||||
// If result is not a list, make it a list, so we can add items.
|
||||
if !strings.HasPrefix(m, "-") && !strings.HasPrefix(m, "*") {
|
||||
m = fmt.Sprintf("- %s", m)
|
||||
if forceList {
|
||||
// If result is not a list, make it a list, so we can add items.
|
||||
if !strings.HasPrefix(m, "-") && !strings.HasPrefix(m, "*") {
|
||||
m = fmt.Sprintf("- %s", m)
|
||||
}
|
||||
}
|
||||
|
||||
return m, nil
|
||||
|
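The effect of the new `forceList` parameter can be illustrated with a reduced version of the function. This is a sketch under assumptions, not the implementation in this commit: `toMarkdownSketch` is a hypothetical name, it skips the indentation handling entirely, and it uses `strings.ReplaceAll` for the quote substitution, which may differ from the original.

```go
package main

import (
	"fmt"
	"strings"
)

// toMarkdownSketch is a hypothetical simplification of toMarkdown. It trims a
// leading newline, turns single quotes into backticks, and only prefixes a
// list marker when forceList is set and the text is not already a list.
// The indentation checks from the real function are intentionally omitted.
func toMarkdownSketch(m string, forceList bool) string {
	m = strings.TrimPrefix(m, "\n")
	m = strings.ReplaceAll(m, "'", "`")
	if forceList && !strings.HasPrefix(m, "-") && !strings.HasPrefix(m, "*") {
		m = "- " + m
	}
	return m
}

func main() {
	// PossibleSolutions-style usage: output is forced into a list item.
	fmt.Println(toMarkdownSketch("check 'gitserver' logs", true))
	// Interpretation-style usage: output stays free-form Markdown.
	fmt.Println(toMarkdownSketch("check 'gitserver' logs", false))
}
```

This corresponds to how the diff calls the function: `toMarkdown(o.PossibleSolutions, true)` for solutions and `toMarkdown(o.Interpretation, false)` for interpretations.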