monitoring/gitserver: do not alert before janitor threshold (#44768)

Right now, we get a critical alert _before_ the janitor kicks in to enforce the default `SRC_REPOS_DESIRED_PERCENT_FREE`. A critical alert should only fire when the instance is in a critical state, but at that point the system may still recover by evicting deleted repositories. We therefore update the thresholds on `disk_space_remaining` such that:

1. warning fires when _approaching_ the default `SRC_REPOS_DESIRED_PERCENT_FREE`
2. critical fires if free space drops below the default `SRC_REPOS_DESIRED_PERCENT_FREE` and gitserver is unable to recover within a short time span (10 minutes)
Robert Lin 2022-11-23 11:35:11 -08:00 committed by GitHub
parent 4aaa2ba871
commit 381d171872
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 29 additions and 12 deletions

@@ -64,9 +64,12 @@ import (
)
var (
reposDir = env.Get("SRC_REPOS_DIR", "/data/repos", "Root dir containing repos.")
wantPctFree = env.MustGetInt("SRC_REPOS_DESIRED_PERCENT_FREE", 10, "Target percentage of free space on disk.")
janitorInterval = env.MustGetDuration("SRC_REPOS_JANITOR_INTERVAL", 1*time.Minute, "Interval between cleanup runs")
reposDir = env.Get("SRC_REPOS_DIR", "/data/repos", "Root dir containing repos.")
// Align these variables with the 'disk_space_remaining' alerts in monitoring
wantPctFree = env.MustGetInt("SRC_REPOS_DESIRED_PERCENT_FREE", 10, "Target percentage of free space on disk.")
janitorInterval = env.MustGetDuration("SRC_REPOS_JANITOR_INTERVAL", 1*time.Minute, "Interval between cleanup runs")
syncRepoStateInterval = env.MustGetDuration("SRC_REPOS_SYNC_STATE_INTERVAL", 10*time.Minute, "Interval between state syncs")
syncRepoStateBatchSize = env.MustGetInt("SRC_REPOS_SYNC_STATE_BATCH_SIZE", 500, "Number of updates to perform per batch")
syncRepoStateUpdatePerSecond = env.MustGetInt("SRC_REPOS_SYNC_STATE_UPSERT_PER_SEC", 500, "The number of updated rows allowed per second across all gitserver instances")

@@ -1405,13 +1405,14 @@ Generated query for critical alert: `max((histogram_quantile(0.9, sum by(le) (la
**Descriptions**
- <span class="badge badge-warning">warning</span> gitserver: less than 25% disk space remaining by instance
- <span class="badge badge-critical">critical</span> gitserver: less than 15% disk space remaining by instance
- <span class="badge badge-warning">warning</span> gitserver: less than 15% disk space remaining by instance
- <span class="badge badge-critical">critical</span> gitserver: less than 10% disk space remaining by instance for 10m0s
**Next steps**
- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#gitserver-disk-space-remaining).
- On a warning alert, you may want to provision more disk space: Sourcegraph may be about to start evicting repositories due to disk pressure, which may result in decreased performance, users having to wait for repositories to clone, etc.
- On a critical alert, you need to provision more disk space: Sourcegraph should be evicting repositories from disk, but is either filling up faster than it can evict, or there is an issue with the janitor job.
- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#gitserver-disk-space-remaining).
- **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
```json
@@ -1426,9 +1427,9 @@ Generated query for critical alert: `max((histogram_quantile(0.9, sum by(le) (la
<details>
<summary>Technical details</summary>
Generated query for warning alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) <= 25)`
Generated query for warning alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 15)`
Generated query for critical alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) <= 15)`
Generated query for critical alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 10)`
</details>

@@ -4675,6 +4675,8 @@ Query: `sum by (container_label_io_kubernetes_pod_name) (rate(container_cpu_usag
<p class="subtitle">Disk space remaining by instance</p>
Indicates disk space remaining for each gitserver instance, which is used to determine when to start evicting least-used repository clones from disk (default 10%, configured by `SRC_REPOS_DESIRED_PERCENT_FREE`).
Refer to the [alerts reference](./alerts.md#gitserver-disk-space-remaining) for 2 alerts related to this panel.
To see this panel, visit `/-/debug/grafana/d/gitserver/gitserver?viewPanel=100020` on your Sourcegraph instance.

@@ -94,14 +94,25 @@ func GitServer() *monitoring.Dashboard {
Name: "disk_space_remaining",
Description: "disk space remaining by instance",
Query: `(src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100`,
Warning: monitoring.Alert().LessOrEqual(25),
Critical: monitoring.Alert().LessOrEqual(15),
// Warning alert when we have disk space remaining that is
// approaching the default SRC_REPOS_DESIRED_PERCENT_FREE
Warning: monitoring.Alert().Less(15),
// Critical alert when we have less space remaining than the
// default SRC_REPOS_DESIRED_PERCENT_FREE for some amount of time.
// This means that gitserver should be evicting repos, but it's
// either filling up faster than it can evict, or there is an
// issue with the janitor job.
Critical: monitoring.Alert().Less(10).For(10 * time.Minute),
Panel: monitoring.Panel().LegendFormat("{{instance}}").
Unit(monitoring.Percentage).
With(monitoring.PanelOptions.LegendOnRight()),
Owner: monitoring.ObservableOwnerRepoManagement,
Interpretation: `
Indicates disk space remaining for each gitserver instance, which is used to determine when to start evicting least-used repository clones from disk (default 10%, configured by 'SRC_REPOS_DESIRED_PERCENT_FREE').
`,
NextSteps: `
- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
- On a warning alert, you may want to provision more disk space: Sourcegraph may be about to start evicting repositories due to disk pressure, which may result in decreased performance, users having to wait for repositories to clone, etc.
- On a critical alert, you need to provision more disk space: Sourcegraph should be evicting repositories from disk, but is either filling up faster than it can evict, or there is an issue with the janitor job.
`,
},
},