mirror of
https://github.com/sourcegraph/sourcegraph.git
synced 2026-02-06 17:31:43 +00:00
monitoring/gitserver: do not alert before janitor threshold (#44768)
Right now, we get a critical alert _before_ the janitor kicks in to enforce the default `SRC_REPOS_DESIRED_PERCENT_FREE`. A critical alert should only fire when the instance is in a critical state, but here the system may still recover by evicting deleted repositories, so we update the thresholds on `disk_space_remaining` such that:

1. warning fires when _approaching_ the default `SRC_REPOS_DESIRED_PERCENT_FREE`
2. critical fires if we surpass the default `SRC_REPOS_DESIRED_PERCENT_FREE` and gitserver is unable to recover in a short time span
parent 4aaa2ba871
commit 381d171872
@@ -64,9 +64,12 @@ import (
 )
 
 var (
-	reposDir = env.Get("SRC_REPOS_DIR", "/data/repos", "Root dir containing repos.")
-	wantPctFree = env.MustGetInt("SRC_REPOS_DESIRED_PERCENT_FREE", 10, "Target percentage of free space on disk.")
-	janitorInterval = env.MustGetDuration("SRC_REPOS_JANITOR_INTERVAL", 1*time.Minute, "Interval between cleanup runs")
+	reposDir = env.Get("SRC_REPOS_DIR", "/data/repos", "Root dir containing repos.")
+
+	// Align these variables with the 'disk_space_remaining' alerts in monitoring
+	wantPctFree = env.MustGetInt("SRC_REPOS_DESIRED_PERCENT_FREE", 10, "Target percentage of free space on disk.")
+	janitorInterval = env.MustGetDuration("SRC_REPOS_JANITOR_INTERVAL", 1*time.Minute, "Interval between cleanup runs")
+
 	syncRepoStateInterval = env.MustGetDuration("SRC_REPOS_SYNC_STATE_INTERVAL", 10*time.Minute, "Interval between state syncs")
 	syncRepoStateBatchSize = env.MustGetInt("SRC_REPOS_SYNC_STATE_BATCH_SIZE", 500, "Number of updates to perform per batch")
 	syncRepoStateUpdatePerSecond = env.MustGetInt("SRC_REPOS_SYNC_STATE_UPSERT_PER_SEC", 500, "The number of updated rows allowed per second across all gitserver instances")
@@ -1405,13 +1405,14 @@ Generated query for critical alert: `max((histogram_quantile(0.9, sum by(le) (la
 
 **Descriptions**
 
-- <span class="badge badge-warning">warning</span> gitserver: less than 25% disk space remaining by instance
-- <span class="badge badge-critical">critical</span> gitserver: less than 15% disk space remaining by instance
+- <span class="badge badge-warning">warning</span> gitserver: less than 15% disk space remaining by instance
+- <span class="badge badge-critical">critical</span> gitserver: less than 10% disk space remaining by instance for 10m0s
 
 **Next steps**
 
-- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
-- Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#gitserver-disk-space-remaining).
+- On a warning alert, you may want to provision more disk space: Sourcegraph may be about to start evicting repositories due to disk pressure, which may result in decreased performance, users having to wait for repositories to clone, etc.
+- On a critical alert, you need to provision more disk space: Sourcegraph should be evicting repositories from disk, but is either filling up faster than it can evict, or there is an issue with the janitor job.
+- More help interpreting this metric is available in the [dashboards reference](./dashboards.md#gitserver-disk-space-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
 
 ```json
@@ -1426,9 +1427,9 @@ Generated query for critical alert: `max((histogram_quantile(0.9, sum by(le) (la
 <details>
 <summary>Technical details</summary>
 
-Generated query for warning alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) <= 25)`
+Generated query for warning alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 15)`
 
-Generated query for critical alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) <= 15)`
+Generated query for critical alert: `min(((src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100) < 10)`
 
 </details>
 
2
doc/admin/observability/dashboards.md
generated
@@ -4675,6 +4675,8 @@ Query: `sum by (container_label_io_kubernetes_pod_name) (rate(container_cpu_usag
 
 <p class="subtitle">Disk space remaining by instance</p>
 
+Indicates disk space remaining for each gitserver instance, which is used to determine when to start evicting least-used repository clones from disk (default 10%, configured by `SRC_REPOS_DESIRED_PERCENT_FREE`).
+
 Refer to the [alerts reference](./alerts.md#gitserver-disk-space-remaining) for 2 alerts related to this panel.
 
 To see this panel, visit `/-/debug/grafana/d/gitserver/gitserver?viewPanel=100020` on your Sourcegraph instance.
@@ -94,14 +94,25 @@ func GitServer() *monitoring.Dashboard {
 	Name: "disk_space_remaining",
 	Description: "disk space remaining by instance",
 	Query: `(src_gitserver_disk_space_available / src_gitserver_disk_space_total) * 100`,
-	Warning: monitoring.Alert().LessOrEqual(25),
-	Critical: monitoring.Alert().LessOrEqual(15),
+	// Warning alert when we have disk space remaining that is
+	// approaching the default SRC_REPOS_DESIRED_PERCENT_FREE
+	Warning: monitoring.Alert().Less(15),
+	// Critical alert when we have less space remaining than the
+	// default SRC_REPOS_DESIRED_PERCENT_FREE some amount of time.
+	// This means that gitserver should be evicting repos, but it's
+	// either filling up faster than it can evict, or there is an
+	// issue with the janitor job.
+	Critical: monitoring.Alert().Less(10).For(10 * time.Minute),
 	Panel: monitoring.Panel().LegendFormat("{{instance}}").
 		Unit(monitoring.Percentage).
 		With(monitoring.PanelOptions.LegendOnRight()),
 	Owner: monitoring.ObservableOwnerRepoManagement,
+	Interpretation: `
+		Indicates disk space remaining for each gitserver instance, which is used to determine when to start evicting least-used repository clones from disk (default 10%, configured by 'SRC_REPOS_DESIRED_PERCENT_FREE').
+	`,
 	NextSteps: `
-		- **Provision more disk space:** Sourcegraph will begin deleting least-used repository clones at 10% disk space remaining which may result in decreased performance, users having to wait for repositories to clone, etc.
+		- On a warning alert, you may want to provision more disk space: Sourcegraph may be about to start evicting repositories due to disk pressure, which may result in decreased performance, users having to wait for repositories to clone, etc.
+		- On a critical alert, you need to provision more disk space: Sourcegraph should be evicting repositories from disk, but is either filling up faster than it can evict, or there is an issue with the janitor job.
 	`,
 },
 },
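The eviction threshold mentioned in the `Interpretation` text above (default 10%, from `SRC_REPOS_DESIRED_PERCENT_FREE`) can be illustrated with a small Go sketch. The `bytesToFree` helper is hypothetical and only assumes that the janitor needs to know how much data to evict to get back to the target; it is not taken from the gitserver source.

```go
package main

import "fmt"

// bytesToFree returns how many bytes would need to be evicted so that at
// least wantPctFree percent of the disk is free. It returns 0 when the disk
// already meets the target. Hypothetical helper for illustration only.
func bytesToFree(totalBytes, freeBytes uint64, wantPctFree int) uint64 {
	desiredFree := uint64(float64(totalBytes) * float64(wantPctFree) / 100.0)
	if freeBytes >= desiredFree {
		return 0
	}
	return desiredFree - freeBytes
}

func main() {
	const gib = 1 << 30

	// 12 GiB free of 100 GiB: above the 10% target, nothing to evict.
	fmt.Println(bytesToFree(100*gib, 12*gib, 10) / gib) // 0

	// 8 GiB free of 100 GiB: 2 GiB must be evicted to reach 10% free.
	fmt.Println(bytesToFree(100*gib, 8*gib, 10) / gib) // 2
}
```

Under this model, a warning at 15% fires while `bytesToFree` is still zero, and a critical alert fires only if the result stays positive for ten minutes.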