Mirror of https://github.com/sourcegraph/sourcegraph.git, synced 2026-02-06 16:31:47 +00:00
monitoring(repo-updater): add duration on critical rate limit alerts, update next steps (#36683)
The existing advice is not great: restarting the pod does not guarantee a new public IP, and customers may be routing through a public gateway (e.g. Cloud NAT) from VMs that have no public IP address at all. It is also noted that this alert fires quite frequently. On larger instances, running up against the limit is common and even expected, and Sourcegraph should for the most part continue working even if it exhausts its rate limits. It only becomes a critical issue when the rate limit is exhausted immediately after a rate limit reset, which we can detect by checking whether the remaining quota stays below the threshold for most of a reset window. This change addresses both issues by updating the alerts and their NextSteps.
parent: d50f61dd6b
commit: 88471744b2
@@ -3802,16 +3802,18 @@ with your code hosts connections or networking issues affecting communication wi

 **Descriptions**

-- <span class="badge badge-critical">critical</span> repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit
+- <span class="badge badge-warning">warning</span> repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit
+- <span class="badge badge-critical">critical</span> repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit for 50m0s

 **Next steps**

-- Try restarting the pod to get a different public IP.
+- Consider creating a new token for the indicated resource (the `name` label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-github-graphql-rate-limit-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

 ```json
 "observability.silenceAlerts": [
+  "warning_repo-updater_github_graphql_rate_limit_remaining",
   "critical_repo-updater_github_graphql_rate_limit_remaining"
 ]
 ```
@@ -3826,16 +3828,18 @@ with your code hosts connections or networking issues affecting communication wi

 **Descriptions**

-- <span class="badge badge-critical">critical</span> repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit
+- <span class="badge badge-warning">warning</span> repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit
+- <span class="badge badge-critical">critical</span> repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit for 50m0s

 **Next steps**

-- Try restarting the pod to get a different public IP.
+- Consider creating a new token for the indicated resource (the `name` label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-github-rest-rate-limit-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

 ```json
 "observability.silenceAlerts": [
+  "warning_repo-updater_github_rest_rate_limit_remaining",
   "critical_repo-updater_github_rest_rate_limit_remaining"
 ]
 ```
@@ -3850,16 +3854,18 @@ with your code hosts connections or networking issues affecting communication wi

 **Descriptions**

-- <span class="badge badge-critical">critical</span> repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit
+- <span class="badge badge-warning">warning</span> repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit
+- <span class="badge badge-critical">critical</span> repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit for 50m0s

 **Next steps**

-- Try restarting the pod to get a different public IP.
+- Consider creating a new token for the indicated resource (the `name` label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-github-search-rate-limit-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:

 ```json
 "observability.silenceAlerts": [
+  "warning_repo-updater_github_search_rate_limit_remaining",
   "critical_repo-updater_github_search_rate_limit_remaining"
 ]
 ```
doc/admin/observability/dashboards.md (generated): 6 changed lines
@@ -11550,7 +11550,7 @@ Query: `max(src_repoupdater_errored_sync_jobs_percentage)`

 <p class="subtitle">Remaining calls to GitHub graphql API before hitting the rate limit</p>

-Refer to the [alerts reference](./alerts.md#repo-updater-github-graphql-rate-limit-remaining) for 1 alert related to this panel.
+Refer to the [alerts reference](./alerts.md#repo-updater-github-graphql-rate-limit-remaining) for 2 alerts related to this panel.

 To see this panel, visit `/-/debug/grafana/d/repo-updater/repo-updater?viewPanel=100220` on your Sourcegraph instance.
@@ -11569,7 +11569,7 @@ Query: `max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"})`

 <p class="subtitle">Remaining calls to GitHub rest API before hitting the rate limit</p>

-Refer to the [alerts reference](./alerts.md#repo-updater-github-rest-rate-limit-remaining) for 1 alert related to this panel.
+Refer to the [alerts reference](./alerts.md#repo-updater-github-rest-rate-limit-remaining) for 2 alerts related to this panel.

 To see this panel, visit `/-/debug/grafana/d/repo-updater/repo-updater?viewPanel=100221` on your Sourcegraph instance.
@@ -11588,7 +11588,7 @@ Query: `max by (name) (src_github_rate_limit_remaining_v2{resource="rest"})`

 <p class="subtitle">Remaining calls to GitHub search API before hitting the rate limit</p>

-Refer to the [alerts reference](./alerts.md#repo-updater-github-search-rate-limit-remaining) for 1 alert related to this panel.
+Refer to the [alerts reference](./alerts.md#repo-updater-github-search-rate-limit-remaining) for 2 alerts related to this panel.

 To see this panel, visit `/-/debug/grafana/d/repo-updater/repo-updater?viewPanel=100222` on your Sourcegraph instance.
@@ -392,29 +392,44 @@ func RepoUpdater() *monitoring.Dashboard {
 				Description: "remaining calls to GitHub graphql API before hitting the rate limit",
 				Query:       `max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"})`,
 				// 5% of initial limit of 5000
-				Critical:  monitoring.Alert().LessOrEqual(250),
-				Panel:     monitoring.Panel().LegendFormat("{{name}}"),
-				Owner:     monitoring.ObservableOwnerRepoManagement,
-				NextSteps: `Try restarting the pod to get a different public IP.`,
+				Warning: monitoring.Alert().LessOrEqual(250),
+				// Critical if most of a 60-minute reset window is spent below
+				// the threshold.
+				Critical: monitoring.Alert().LessOrEqual(250).For(50 * time.Minute),
+				Panel:    monitoring.Panel().LegendFormat("{{name}}"),
+				Owner:    monitoring.ObservableOwnerRepoManagement,
+				NextSteps: `
+					- Consider creating a new token for the indicated resource (the 'name' label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
+				`,
 			},
 			{
 				Name:        "github_rest_rate_limit_remaining",
 				Description: "remaining calls to GitHub rest API before hitting the rate limit",
 				Query:       `max by (name) (src_github_rate_limit_remaining_v2{resource="rest"})`,
 				// 5% of initial limit of 5000
-				Critical:  monitoring.Alert().LessOrEqual(250),
-				Panel:     monitoring.Panel().LegendFormat("{{name}}"),
-				Owner:     monitoring.ObservableOwnerRepoManagement,
-				NextSteps: `Try restarting the pod to get a different public IP.`,
+				Warning: monitoring.Alert().LessOrEqual(250),
+				// Critical if most of a 60-minute reset window is spent below
+				// the threshold.
+				Critical: monitoring.Alert().LessOrEqual(250).For(50 * time.Minute),
+				Panel:    monitoring.Panel().LegendFormat("{{name}}"),
+				Owner:    monitoring.ObservableOwnerRepoManagement,
+				NextSteps: `
+					- Consider creating a new token for the indicated resource (the 'name' label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
+				`,
 			},
 			{
 				Name:        "github_search_rate_limit_remaining",
 				Description: "remaining calls to GitHub search API before hitting the rate limit",
 				Query:       `max by (name) (src_github_rate_limit_remaining_v2{resource="search"})`,
-				Critical:  monitoring.Alert().LessOrEqual(5),
-				Panel:     monitoring.Panel().LegendFormat("{{name}}"),
-				Owner:     monitoring.ObservableOwnerRepoManagement,
-				NextSteps: `Try restarting the pod to get a different public IP.`,
+				Warning: monitoring.Alert().LessOrEqual(5),
+				// Critical if most of a 60-minute reset window is spent below
+				// the threshold.
+				Critical: monitoring.Alert().LessOrEqual(5).For(50 * time.Minute),
+				Panel:    monitoring.Panel().LegendFormat("{{name}}"),
+				Owner:    monitoring.ObservableOwnerRepoManagement,
+				NextSteps: `
+					- Consider creating a new token for the indicated resource (the 'name' label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
+				`,
 			},
 		},
 		{
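For context, Sourcegraph's monitoring generator compiles observables like `monitoring.Alert().LessOrEqual(250).For(50 * time.Minute)` into Prometheus-style alerting rules. A rough, illustrative sketch of the shape such a rule could take for the graphql observable (the exact generated expression, labels, and file layout are generator internals and may differ; only the alert name is taken from the silence identifiers documented above):

```yaml
groups:
  - name: repo-updater
    rules:
      - alert: critical_repo-updater_github_graphql_rate_limit_remaining
        # Fires only after the remaining quota has been at or below the
        # threshold continuously for 50 of the 60 minutes in a reset window.
        expr: max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"}) <= 250
        for: 50m
        labels:
          level: critical
```

The `for: 50m` clause is what makes the critical alert tolerant of brief dips: any scrape above 250 resets the pending timer.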