From 88471744b2ca4e2a1dcdda20eba8de575ebe13d5 Mon Sep 17 00:00:00 2001
From: Robert Lin
Date: Tue, 7 Jun 2022 08:18:48 -0700
Subject: [PATCH] monitoring(repo-updater): add duration on critical rate limit
 alerts, update next steps (#36683)

The existing advice is not great: restarting the pod doesn't guarantee
a new public IP, and customers may be using a public gateway (e.g.
Cloud NAT) with VMs that have no public IP address.

This alert also fires quite frequently. On larger instances, running up
against the limit is common and even expected, and Sourcegraph should
for the most part keep working even after exhausting its rate limits.
It only becomes a critical issue if the rate limit is exhausted again
immediately after a rate limit reset. We can detect this by checking
whether the remaining calls stay below the threshold for most of a
window.

This change addresses both issues with updates to the alerts and
NextSteps.
---
 doc/admin/observability/alerts.md      | 18 ++++++++----
 doc/admin/observability/dashboards.md  |  6 ++--
 monitoring/definitions/repo_updater.go | 39 ++++++++++++++++++--------
 3 files changed, 42 insertions(+), 21 deletions(-)

diff --git a/doc/admin/observability/alerts.md b/doc/admin/observability/alerts.md
index e937c63a7f8..9894a9cf301 100644
--- a/doc/admin/observability/alerts.md
+++ b/doc/admin/observability/alerts.md
@@ -3802,16 +3802,18 @@ with your code hosts connections or networking issues affecting communication wi
 
 **Descriptions**
 
-- critical repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit
+- warning repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit
+- critical repo-updater: less than 250 remaining calls to GitHub graphql API before hitting the rate limit for 50m0s
 
 **Next steps**
 
-- Try restarting the pod to get a different public IP.
+- Consider creating a new token for the indicated resource (the `name` label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-github-graphql-rate-limit-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
 
 ```json
 "observability.silenceAlerts": [
+  "warning_repo-updater_github_graphql_rate_limit_remaining",
   "critical_repo-updater_github_graphql_rate_limit_remaining"
 ]
 ```
@@ -3826,16 +3828,18 @@ with your code hosts connections or networking issues affecting communication wi
 
 **Descriptions**
 
-- critical repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit
+- warning repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit
+- critical repo-updater: less than 250 remaining calls to GitHub rest API before hitting the rate limit for 50m0s
 
 **Next steps**
 
-- Try restarting the pod to get a different public IP.
+- Consider creating a new token for the indicated resource (the `name` label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-github-rest-rate-limit-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
 
 ```json
 "observability.silenceAlerts": [
+  "warning_repo-updater_github_rest_rate_limit_remaining",
   "critical_repo-updater_github_rest_rate_limit_remaining"
 ]
 ```
@@ -3850,16 +3854,18 @@ with your code hosts connections or networking issues affecting communication wi
 
 **Descriptions**
 
-- critical repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit
+- warning repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit
+- critical repo-updater: less than 5 remaining calls to GitHub search API before hitting the rate limit for 50m0s
 
 **Next steps**
 
-- Try restarting the pod to get a different public IP.
+- Consider creating a new token for the indicated resource (the `name` label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
 - Learn more about the related dashboard panel in the [dashboards reference](./dashboards.md#repo-updater-github-search-rate-limit-remaining).
 - **Silence this alert:** If you are aware of this alert and want to silence notifications for it, add the following to your site configuration and set a reminder to re-evaluate the alert:
 
 ```json
 "observability.silenceAlerts": [
+  "warning_repo-updater_github_search_rate_limit_remaining",
   "critical_repo-updater_github_search_rate_limit_remaining"
 ]
 ```
diff --git a/doc/admin/observability/dashboards.md b/doc/admin/observability/dashboards.md
index 955516cad7c..fdc02855b9c 100644
--- a/doc/admin/observability/dashboards.md
+++ b/doc/admin/observability/dashboards.md
@@ -11550,7 +11550,7 @@ Query: `max(src_repoupdater_errored_sync_jobs_percentage)`

 Remaining calls to GitHub graphql API before hitting the rate limit
 
-Refer to the [alerts reference](./alerts.md#repo-updater-github-graphql-rate-limit-remaining) for 1 alert related to this panel.
+Refer to the [alerts reference](./alerts.md#repo-updater-github-graphql-rate-limit-remaining) for 2 alerts related to this panel.
 
 To see this panel, visit `/-/debug/grafana/d/repo-updater/repo-updater?viewPanel=100220` on your Sourcegraph instance.
 
@@ -11569,7 +11569,7 @@ Query: `max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"})`

 Remaining calls to GitHub rest API before hitting the rate limit
 
-Refer to the [alerts reference](./alerts.md#repo-updater-github-rest-rate-limit-remaining) for 1 alert related to this panel.
+Refer to the [alerts reference](./alerts.md#repo-updater-github-rest-rate-limit-remaining) for 2 alerts related to this panel.
 
 To see this panel, visit `/-/debug/grafana/d/repo-updater/repo-updater?viewPanel=100221` on your Sourcegraph instance.
 
@@ -11588,7 +11588,7 @@ Query: `max by (name) (src_github_rate_limit_remaining_v2{resource="rest"})`

 Remaining calls to GitHub search API before hitting the rate limit
 
-Refer to the [alerts reference](./alerts.md#repo-updater-github-search-rate-limit-remaining) for 1 alert related to this panel.
+Refer to the [alerts reference](./alerts.md#repo-updater-github-search-rate-limit-remaining) for 2 alerts related to this panel.
 
 To see this panel, visit `/-/debug/grafana/d/repo-updater/repo-updater?viewPanel=100222` on your Sourcegraph instance.
 
diff --git a/monitoring/definitions/repo_updater.go b/monitoring/definitions/repo_updater.go
index 771be683a1c..42af3f205f2 100644
--- a/monitoring/definitions/repo_updater.go
+++ b/monitoring/definitions/repo_updater.go
@@ -392,29 +392,44 @@ func RepoUpdater() *monitoring.Dashboard {
 					Description: "remaining calls to GitHub graphql API before hitting the rate limit",
 					Query:       `max by (name) (src_github_rate_limit_remaining_v2{resource="graphql"})`,
 					// 5% of initial limit of 5000
-					Critical:  monitoring.Alert().LessOrEqual(250),
-					Panel:     monitoring.Panel().LegendFormat("{{name}}"),
-					Owner:     monitoring.ObservableOwnerRepoManagement,
-					NextSteps: `Try restarting the pod to get a different public IP.`,
+					Warning: monitoring.Alert().LessOrEqual(250),
+					// Critical if most of a 60-minute reset window is spent below
+					// the threshold.
+					Critical: monitoring.Alert().LessOrEqual(250).For(50 * time.Minute),
+					Panel:    monitoring.Panel().LegendFormat("{{name}}"),
+					Owner:    monitoring.ObservableOwnerRepoManagement,
+					NextSteps: `
+						- Consider creating a new token for the indicated resource (the 'name' label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
+					`,
 				},
 				{
 					Name:        "github_rest_rate_limit_remaining",
 					Description: "remaining calls to GitHub rest API before hitting the rate limit",
 					Query:       `max by (name) (src_github_rate_limit_remaining_v2{resource="rest"})`,
 					// 5% of initial limit of 5000
-					Critical:  monitoring.Alert().LessOrEqual(250),
-					Panel:     monitoring.Panel().LegendFormat("{{name}}"),
-					Owner:     monitoring.ObservableOwnerRepoManagement,
-					NextSteps: `Try restarting the pod to get a different public IP.`,
+					Warning: monitoring.Alert().LessOrEqual(250),
+					// Critical if most of a 60-minute reset window is spent below
+					// the threshold.
+					Critical: monitoring.Alert().LessOrEqual(250).For(50 * time.Minute),
+					Panel:    monitoring.Panel().LegendFormat("{{name}}"),
+					Owner:    monitoring.ObservableOwnerRepoManagement,
+					NextSteps: `
+						- Consider creating a new token for the indicated resource (the 'name' label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
+					`,
 				},
 				{
 					Name:        "github_search_rate_limit_remaining",
 					Description: "remaining calls to GitHub search API before hitting the rate limit",
 					Query:       `max by (name) (src_github_rate_limit_remaining_v2{resource="search"})`,
-					Critical:  monitoring.Alert().LessOrEqual(5),
-					Panel:     monitoring.Panel().LegendFormat("{{name}}"),
-					Owner:     monitoring.ObservableOwnerRepoManagement,
-					NextSteps: `Try restarting the pod to get a different public IP.`,
+					Warning: monitoring.Alert().LessOrEqual(5),
+					// Critical if most of a 60-minute reset window is spent below
+					// the threshold.
+					Critical: monitoring.Alert().LessOrEqual(5).For(50 * time.Minute),
+					Panel:    monitoring.Panel().LegendFormat("{{name}}"),
+					Owner:    monitoring.ObservableOwnerRepoManagement,
+					NextSteps: `
+						- Consider creating a new token for the indicated resource (the 'name' label for series below the threshold in the dashboard) under a dedicated machine user to reduce rate limit pressure.
+					`,
 				},
 			},
 			{
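For intuition, the windowed condition the new critical alerts encode — remaining quota continuously at or below the threshold for 50 of the roughly 60 minutes between GitHub rate limit resets — can be sketched in plain Go. This is a hypothetical illustration, not code from the monitoring package: `belowForMostOfWindow` is invented for this sketch, and it assumes `.For(50 * time.Minute)` means "the condition must hold continuously for that long", as with a Prometheus `for:` clause.

```go
package main

import "fmt"

// belowForMostOfWindow reports whether per-minute samples of the
// remaining-calls gauge stay at or below threshold for at least
// minMinutes consecutive minutes — the continuous-duration behavior
// assumed for Alert().LessOrEqual(n).For(d) above.
func belowForMostOfWindow(samples []int, threshold, minMinutes int) bool {
	run := 0
	for _, remaining := range samples {
		if remaining <= threshold {
			run++
			if run >= minMinutes {
				return true
			}
		} else {
			run = 0 // any recovery above the threshold resets the clock
		}
	}
	return false
}

func main() {
	// A brief dip below the threshold: warning fires, critical does not.
	dip := make([]int, 60)
	for i := range dip {
		dip[i] = 3000
	}
	dip[10] = 100
	fmt.Println(belowForMostOfWindow(dip, 250, 50)) // false

	// Exhausted right after a reset and never recovering for the rest
	// of the window: critical fires.
	exhausted := make([]int, 60) // all zeros
	fmt.Println(belowForMostOfWindow(exhausted, 250, 50)) // true
}
```

This is why the one-off dip only trips the warning threshold, while sustained exhaustion right after a reset escalates to critical.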