codeintel: Downgrade queue size critical alert to warning (#60165)

We've been running into spurious alerts on and off.

We should add observability here to better narrow down why
we're getting backlogs that clear later.

In the meantime, it doesn't make sense for this to
be a critical alert.
Varun Gandhi 2024-02-05 08:58:55 -06:00 committed by GitHub
parent d37ea39c61
commit ac49f74baa
2 changed files with 4 additions and 4 deletions


@@ -2221,7 +2221,7 @@ Generated query for critical alert: `min((sum by (app) (up{app=~".*(pgsql|codein
**Descriptions**
-- <span class="badge badge-critical">critical</span> precise-code-intel-worker: 18000s+ unprocessed upload record queue longest time in queue
+- <span class="badge badge-warning">warning</span> precise-code-intel-worker: 18000s+ unprocessed upload record queue longest time in queue
**Next steps**
@@ -2233,7 +2233,7 @@ count being required for the volume of uploads.
```json
"observability.silenceAlerts": [
-"critical_precise-code-intel-worker_codeintel_upload_queued_max_age"
+"warning_precise-code-intel-worker_codeintel_upload_queued_max_age"
]
```
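
The alert level is the first component of the silence identifier, so a deployment that silenced the old `critical_...` entry would presumably need to switch to the `warning_...` entry after this change. A minimal Go sketch of how the identifier appears to be composed, inferred only from the two strings in the hunk above; `silenceKey` is a hypothetical helper for illustration, not a function from the monitoring package:

```go
package main

import (
	"fmt"
	"strings"
)

// silenceKey is a hypothetical helper (not part of the real codebase).
// It joins level, container, and alert name with underscores, which is
// the shape the identifiers in the diff appear to follow.
func silenceKey(level, container, alertName string) string {
	return strings.Join([]string{level, container, alertName}, "_")
}

func main() {
	// Identifier to use in "observability.silenceAlerts" after the downgrade.
	fmt.Println(silenceKey("warning", "precise-code-intel-worker", "codeintel_upload_queued_max_age"))
}
```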
@@ -2242,7 +2242,7 @@ count being required for the volume of uploads.
<details>
<summary>Technical details</summary>
-Generated query for critical alert: `max((max(src_codeintel_upload_queued_duration_seconds_total{job=~"^precise-code-intel-worker.*"})) >= 18000)`
+Generated query for warning alert: `max((max(src_codeintel_upload_queued_duration_seconds_total{job=~"^precise-code-intel-worker.*"})) >= 18000)`
</details>


@@ -66,7 +66,7 @@ func (codeIntelligence) NewUploadQueueGroup(containerName string) monitoring.Gro
},
QueueSize: NoAlertsOption("none"),
-QueueMaxAge: CriticalOption(monitoring.Alert().GreaterOrEqual((time.Hour * 5).Seconds()), `
+QueueMaxAge: WarningOption(monitoring.Alert().GreaterOrEqual((time.Hour * 5).Seconds()), `
An alert here could be indicative of a few things: an upload surfacing a pathological performance characteristic,
precise-code-intel-worker being underprovisioned for the required upload processing throughput, or a higher replica
count being required for the volume of uploads.
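
The `18000` in the generated query and the "18000s+" description come from the `(time.Hour * 5).Seconds()` expression in the hunk above: five hours expressed in seconds. The threshold itself is unchanged by this commit; only the severity differs. A minimal standalone sketch of that arithmetic, using only the standard `time` package (`queueMaxAgeThresholdSeconds` is a hypothetical name used here for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// queueMaxAgeThresholdSeconds mirrors the expression passed to
// GreaterOrEqual in the diff. The function name is hypothetical.
func queueMaxAgeThresholdSeconds() float64 {
	// Duration.Seconds() returns the duration as a float64 number of seconds.
	return (time.Hour * 5).Seconds()
}

func main() {
	fmt.Println(queueMaxAgeThresholdSeconds()) // prints 18000
}
```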