mirror of
https://github.com/sourcegraph/sourcegraph.git
synced 2026-02-06 17:51:57 +00:00
codeintel: Downgrade queue size critical alert to warning (#60165)
We've been running into spurious alerts on-and-off. We should add observability here to better narrow down why we're getting backlogs that are getting cleared later. In the meantime, it doesn't make sense for this to be a critical alert.
parent d37ea39c61
commit ac49f74baa
@@ -2221,7 +2221,7 @@ Generated query for critical alert: `min((sum by (app) (up{app=~".*(pgsql|codein
 
 **Descriptions**
 
-- <span class="badge badge-critical">critical</span> precise-code-intel-worker: 18000s+ unprocessed upload record queue longest time in queue
+- <span class="badge badge-warning">warning</span> precise-code-intel-worker: 18000s+ unprocessed upload record queue longest time in queue
 
 **Next steps**
 
@@ -2233,7 +2233,7 @@ count being required for the volume of uploads.
 
 ```json
 "observability.silenceAlerts": [
-  "critical_precise-code-intel-worker_codeintel_upload_queued_max_age"
+  "warning_precise-code-intel-worker_codeintel_upload_queued_max_age"
 ]
 ```
 
@@ -2242,7 +2242,7 @@ count being required for the volume of uploads.
 <details>
 <summary>Technical details</summary>
 
-Generated query for critical alert: `max((max(src_codeintel_upload_queued_duration_seconds_total{job=~"^precise-code-intel-worker.*"})) >= 18000)`
+Generated query for warning alert: `max((max(src_codeintel_upload_queued_duration_seconds_total{job=~"^precise-code-intel-worker.*"})) >= 18000)`
 
 </details>
 
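Because the alert's level is part of the identifier used in `observability.silenceAlerts`, an existing silence for the critical alert would presumably stop matching once the alert becomes a warning. The sketch below shows one way such a list could be rewritten. It assumes the identifier has the shape `<level>_<service>_<alert>`, inferred only from the two strings in the diff above; `silenceID` and `migrateSilences` are hypothetical helpers, not part of Sourcegraph.

```go
package main

import (
	"fmt"
	"strings"
)

// silenceID builds an identifier in the "<level>_<service>_<alert>" shape.
// The shape is an assumption inferred from the two identifiers in the diff.
func silenceID(level, service, alert string) string {
	return strings.Join([]string{level, service, alert}, "_")
}

// migrateSilences returns a copy of silenced in which the entry for the given
// alert at oldLevel is replaced by the same alert at newLevel. All other
// entries are copied unchanged.
func migrateSilences(silenced []string, service, alert, oldLevel, newLevel string) []string {
	from := silenceID(oldLevel, service, alert)
	to := silenceID(newLevel, service, alert)
	out := make([]string, 0, len(silenced))
	for _, id := range silenced {
		if id == from {
			id = to
		}
		out = append(out, id)
	}
	return out
}

func main() {
	silenced := []string{"critical_precise-code-intel-worker_codeintel_upload_queued_max_age"}
	updated := migrateSilences(silenced,
		"precise-code-intel-worker", "codeintel_upload_queued_max_age",
		"critical", "warning")
	fmt.Println(updated)
}
```

Matching on the full identifier rather than on a `critical_` prefix keeps silences for unrelated alerts untouched.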
@@ -66,7 +66,7 @@ func (codeIntelligence) NewUploadQueueGroup(containerName string) monitoring.Gro
 },
 
 QueueSize: NoAlertsOption("none"),
-QueueMaxAge: CriticalOption(monitoring.Alert().GreaterOrEqual((time.Hour * 5).Seconds()), `
+QueueMaxAge: WarningOption(monitoring.Alert().GreaterOrEqual((time.Hour * 5).Seconds()), `
 An alert here could be indicative of a few things: an upload surfacing a pathological performance characteristic,
 precise-code-intel-worker being underprovisioned for the required upload processing throughput, or a higher replica
 count being required for the volume of uploads.
 