Commit Graph

67 Commits

Author SHA1 Message Date
Matthew Manela
92b8ffb8e1
fix(Source): Fix documentation URLs for code hosts help pages (#63274)
It seems many of our doc links for code hosts are broken in production
due to a URL change from external_services to code_hosts. I did a
find-and-replace to update all the ones I could find.
2024-06-17 14:32:46 -04:00
Petri-Johan Last
df0c59ed12
Remove echo test critical alert (#63004)
The 1s echo test alert for gitserver triggers on dotcom and doesn't have any actionable consequences, so we are removing it. The warning will remain.
2024-06-03 14:11:28 +02:00
Keegan Carruthers-Smith
2685c8c324
monitoring: add golang monitoring for zoekt (#61731)
Noticed this omission when I was wondering if we had goroutine leaks.
Our other services define this.

I added a simple way to indicate the container name in the title, since
this is the first service we've added that needs it.

Test Plan: go test. Copy-pasted the generated query into Grafana Explore
on dotcom.
2024-04-12 13:46:11 +00:00
Erik Seliger
c32cfe58b8
monitoring: Make gitserver alert less trigger-friendly (#61543) 2024-04-03 15:28:17 +02:00
Erik Seliger
faf189f892
gitserver: Cleanup grafana dashboard (#60870)
- Removes duplicative disk IO panels
- Adds warnings and next-step descriptions to the most critical items
- Hides the backend panel by default
- Adds a metric and alert for repo corruption events
2024-03-12 22:11:13 +01:00
Camden Cheek
1ead945267
Docs: update links to point to new site (#60381)
We have a number of docs links in the product that point to the old doc site. 

Method:
- Search the repo for `docs.sourcegraph.com`
- Exclude the `doc/` dir, all test fixtures, and `CHANGELOG.md`
- For each, replace `docs.sourcegraph.com` with `sourcegraph.com/docs`
- Navigate to the resulting URL ensuring it's not a dead link, updating the URL if necessary

Many of the URLs updated are just comments, but since I'm doing a manual audit of each URL anyway, I felt it was worth updating these while I was at it.
2024-02-13 00:23:47 +00:00
Varun Gandhi
ac49f74baa
codeintel: Downgrade queue size critical alert to warning (#60165)
We've been running into spurious alerts on-and-off.

We should add observability here to better narrow down why
we're getting backlogs that are getting cleared later.

In the meantime, it doesn't make sense for this to
be a critical alert.
2024-02-05 14:58:55 +00:00
Erik Seliger
ff1332f0d8
gitserver: Remove CloneableLimiter (#59935)
In my opinion this is an unnecessary optimization that adds complexity, and the current implementation holds the lock for longer than it needs to: in blocking clone mode the lock is only released once the clone has completed, which limits concurrency more than desired.
There are also the clone limiter AND the RPS limiter still in place, so we have more than enough rate limiters here, IMO.
2024-01-31 09:46:39 +01:00
Petri-Johan Last
1bba959307
Remove blobstore latency alert (#59665) 2024-01-17 19:12:37 -08:00
Geoffrey Gilmore
616e3df4b9
monitoring: fix alert definition for site configuration by adding scrape job label (#59687)
We discovered recently that the definition for the alert that fires if the site configuration hasn't been fetched within 5 minutes strips out the regex that targets individual services (since it uses a grafana variable). This means that every instance of this alert will fire if any individual service trips over this threshold.

This PR fixes the issue by adding a new `job` filter for this alert that targets only the services that have that Prometheus scrape target name. This works around the previous issue by using a fixed value for the `job` value instead of a dynamic grafana value.

The value of the job filter generally looks like `job=~.*$container_name` (following the strategy from https://sourcegraph.com/github.com/sourcegraph/sourcegraph@9a780f2e694238b5326e3e121d6a1828463001b9/-/blob/monitoring/monitoring/monitoring.go?L161 ), except where I noticed that the existing dashboard for a service already used different logic (see the sketch after the examples below).

Ex:

- `frontend`: already used `job=~"(sourcegraph-)?frontend"` for some metrics, so I used it again here
- `worker`: already used `job=~"^worker.*"` in some metrics, so I used it again and standardized the other existing panels to use the same shared variable
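
For illustration, a minimal sketch of this per-service filter strategy, assuming a placeholder metric name (the real site-configuration metric and observable live in the monitoring definitions):

```go
package main

import "fmt"

// jobFilterFor returns the Prometheus job matcher used for a service's
// site-configuration alert. It mirrors the `job=~".*<container>"` strategy,
// with per-service overrides where dashboards already used different logic.
func jobFilterFor(container string) string {
	switch container {
	case "frontend":
		return `job=~"(sourcegraph-)?frontend"`
	case "worker":
		return `job=~"^worker.*"`
	default:
		return fmt.Sprintf(`job=~".*%s"`, container)
	}
}

func main() {
	// Example alert query using a fixed job value instead of the Grafana
	// $container_name variable, so the alert targets only one service.
	// The metric name is a placeholder for illustration.
	query := fmt.Sprintf(
		`max(site_configuration_last_fetch_age_seconds{%s})`,
		jobFilterFor("gitserver"),
	)
	fmt.Println(query)
}
```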

## Test plan

I eyeballed the generated alert.md and dashboards.md to verify that my changes looked correct (that is, my refactors resulted in either no diff, or a diff that still looked like valid regex).
2024-01-17 15:19:54 -08:00
Rafał Gajdulewicz
39d34cd8c9
Remove owner of NoAlert observable (#59384)
* Remove owner from NoAlert observable

* Add generated+pre-commit

* chore: handle trailing spaces

* Regen docs

---------

Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
2024-01-15 16:25:20 +01:00
Erik Seliger
4b5e9f3b8d
Move repo perms syncer to worker (#59510)
Since we have distributed rate limits now, the last dependency is broken and we can move this subsystem around freely.
To make repo-updater more lightweight, worker will be the new home of this system.

## Test plan

Ran stack locally, CI still passes including integration tests.
2024-01-11 21:09:46 +01:00
Erik Seliger
bb09a4ac1f
Remove HTTP for inter-service RPC (#59093)
In the upcoming release, we will only support gRPC going forward. This PR removes the old HTTP client and server implementations and a few leftovers from the transition.
2024-01-11 19:46:32 +01:00
Robert Lin
fc37f74865
monitoring: relax mean_blocked_seconds_per_conn_request alerts (#59507)
https://github.com/sourcegraph/sourcegraph/pull/59284 dramatically reduced the `mean_blocked_seconds_per_conn_request` issues we've been seeing, but overall delays are still higher, even with generally healthy Cloud SQL resource utilization.

<img width="1630" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/91615471-5187-4d15-83e7-5cc94595303c">

Spot-checking the spikes in load in Cloud SQL, it seems that there is a variety of causes for each spike (analytics workloads, Cody Gateway syncs, code intel workloads, gitserver things, `ListSourcegraphDotComIndexableRepos` etc) so I'm chalking this up to "expected". Since this alert is seen firing on a Cloud instance, let's just relax it for now so that it only fires a critical alert on very significant delays.
2024-01-11 01:14:28 +00:00
Petri-Johan Last
9efa6c7e2e
Adjust blobstore latency alert (#59382) 2024-01-09 08:34:27 +02:00
Robert Lin
8bce54ee62
monitoring: remove very long description (#59338) 2024-01-04 22:56:28 -08:00
Julie Tibshirani
513d4ba356
Monitoring: update owner for search platform (#59316)
The 'search core' team is now called 'search platform'.

---------

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-01-04 12:55:41 -08:00
Camden Cheek
191a80856f
Monitoring: update owners for code insights and batches (#59313)
This updates ownership for search, code insights, and batch changes.

---------

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-01-04 12:18:23 -08:00
Robert Lin
55825e9939
monitoring: test owners for valid Opsgenie teams and handbook pages (#59251)
In INC-264, certain alerts - such as [zoekt: less than 90% percentage pods available for 10m0s](https://opsg.in/a/i/sourcegraph/178a626f-0f28-4295-bee9-84da988bb473-1703759057681) - didn't seem to end up going anywhere because the ObservableOwner is defunct. This change adds _opt-in_ testing to report:

1. How many owners have valid Opsgenie teams
2. How many owners have valid handbook pages

In addition, we collect ObservableOwners that pass the test and use it to generate configuration for `site.json` in Sourcegraph.com: https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/18338 - this helps ensure the list is valid and not deceptively high-coverage.

The results are not great, but **enforcing** that owners are valid isn't currently in scope:

```
6/10 ObservableOwners do not have valid Opsgenie teams
3/10 ObservableOwners do not point to valid handbook pages
```

I also removed some defunct/unused functionality/owners.

## Test plan

To run these tests:

```
export OPSGENIE_API_KEY="..."
go test -timeout 30s  github.com/sourcegraph/sourcegraph/monitoring/monitoring -update -online                       
```
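
A rough sketch of how such an opt-in test can be gated on the API key (hypothetical names; the real test lives in the monitoring package):

```go
package monitoring_test

import (
	"os"
	"testing"
)

// TestObservableOwnersOpsgenie sketches an opt-in owner validation test: it
// only runs when an Opsgenie API key is provided, so CI is unaffected by
// default. The validation body below is a placeholder.
func TestObservableOwnersOpsgenie(t *testing.T) {
	if os.Getenv("OPSGENIE_API_KEY") == "" {
		t.Skip("set OPSGENIE_API_KEY to run online owner validation")
	}

	// Placeholder: for each ObservableOwner, query the Opsgenie teams API and
	// the handbook, and report owners that do not resolve.
	owners := []string{"source", "search-platform", "code-intel"}
	for _, owner := range owners {
		t.Logf("would validate Opsgenie team and handbook page for %q", owner)
	}
}
```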
2023-12-29 14:07:35 -08:00
Robert Lin
95b47b7a97
monitoring: assign obvious owners to DatabaseConnectionsMonitoringGroup (#58474)
Updates `DatabaseConnectionsMonitoringGroup` to accept an owner - before, it was just hardcoded to `ObservableOwnerDevOps`, which is not very helpful. This assigns some of the obvious service owners:

1. Source: gitserver, repo-updater
2. Cody: embeddings (but should eventually be @sourcegraph/search-platform, along with all embeddings alerts: https://github.com/sourcegraph/sourcegraph/pull/58474#issuecomment-1821505062)

Source is an active owner based on [thread](https://sourcegraph.slack.com/archives/C0652SSUA20/p1700592165408089?thread_ts=1700549423.860019&cid=C0652SSUA20), and Cody is a fairly recent addition so hopefully it's valid.
I'm not sure the Search one is still up-to-date, so I didn't change some of the obvious search services - for now, these still point to DevOps as they did before. If it becomes problematic we can revisit later.
2023-11-22 01:06:53 +00:00
Erik Seliger
0236f9e240
Remove global lock around GitHub.com requests (#58190)
Looks like GitHub.com has become more lenient, or at least more transparent on their docs page: https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits. The paragraph about a single request per token is gone from this page! Instead, they now describe secondary rate limits quite well:

```
You may encounter a secondary rate limit if you:

Make too many concurrent requests. No more than 100 concurrent requests are allowed. This limit is shared across the REST API and GraphQL API.
Make too many requests to a single endpoint per minute. No more than 900 points per minute are allowed for REST API endpoints, and no more than 2,000 points per minute are allowed for the GraphQL API endpoint. For more information about points, see "Calculating points for the secondary rate limit."
Make too many requests per minute. No more than 90 seconds of CPU time per 60 seconds of real time is allowed. No more than 60 seconds of this CPU time may be for the GraphQL API. You can roughly estimate the CPU time by measuring the total response time for your API requests.
Create too much content on GitHub in a short amount of time. In general, no more than 80 content-generating requests per minute and no more than 500 content-generating requests per hour are allowed. Some endpoints have lower content creation limits. Content creation limits include actions taken on the GitHub web interface as well as via the REST API and GraphQL API.
```

So the limit is no longer 1, it is roughly 100. Well, that depends on which APIs you're calling, but whatever. Strangely, the best practices section still advises a single concurrent request; I followed up with a GitHub support ticket to clarify.

### Outcome

They said 100 is the limit, but for certain requests the number can be lower. This doesn't convince us (team Source) that it's worth keeping the global lock.

Besides, they also document that they return a Retry-After header in this event, and we already handle that with retries (as long as the retry is not too far in the future). So I want to say that this is "no different than any other API" at this point. Sure, there are some limits that they enforce, but that's true for all APIs. The single-concurrency limit was quite gnarly and totally justified the GitHub-Proxy (and now the Redis-based replacement, IMO), but with the recent changes here I don't think it warrants github.com-only special casing (pending talking to GitHub about that docs weirdness). Instead of investing in moving the concurrency lock into the transport layer, I think we should be fine dropping it altogether.
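
For reference, a standalone sketch of the kind of Retry-After handling described above (the helper name and the one-hour cap are illustrative, not the actual client code):

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryAfter returns how long to wait before retrying a request that hit a
// secondary rate limit, based on the Retry-After header (GitHub sends it as
// seconds). It returns ok=false if the header is absent, unparseable, or the
// suggested wait is unreasonably far in the future.
func retryAfter(resp *http.Response, maxWait time.Duration) (wait time.Duration, ok bool) {
	raw := resp.Header.Get("Retry-After")
	if raw == "" {
		return 0, false
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs < 0 {
		return 0, false
	}
	wait = time.Duration(secs) * time.Second
	if wait > maxWait {
		return 0, false // too far in the future; surface the error instead
	}
	return wait, true
}

func main() {
	resp := &http.Response{Header: http.Header{"Retry-After": []string{"30"}}}
	if wait, ok := retryAfter(resp, time.Hour); ok {
		fmt.Printf("retrying after %s\n", wait)
	}
}
```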
2023-11-15 14:20:06 +01:00
Robert Lin
9009bb3d04
monitoring/telemetry: un-hide panels, improve docstrings (#57740) 2023-10-19 12:52:21 -07:00
Geoffrey Gilmore
9d34a48425
conf: add metric and associated alert if clients fail to update site configuration within 5 minutes (#57682) 2023-10-18 23:53:55 +00:00
Robert Lin
255f7eda39
monitoring: add queue growth panel and alert (#57222) 2023-10-02 10:59:41 -04:00
Robert Lin
96f2d595e0
monitoring: add telemetrygatewayexporter panels, improve metrics (#57171)
Part of https://github.com/sourcegraph/sourcegraph/issues/56970 - this adds some dashboards for the export side of things and improves the existing metrics. Only includes warnings.

## Test plan

Had to test locally only because I ended up changing the metrics a bit, but validated that the queue size metric works in S2.

Testing locally:

```yaml
# sg.config.overwrite.yaml
env:
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "30s"
  TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
  TELEMETRY_GATEWAY_EXPORTER_QUEUE_CLEANUP_INTERVAL: "10m"
```

```
sg start
sg start monitoring
```

Do lots of searches to generate events. Note: the `telemetry-export` feature flag must be enabled.

Data is not realistic because of the super high interval I configured for testing, but it shows that things work:

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/c44cd60e-514e-4b62-a6b6-890582d8059c)
2023-09-29 17:10:07 +00:00
Robert Lin
660996100c
monitoring: make email_delivery_failures on percentage, not count (#57045)
This changes the alert to fire a critical at a 10% failure rate and a warning at any non-zero failure rate (since all email delivery failures impact user experience).
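
A standalone sketch of the percentage-based thresholds described above (placeholder inputs; not the actual observable definition):

```go
package main

import "fmt"

// classify returns the alert level for an email delivery failure ratio: any
// non-zero failure rate warns, and a failure rate of 10% or more is critical.
func classify(failed, total float64) string {
	if total == 0 {
		return "ok"
	}
	switch ratio := failed / total; {
	case ratio >= 0.10:
		return "critical"
	case ratio > 0:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	fmt.Println(classify(0, 200))  // ok
	fmt.Println(classify(3, 200))  // warning (1.5%)
	fmt.Println(classify(30, 200)) // critical (15%)
}
```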
2023-09-26 09:04:29 -07:00
Erik Seliger
711ee1a495
Remove GitHub proxy service (#56485)
This service is being replaced by a redsync.Mutex that lives directly in the GitHub client.
With this change we will:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service to talk to GitHub.com successfully

Other repos referencing this service will be updated once this has shipped to dotcom and has proven to work over the course of a couple of days.
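
For reference, the general redsync locking pattern looks roughly like this (illustrative wiring and mutex name; the actual configuration inside the GitHub client differs):

```go
package main

import (
	"fmt"

	"github.com/go-redsync/redsync/v4"
	"github.com/go-redsync/redsync/v4/redis/goredis/v9"
	goredislib "github.com/redis/go-redis/v9"
)

func main() {
	// Connect to Redis and build a redsync instance backed by it.
	client := goredislib.NewClient(&goredislib.Options{Addr: "localhost:6379"})
	pool := goredis.NewPool(client)
	rs := redsync.New(pool)

	// A named distributed mutex; the name here is illustrative only.
	mu := rs.NewMutex("githubcom-requests")

	if err := mu.Lock(); err != nil {
		panic(err)
	}
	fmt.Println("holding the distributed lock; talk to GitHub.com here")
	if ok, err := mu.Unlock(); !ok || err != nil {
		panic(fmt.Sprintf("failed to release lock: %v", err))
	}
}
```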
2023-09-14 19:43:40 +02:00
Manuel Ucles
2892eba932
monitoring: update otel dashboard metrics (#55899)
* add metrics to OTEL

* monitoring: update otel dashboard metrics

* create a separate group for Queue Length

* Update monitoring/definitions/otel_collector.go

Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>

* update auto gen doc

* update auto gen alerts doc

---------

Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
2023-08-17 01:03:12 +00:00
Jean-Hadrien Chabran
f7703a4552
sg: ease transition to bazel handled go:generate (#55200)
* sg: gen.. go also runs //dev:write_all_generated

* doc: comment in docs now points to the right cmd

* sg: deprecate sg run monitoring-generator
2023-08-14 23:17:25 +02:00
Thorsten Ball
e6323ee6e4
Update o11y ownership from IAM/repo-mgmt to Source (#54368)
Does what it says in the title.

## Test plan

- Ran `sg generate`
2023-06-28 14:49:16 +01:00
Milan Freml
dbaa89269b
chore: remove unused metrics for perms syncer (#50605)
## Description

These are no longer published from anywhere in the code, so it's useless
to keep the definitions.

I actually got here by seeing the linter complain about unused functions.

## Test plan

not really much to test, mostly removal of code
2023-04-13 19:37:59 +02:00
Alex Ostrikov
a5e0970534
chore: add monitoring doc (#49284)
Test plan:
CI.
2023-03-15 12:55:16 +04:00
William Bezuidenhout
a070f8e3a2
lint fix (#49281)
Fixes the lint error on main
## Test plan
sg lint --fix
2023-03-14 05:28:29 +00:00
Idan Varsano
2f511740d7
Small correction on the blob_load_latency alert (#49266)
## Test plan

doc change
2023-03-13 22:44:39 +00:00
Jason Hawk Harris
524f3e1061
add better description to blob_load_latency (#49259)
## Test plan
eye test

2023-03-13 21:51:52 +00:00
Robert Lin
7706898fec
monitoring: downgrade commit_graph staleness critical alert to warning (#49059)
This alert has paged the Cloud team on several occasions and doesn't
seem immediately actionable - see
https://github.com/sourcegraph/customer/issues/1988 and related
discussions. tl;dr: this most often happens when a few repos are too
large or take too long to clone or process, which isn't actionable
enough to warrant a critical alert (none of the recent occurrences
indicate that the relevant services are starved for resources), so this
PR downgrades it to a warning.

## Test plan

CI
2023-03-10 01:43:15 +00:00
Rok Novosel
b37650e665
embeddings: container metrics (#48897)
Adds basic resource metrics for the embeddings container.

## Test plan

* Manual testing.
2023-03-08 10:42:10 +01:00
Stefan Hengl
6bcdf37544
alert: mention shard merging in alert's next steps (#47895)
Shard merging reduces the number of mmapped files and is a good option
to resolve this alert.

## Test plan
- just a copy change 

2023-02-24 10:07:23 +01:00
Keegan Carruthers-Smith
979f490a36
monitoring: vastly improved searcher dashboard (#47654)
While investigating searcher metrics I noticed that its dashboard has
almost no information compared to the wealth of metrics we record. As
such, I spiked an hour this morning adding dashboards to Grafana. For
context see these notes:
https://gist.github.com/keegancsmith/9e53a1df12b5b863249c59539c0410fd

Test Plan: Ran the monitoring stack against production Prometheus. Note:
I wasted a lot of time trying to get this to work on Linux, but it seems
a bunch of our stuff is broken there for local dev + Grafana, so this
only worked on my MacBook.

  kubectl --namespace=prod port-forward svc/prometheus 9090:30090
  sg run grafana monitoring-generator
2023-02-15 18:05:14 +02:00
Michael Lin
c174547b56
monitoring: disable executors rate critical error (#47425) 2023-02-06 21:24:26 +01:00
David Veszelovszki
e0f2e25c14
Regenerate two md files (#47183) 2023-01-31 14:25:45 +01:00
David Veszelovszki
68ed0abf80
Docs: Replace hyphens in text with em-dashes (#42367) 2023-01-31 13:18:49 +01:00
Robert Lin
40eedcac19
monitoring: extract into a submodule (#45786)
This change extracts `monitoring` into a submodule for import in `sourcegraph/controller` (https://github.com/sourcegraph/controller/pull/195) so that we can generate dashboards for Cloud instances. These steps were required:

1. Initialize a `go.mod` in `monitoring`
2. Extract `dev/sg/internal/cliutil` into `lib` to avoid illegal imports from `monitoring`
3. Add local replaces to both `sourcegraph/sourcegraph` and `monitoring`
4. `go mod tidy` on all submodules
5. Update `go generate ./monitoring` commands to use `sg`, since the `go generate` command no longer works
6. Update `grafana/build.sh`, `prometheus/build.sh` to build the submodule
7. Amend linters to check for multiple `go.mod` files and ban imports of `github.com/sourcegraph/sourcegraph`
8. Update `sg generate go` to run in directories rather than from root

The only caveat is that if you use VS Code, you will now need to open `monitoring` in a separate workspace or similar, like with `lib`.

Co-authored-by: Joe Chen <joe@sourcegraph.com>
2022-12-19 17:49:25 +00:00
William Bezuidenhout
6c7389f37c
otel: add collector dashboard (#45009)
* add initial dashboard for otel

* add failed sent dashboard

* extra panels

* use sum and rate for resource queries

* review comments

* add warning alerts

* Update monitoring/definitions/otel_collector.go

* review comments

* run go generate

* Update monitoring/definitions/otel_collector.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* Update monitoring/definitions/otel_collector.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* Update monitoring/definitions/otel_collector.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* Update monitoring/definitions/otel_collector.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* Update monitoring/definitions/otel_collector.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* Update monitoring/definitions/otel_collector.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* review comments

* review feedback also drop two panels

* remove brackets in metrics

* update docs

* fix goimport

* gogenerate

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
2022-12-19 13:18:51 +01:00
Robert Lin
c77aa64e0f
monitoring: update email panels, add multi-instance email panel (#44998) 2022-12-01 11:50:16 -08:00
Geoffrey Gilmore
983549d654
monitoring: zoekt - lower alert thresholds for memory map areas by 10% (#44376) 2022-11-28 18:27:34 +00:00
Robert Lin
381d171872
monitoring/gitserver: do not alert before janitor threshold (#44768)
Right now, we get a critical alert _before_ the janitor kicks in to enforce the default `SRC_REPOS_DESIRED_PERCENT_FREE`. A critical alert should only fire when the instance is in a critical state, but here the system may still recover by evicting deleted repositories, so we update the thresholds on `disk_space_remaining` such that:

1. warning fires when _approaching_ the default `SRC_REPOS_DESIRED_PERCENT_FREE`
2. critical fires if we surpass the default `SRC_REPOS_DESIRED_PERCENT_FREE` and gitserver is unable to recover in a short time span
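
A rough sketch of the intended threshold relationship (placeholder numbers; the real critical alert also requires the condition to persist for a short time span, i.e. the janitor was unable to recover):

```go
package main

import "fmt"

// severity maps remaining disk space (percent) to an alert level relative to
// the janitor's desired free-space target: warn as we approach the target,
// and only go critical once we are past it.
func severity(percentFree, desiredPercentFree float64) string {
	switch {
	case percentFree < desiredPercentFree:
		return "critical" // surpassed the target the janitor should enforce
	case percentFree < desiredPercentFree*1.5:
		return "warning" // approaching the desired free-space target
	default:
		return "ok"
	}
}

func main() {
	desired := 10.0 // stand-in for the SRC_REPOS_DESIRED_PERCENT_FREE default
	fmt.Println(severity(25, desired)) // ok
	fmt.Println(severity(12, desired)) // warning
	fmt.Println(severity(5, desired))  // critical
}
```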
2022-11-23 11:35:11 -08:00
Naman Kumar
4345944f0a
[Permissions Syncs] add new metrics tracking (#43736) 2022-11-09 22:45:54 +05:30
Noah S-C
dc1ef37e63
codeintel: add variable to service dashboards to filter by 'app' (#43865) 2022-11-03 15:54:37 +00:00
Robert Lin
e72981edc2
monitoring: add email delivery monitoring (#43281) 2022-10-31 17:10:03 +01:00