It seems many of our doc links for code hosts are broken in production due to a URL change from external_services to code_hosts. I did a find and replace to update all the ones I could find.
Noticed this omission while wondering whether we had goroutine leaks. Our other services define this.
I added a simple way to indicate the container name in the title, since this is the first service we've added that needs it.
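A minimal sketch of the idea (the helper below is hypothetical, not the actual change): panel titles get prefixed with the owning container's name so that definitions shared across services stay distinguishable.

```go
package main

import "fmt"

// titleWithContainer is a hypothetical helper illustrating the idea: prefix
// a panel title with the owning container's name so that definitions shared
// across services remain distinguishable.
func titleWithContainer(containerName, title string) string {
	if containerName == "" {
		return title
	}
	return fmt.Sprintf("%s: %s", containerName, title)
}

func main() {
	fmt.Println(titleWithContainer("gitserver", "goroutine count"))
}
```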
Test Plan: go test. Copy-pasted the generated query into Grafana Explore on dotcom.
- Removes duplicative disk I/O panels
- Adds warnings and next-step descriptions to the most critical items
- Hides the backend panel by default
- Adds a metric and alert for repo corruption events (rough sketch below)
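For context, the corruption signal boils down to a counter along these lines (the package, metric name, and label are illustrative assumptions, not the exact metric added here), which the new panel and alert then aggregate:

```go
package corruption

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// repoCorruptionEvents counts detected repository corruption events; the
// dashboard panel and alert would be driven by the rate of this counter.
var repoCorruptionEvents = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "src_gitserver_repo_corrupted_total", // assumed name
	Help: "Number of times a repository was detected as corrupt.",
}, []string{"reason"})

// recordCorruption increments the counter with the detection reason.
func recordCorruption(reason string) {
	repoCorruptionEvents.WithLabelValues(reason).Inc()
}
```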
We have a number of docs links in the product that point to the old doc site.
Method:
- Search the repo for `docs.sourcegraph.com`
- Exclude the `doc/` dir, all test fixtures, and `CHANGELOG.md`
- For each, replace `docs.sourcegraph.com` with `sourcegraph.com/docs`
- Navigate to the resulting URL to ensure it's not a dead link, updating the URL if necessary
Many of the updated URLs are just in comments, but since I was doing a manual audit of each URL anyway, it felt worth updating these while I was at it.
We've been running into spurious alerts on-and-off.
We should add observability here to better narrow down why
we're getting backlogs that are getting cleared later.
In the meantime, it doesn't make sense for this to
be a critical alert.
IMO, this is an unnecessary optimization that increases complexity, and in the current implementation it holds the lock for longer than it needs to, because in blocking clone mode the lock is only released once the clone has completed, limiting concurrency more than desired.
The clone limiter AND the RPS limiter are also still in place, so we have more than enough rate limiters here, IMO.
We discovered recently that the definition for the alert that fires if the site configuration hasn't been fetched within 5 minutes strips out the regex that targets individual services (since it uses a grafana variable). This means that every instance of this alert will fire if any individual service trips over this threshold.
This PR fixes the issue by adding a new `job` filter for this alert that targets only the services that have that Prometheus scrape target name. This works around the previous issue by using a fixed value for the `job` value instead of a dynamic grafana value.
The value of the job filter generally looks like `job=~.*$container_name` (following the strategy from https://sourcegraph.com/github.com/sourcegraph/sourcegraph@9a780f2e694238b5326e3e121d6a1828463001b9/-/blob/monitoring/monitoring/monitoring.go?L161 ), unless I noticed that the existing dashboards already used different logic for a given service.
Examples:
- `frontend`: already used `job=~"(sourcegraph-)?frontend"` for some metrics, so I used it again here
- `worker`: already used `job=~"^worker.*"` in some metrics, so I used it again and standardized the other existing panels to use the same shared variable
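A rough sketch of the resulting matchers (illustrative only, not the generator code itself): the per-service job regexes are baked in at dashboard-generation time instead of relying on the runtime Grafana variable.

```go
package alerts

// siteConfigJobMatchers illustrates the fixed per-service job filters for
// this alert. The frontend and worker entries mirror the existing
// dashboards; other services (assumed) follow the generic job=~".*<name>"
// form produced at generation time.
var siteConfigJobMatchers = map[string]string{
	"frontend": `job=~"(sourcegraph-)?frontend"`,
	"worker":   `job=~"^worker.*"`,
}
```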
## Test plan
I eyeballed the generated alert.md and dashboards.md to verify that my changes looked correct (that is, my refactors resulted in either no diff, or that the diff I generated still looked like valid regex).
Since we have distributed rate limits now, the last dependency is broken and we can move this subsystem around freely.
To make repo-updater more lightweight, worker will be the new home of this system.
## Test plan
Ran the stack locally; CI still passes, including integration tests.
In the upcoming release, we will only support gRPC going forward. This PR removes the old HTTP client and server implementations and a few leftovers from the transition.
https://github.com/sourcegraph/sourcegraph/pull/59284 dramatically reduced the `mean_blocked_seconds_per_conn_request` issues we've been seeing, but overall delays are still higher, even with generally healthy Cloud SQL resource utilization.
<img width="1630" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/91615471-5187-4d15-83e7-5cc94595303c">
Spot-checking the spikes in load in Cloud SQL, it seems there are a variety of causes for each spike (analytics workloads, Cody Gateway syncs, code intel workloads, gitserver things, `ListSourcegraphDotComIndexableRepos`, etc.), so I'm chalking this up to "expected". Since this alert is firing on a Cloud instance, let's just relax it for now so that it only fires a critical alert on very significant delays.
In INC-264, certain alerts - such as [zoekt: less than 90% percentage pods available for 10m0s](https://opsg.in/a/i/sourcegraph/178a626f-0f28-4295-bee9-84da988bb473-1703759057681) - didn't end up going anywhere because the ObservableOwner is defunct. This change adds _opt-in_ testing to report:
1. How many owners have valid Opsgenie teams
2. How many owners have valid handbook pages
In addition, we collect ObservableOwners that pass the test and use it to generate configuration for `site.json` in Sourcegraph.com: https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/18338 - this helps ensure the list is valid and not deceptively high-coverage.
The results are not great, but **enforcing** that owners are valid isn't currently in scope:
```
6/10 ObservableOwners do not have valid Opsgenie teams
3/10 ObservableOwners do not point to valid handbook pages
```
I also removed some defunct/unused functionality/owners.
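The handbook half of the check boils down to something like the sketch below (the helper is illustrative; the real test also validates that each owner maps to an existing Opsgenie team via the API):

```go
package monitoring_test

import (
	"net/http"
	"testing"
)

// checkHandbookPage verifies that an owner's handbook page actually
// resolves, reporting an error for defunct owners.
func checkHandbookPage(t *testing.T, owner, url string) {
	t.Helper()
	resp, err := http.Get(url)
	if err != nil {
		t.Errorf("owner %q: handbook page %s unreachable: %v", owner, url, err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Errorf("owner %q: handbook page %s returned %d", owner, url, resp.StatusCode)
	}
}
```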
## Test plan
To run these tests:
```
export OPSGENIE_API_KEY="..."
go test -timeout 30s github.com/sourcegraph/sourcegraph/monitoring/monitoring -update -online
```
Updates `DatabaseConnectionsMonitoringGroup` to accept an owner - before, it was just hardcoded to `ObservableOwnerDevOps`, which is not very helpful. This assigns some of the obvious service owners:
1. Source: gitserver, repo-updater
2. Cody: embeddings (but should eventually be @sourcegraph/search-platform, along with all embeddings alerts: https://github.com/sourcegraph/sourcegraph/pull/58474#issuecomment-1821505062)
Source is an active owner based on [thread](https://sourcegraph.slack.com/archives/C0652SSUA20/p1700592165408089?thread_ts=1700549423.860019&cid=C0652SSUA20), and Cody is a fairly recent addition so hopefully it's valid.
I'm not sure the Search one is still up-to-date, so I didn't change some of the obvious search services - for now, these still point to DevOps as they did before. If it becomes problematic we can revisit later.
Looks like GitHub.com has become more lenient, or at least more transparent, on their docs page: https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits. The paragraph about a single request per token is gone from this page! Instead, they now describe secondary rate limits quite well:
```
You may encounter a secondary rate limit if you:
Make too many concurrent requests. No more than 100 concurrent requests are allowed. This limit is shared across the REST API and GraphQL API.
Make too many requests to a single endpoint per minute. No more than 900 points per minute are allowed for REST API endpoints, and no more than 2,000 points per minute are allowed for the GraphQL API endpoint. For more information about points, see "Calculating points for the secondary rate limit."
Make too many requests per minute. No more than 90 seconds of CPU time per 60 seconds of real time is allowed. No more than 60 seconds of this CPU time may be for the GraphQL API. You can roughly estimate the CPU time by measuring the total response time for your API requests.
Create too much content on GitHub in a short amount of time. In general, no more than 80 content-generating requests per minute and no more than 500 content-generating requests per hour are allowed. Some endpoints have lower content creation limits. Content creation limits include actions taken on the GitHub web interface as well as via the REST API and GraphQL API.
```
So the limit is no longer 1, it is roughly 100. Well, that depends on which APIs you're calling, but whatever. Strangely, in the best practices section they still say that 1 request is advised; I followed up with a GitHub support ticket to clarify.
### Outcome
They said 100 is the limit, but for certain requests the number can be lower. This doesn't convince us (team Source) that it's worth keeping it.
Besides, they also document that they return a Retry-After header in this event, and we already handle that with retries (if the retry is not too far in the future). So I want to say that this is "no different than any other API" at this point. Sure, there are some limits that they enforce, but that's true for all APIs. The 1-concurrency limit was quite gnarly, which totally justified the GitHub-Proxy and now the Redis-based replacement IMO, but with the recent changes here I don't think it warrants github.com-only special casing anymore (pending talking to GitHub about that docs weirdness). Instead of investing in moving the concurrency lock into the transport layer, I think we should be fine dropping it altogether.
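For reference, the Retry-After handling we already have amounts to something like this sketch (the helper and its cutoff semantics are illustrative, not the actual client code): honor the header only when GitHub asks us to wait a reasonably short time.

```go
package github

import (
	"net/http"
	"strconv"
	"time"
)

// retryDelay reads the Retry-After header (in seconds) and reports whether
// the suggested wait is short enough to be worth retrying.
func retryDelay(resp *http.Response, maxWait time.Duration) (time.Duration, bool) {
	secs, err := strconv.Atoi(resp.Header.Get("Retry-After"))
	if err != nil || secs < 0 {
		return 0, false
	}
	wait := time.Duration(secs) * time.Second
	return wait, wait <= maxWait
}
```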
Part of https://github.com/sourcegraph/sourcegraph/issues/56970 - this adds some dashboards for the export side of things, as well as improves the existing metrics. Only includes warnings.
## Test plan
I could only test locally because I ended up changing the metrics a bit, but I validated that the queue size metric works in S2.
Testing locally:
```yaml
# sg.config.overwrite.yaml
env:
TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "30s"
TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
TELEMETRY_GATEWAY_EXPORTER_QUEUE_CLEANUP_INTERVAL: "10m"
```
```
sg start
sg start monitoring
```
Do lots of searches to generate events. Note that the `telemetry-export` feature flag must be enabled.
Data is not realistic because of the super high interval I configured for testing, but it shows that things work.

This changes the thresholds to fire a critical alert at a 10% failure rate, and a warning on any non-zero failure rate (as all email delivery failures impact user experience).
This service is being replaced by a redsync.Mutex that lives directly in the GitHub client.
With this change we will:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service to talk to GitHub.com successfully
Other repos referencing this service will be updated once this has shipped to dotcom and proven to work over the course of a couple days.
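For reference, the replacement is roughly a Redis-backed distributed mutex along these lines (a sketch only; the Redis address, key name, and expiry are illustrative assumptions, not the actual client code):

```go
package github

import (
	"time"

	"github.com/go-redsync/redsync/v4"
	"github.com/go-redsync/redsync/v4/redis/goredis/v9"
	goredislib "github.com/redis/go-redis/v9"
)

// newGitHubDotComMutex builds a distributed mutex that github.com requests
// can be bracketed with, replacing the standalone proxy service.
func newGitHubDotComMutex() *redsync.Mutex {
	client := goredislib.NewClient(&goredislib.Options{Addr: "127.0.0.1:6379"})
	rs := redsync.New(goredis.NewPool(client))
	return rs.NewMutex("github.com:concurrency", redsync.WithExpiry(30*time.Second))
}
```

Callers would then wrap outbound github.com requests in `Lock()`/`Unlock()`, keeping the access control inside the client rather than in a separate, non-HA service.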
* add metrics to OTEL
* monitoring: update otel dashboard metrics
* create a separate group for Queue Length
* Update monitoring/definitions/otel_collector.go
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
* update auto gen doc
* update auto gen alerts doc
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
## Description
These are no longer published from anywhere in the code, so it's useless to keep the definitions.
I actually got here after seeing the linter complain about unused functions.
## Test plan
Not much to test; this is mostly code removal.
This alert has paged the Cloud team on several occasions, and doesn't
seem immediately actionable - see
https://github.com/sourcegraph/customer/issues/1988 and related
discussions. tl;dr it seems that this most often happens when a few
repos are too large or take too long to clone or work on, which isn't
actionable enough to warrant a critical alert (all the recent
occurrences indicate none of the relevant services are starved for
resources) - this PR proposes that we downgrade it to a warning.
## Test plan
CI
While investigating searcher metrics I noticed that our dashboard for it has pretty much no information compared to the wealth of metrics we record. As such, I spiked an hour this morning adding dashboards to Grafana. For context see these notes:
https://gist.github.com/keegancsmith/9e53a1df12b5b863249c59539c0410fd
Test Plan: Ran the monitoring stack against production Prometheus. Note: I wasted a lot of time trying to get this working on Linux, but it seems a bunch of our stuff is broken there for local dev + Grafana, so this only worked on my MacBook.
kubectl --namespace=prod port-forward svc/prometheus 9090:30090
sg run grafana monitoring-generator
This change extracts `monitoring` into a submodule for import in `sourcegraph/controller` (https://github.com/sourcegraph/controller/pull/195) so that we can generate dashboards for Cloud instances. These steps were required:
1. Initialize a `go.mod` in `monitoring`
2. Extract `dev/sg/internal/cliutil` into `lib` to avoid illegal imports from `monitoring`
3. Add local replaces to both `sourcegraph/sourcegraph` and `monitoring` (rough sketch after this list)
4. `go mod tidy` on all submodules
5. Update `go generate ./monitoring` commands to use `sg`, since the `go generate` command no longer works
6. Update `grafana/build.sh`, `prometheus/build.sh` to build the submodule
7. Amend linters to check for multiple `go.mod` files and ban imports of `github.com/sourcegraph/sourcegraph`
8. Update `sg generate go` to run in directories rather than from root
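For step 3, the local replace wiring looks roughly like this (an illustrative sketch; versions and the exact set of requires are elided):

```
// monitoring/go.mod (sketch)
module github.com/sourcegraph/sourcegraph/monitoring

require github.com/sourcegraph/sourcegraph/lib v0.0.0

// build against the sibling checkout instead of a published version
replace github.com/sourcegraph/sourcegraph/lib => ../lib
```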
The only caveat is that if you use VS Code, you will now need to open `monitoring` in a separate workspace or similar, like with `lib`.
Co-authored-by: Joe Chen <joe@sourcegraph.com>
* add initial dashboard for otel
* add failed sent dashboard
* extra panels
* use sum and rate for resource queries
* review comments
* add warning alerts
* Update monitoring/definitions/otel_collector.go
* review comments
* run go generate
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* review comments
* review feedback also drop two panels
* remove brackets in metrics
* update docs
* fix goimport
* gogenerate
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
Right now, we get a critical alert _before_ the janitor kicks in to enforce the default `SRC_REPOS_DESIRED_PERCENT_FREE`. A critical alert should only fire when the instance is in a critical state, but here the system may still recover by evicting deleted repositories, so we update the thresholds on `disk_space_remaining` such that:
1. warning fires when _approaching_ the default `SRC_REPOS_DESIRED_PERCENT_FREE`
2. critical fires if we surpass the default `SRC_REPOS_DESIRED_PERCENT_FREE` and gitserver is unable to recover in a short time span
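Concretely, the intent is roughly the following (a sketch; the warning band is an assumption, and the critical case in the real alert must also persist for a short time span):

```go
package gitserver

// diskAlertLevel sketches the intended relationship between the thresholds
// and the desired-percent-free target; the 1.5x warning band is an
// assumption, not the exact generated threshold.
func diskAlertLevel(percentFree, desiredPercentFree float64) string {
	switch {
	case percentFree < desiredPercentFree:
		// Janitor target breached; the real critical alert additionally
		// requires this to persist for a short time span.
		return "critical"
	case percentFree < desiredPercentFree*1.5:
		// Approaching the janitor target.
		return "warning"
	default:
		return "ok"
	}
}
```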