It seems many of our doc links for code hosts are broken in production due to a URL change from external_services to code_hosts. I did a find and replace to update all the ones I could find.
Noticed this omission while wondering whether we had goroutine leaks. Our other services define this.
I added a simple way to indicate the container name in the title, since this is the first service we've added that needs it.
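A minimal sketch of the idea (the helper below is hypothetical, not the actual change): panel titles get prefixed with the owning container's name so that definitions shared across services stay distinguishable.

```go
package main

import "fmt"

// titleWithContainer is a hypothetical helper illustrating the idea: prefix
// a panel title with the owning container's name so that definitions shared
// across services remain distinguishable.
func titleWithContainer(containerName, title string) string {
	if containerName == "" {
		return title
	}
	return fmt.Sprintf("%s: %s", containerName, title)
}

func main() {
	fmt.Println(titleWithContainer("gitserver", "goroutine count"))
}
```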
Test Plan: go test. Copy-pasted the generated query into Grafana Explore on dotcom.
- Removes duplicative disk I/O panels
- Adds warnings and next-step descriptions to the most critical items
- Hides the backend panel by default
- Adds a metric and alert for repo corruption events (rough sketch below)
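For context, the corruption signal boils down to a counter along these lines (the package, metric name, and label are illustrative assumptions, not the exact metric added here), which the new panel and alert then aggregate:

```go
package corruption

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// repoCorruptionEvents counts detected repository corruption events; the
// dashboard panel and alert would be driven by the rate of this counter.
var repoCorruptionEvents = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "src_gitserver_repo_corrupted_total", // assumed name
	Help: "Number of times a repository was detected as corrupt.",
}, []string{"reason"})

// recordCorruption increments the counter with the detection reason.
func recordCorruption(reason string) {
	repoCorruptionEvents.WithLabelValues(reason).Inc()
}
```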
We have a number of docs links in the product that point to the old doc site.
Method:
- Search the repo for `docs.sourcegraph.com`
- Exclude the `doc/` dir, all test fixtures, and `CHANGELOG.md`
- For each, replace `docs.sourcegraph.com` with `sourcegraph.com/docs`
- Navigate to the resulting URL to ensure it's not a dead link, updating the URL if necessary
Many of the updated URLs are just in comments, but since I was doing a manual audit of each URL anyway, it felt worth updating these while I was at it.
We've been running into spurious alerts on-and-off.
We should add observability here to better narrow down why
we're getting backlogs that are getting cleared later.
In the meantime, it doesn't make sense for this to
be a critical alert.
IMO, this is an unnecessary optimization that increases complexity, and in the current implementation it holds the lock for longer than it needs to, because in blocking clone mode the lock is only released once the clone has completed, limiting concurrency more than desired.
The clone limiter AND the RPS limiter are also still in place, so we have more than enough rate limiters here, IMO.
We discovered recently that the definition for the alert that fires if the site configuration hasn't been fetched within 5 minutes strips out the regex that targets individual services (since it uses a grafana variable). This means that every instance of this alert will fire if any individual service trips over this threshold.
This PR fixes the issue by adding a new `job` filter for this alert that targets only the services that have that Prometheus scrape target name. This works around the previous issue by using a fixed value for the `job` value instead of a dynamic grafana value.
The value of the job filter generally looks like `job=~.*$container_name` (following the strategy from https://sourcegraph.com/github.com/sourcegraph/sourcegraph@9a780f2e694238b5326e3e121d6a1828463001b9/-/blob/monitoring/monitoring/monitoring.go?L161 ), unless I noticed that the existing dashboards already used different logic for a given service.
Examples:
- `frontend`: already used `job=~"(sourcegraph-)?frontend"` for some metrics, so I used it again here
- `worker`: already used `job=~"^worker.*"` in some metrics, so I used it again and standardized the other existing panels to use the same shared variable
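A rough sketch of the resulting matchers (illustrative only, not the generator code itself): the per-service job regexes are baked in at dashboard-generation time instead of relying on the runtime Grafana variable.

```go
package alerts

// siteConfigJobMatchers illustrates the fixed per-service job filters for
// this alert. The frontend and worker entries mirror the existing
// dashboards; other services (assumed) follow the generic job=~".*<name>"
// form produced at generation time.
var siteConfigJobMatchers = map[string]string{
	"frontend": `job=~"(sourcegraph-)?frontend"`,
	"worker":   `job=~"^worker.*"`,
}
```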
## Test plan
I eyeballed the generated alert.md and dashboards.md to verify that my changes looked correct (that is, my refactors resulted in either no diff, or that the diff I generated still looked like valid regex).
Since we have distributed rate limits now, the last dependency is broken and we can move this subsystem around freely.
To make repo-updater more lightweight, worker will be the new home of this system.
## Test plan
Ran the stack locally; CI still passes, including integration tests.
In the upcoming release, we will only support gRPC going forward. This PR removes the old HTTP client and server implementations and a few leftovers from the transition.
https://github.com/sourcegraph/sourcegraph/pull/59284 dramatically reduced the `mean_blocked_seconds_per_conn_request` issues we've been seeing, but overall delays are still higher, even with generally healthy Cloud SQL resource utilization.
<img width="1630" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/91615471-5187-4d15-83e7-5cc94595303c">
Spot-checking the spikes in load in Cloud SQL, it seems there are a variety of causes for each spike (analytics workloads, Cody Gateway syncs, code intel workloads, gitserver things, `ListSourcegraphDotComIndexableRepos`, etc.), so I'm chalking this up to "expected". Since this alert is firing on a Cloud instance, let's just relax it for now so that it only fires a critical alert on very significant delays.
In INC-264, certain alerts - such as [zoekt: less than 90% percentage pods available for 10m0s](https://opsg.in/a/i/sourcegraph/178a626f-0f28-4295-bee9-84da988bb473-1703759057681) - didn't end up going anywhere because the ObservableOwner is defunct. This change adds _opt-in_ testing to report:
1. How many owners have valid Opsgenie teams
2. How many owners have valid handbook pages
In addition, we collect ObservableOwners that pass the test and use it to generate configuration for `site.json` in Sourcegraph.com: https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/18338 - this helps ensure the list is valid and not deceptively high-coverage.
The results are not great, but **enforcing** that owners are valid isn't currently in scope:
```
6/10 ObservableOwners do not have valid Opsgenie teams
3/10 ObservableOwners do not point to valid handbook pages
```
I also removed some defunct/unused functionality/owners.
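The handbook half of the check boils down to something like the sketch below (the helper is illustrative; the real test also validates that each owner maps to an existing Opsgenie team via the API):

```go
package monitoring_test

import (
	"net/http"
	"testing"
)

// checkHandbookPage verifies that an owner's handbook page actually
// resolves, reporting an error for defunct owners.
func checkHandbookPage(t *testing.T, owner, url string) {
	t.Helper()
	resp, err := http.Get(url)
	if err != nil {
		t.Errorf("owner %q: handbook page %s unreachable: %v", owner, url, err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		t.Errorf("owner %q: handbook page %s returned %d", owner, url, resp.StatusCode)
	}
}
```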
## Test plan
To run these tests:
```
export OPSGENIE_API_KEY="..."
go test -timeout 30s github.com/sourcegraph/sourcegraph/monitoring/monitoring -update -online
```
Updates `DatabaseConnectionsMonitoringGroup` to accept an owner - before, it was just hardcoded to `ObservableOwnerDevOps`, which is not very helpful. This assigns some of the obvious service owners:
1. Source: gitserver, repo-updater
2. Cody: embeddings (but should eventually be @sourcegraph/search-platform, along with all embeddings alerts: https://github.com/sourcegraph/sourcegraph/pull/58474#issuecomment-1821505062)
Source is an active owner based on [thread](https://sourcegraph.slack.com/archives/C0652SSUA20/p1700592165408089?thread_ts=1700549423.860019&cid=C0652SSUA20), and Cody is a fairly recent addition so hopefully it's valid.
I'm not sure the Search one is still up-to-date, so I didn't change some of the obvious search services - for now, these still point to DevOps as they did before. If it becomes problematic we can revisit later.
Looks like GitHub.com has become more lenient, or at least more transparent, on their docs page: https://docs.github.com/en/rest/overview/rate-limits-for-the-rest-api?apiVersion=2022-11-28#about-secondary-rate-limits. The paragraph about a single request per token is gone from this page! Instead, they now describe secondary rate limits quite well:
```
You may encounter a secondary rate limit if you:
Make too many concurrent requests. No more than 100 concurrent requests are allowed. This limit is shared across the REST API and GraphQL API.
Make too many requests to a single endpoint per minute. No more than 900 points per minute are allowed for REST API endpoints, and no more than 2,000 points per minute are allowed for the GraphQL API endpoint. For more information about points, see "Calculating points for the secondary rate limit."
Make too many requests per minute. No more than 90 seconds of CPU time per 60 seconds of real time is allowed. No more than 60 seconds of this CPU time may be for the GraphQL API. You can roughly estimate the CPU time by measuring the total response time for your API requests.
Create too much content on GitHub in a short amount of time. In general, no more than 80 content-generating requests per minute and no more than 500 content-generating requests per hour are allowed. Some endpoints have lower content creation limits. Content creation limits include actions taken on the GitHub web interface as well as via the REST API and GraphQL API.
```
So the limit is no longer 1, it is roughly 100. Well, that depends on which APIs you're calling, but whatever. Strangely, in the best practices section they still say that 1 request is advised; I followed up with a GitHub support ticket to clarify.
### Outcome
They said 100 is the limit, but for certain requests the number can be lower. This doesn't convince us (team Source) that it's worth keeping it.
Besides, they also document that they return a Retry-After header in this event, and we already handle that with retries (if the retry is not too far in the future). So I want to say that this is "no different than any other API" at this point. Sure, there are some limits that they enforce, but that's true for all APIs. The 1-concurrency limit was quite gnarly, which totally justified the GitHub-Proxy and now the Redis-based replacement IMO, but with the recent changes here I don't think it warrants github.com-only special casing anymore (pending talking to GitHub about that docs weirdness). Instead of investing in moving the concurrency lock into the transport layer, I think we should be fine dropping it altogether.
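For reference, the Retry-After handling we already have amounts to something like this sketch (the helper and its cutoff semantics are illustrative, not the actual client code): honor the header only when GitHub asks us to wait a reasonably short time.

```go
package github

import (
	"net/http"
	"strconv"
	"time"
)

// retryDelay reads the Retry-After header (in seconds) and reports whether
// the suggested wait is short enough to be worth retrying.
func retryDelay(resp *http.Response, maxWait time.Duration) (time.Duration, bool) {
	secs, err := strconv.Atoi(resp.Header.Get("Retry-After"))
	if err != nil || secs < 0 {
		return 0, false
	}
	wait := time.Duration(secs) * time.Second
	return wait, wait <= maxWait
}
```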
Part of https://github.com/sourcegraph/sourcegraph/issues/56970 - this adds some dashboards for the export side of things, as well as improves the existing metrics. Only includes warnings.
## Test plan
I could only test locally because I ended up changing the metrics a bit, but I validated that the queue size metric works in S2.
Testing locally:
```yaml
# sg.config.overwrite.yaml
env:
TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "30s"
TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
TELEMETRY_GATEWAY_EXPORTER_QUEUE_CLEANUP_INTERVAL: "10m"
```
```
sg start
sg start monitoring
```
Do lots of searches to generate events. Note that the `telemetry-export` feature flag must be enabled.
Data is not realistic because of the super high interval I configured for testing, but it shows that things work.

This changes the thresholds to fire a critical alert at a 10% failure rate, and a warning on any non-zero failure rate (as all email delivery failures impact user experience).
This service is being replaced by a redsync.Mutex that lives directly in the GitHub client.
With this change we will:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service to talk to GitHub.com successfully
Other repos referencing this service will be updated once this has shipped to dotcom and proven to work over the course of a couple days.
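For reference, the replacement is roughly a Redis-backed distributed mutex along these lines (a sketch only; the Redis address, key name, and expiry are illustrative assumptions, not the actual client code):

```go
package github

import (
	"time"

	"github.com/go-redsync/redsync/v4"
	"github.com/go-redsync/redsync/v4/redis/goredis/v9"
	goredislib "github.com/redis/go-redis/v9"
)

// newGitHubDotComMutex builds a distributed mutex that github.com requests
// can be bracketed with, replacing the standalone proxy service.
func newGitHubDotComMutex() *redsync.Mutex {
	client := goredislib.NewClient(&goredislib.Options{Addr: "127.0.0.1:6379"})
	rs := redsync.New(goredis.NewPool(client))
	return rs.NewMutex("github.com:concurrency", redsync.WithExpiry(30*time.Second))
}
```

Callers would then wrap outbound github.com requests in `Lock()`/`Unlock()`, keeping the access control inside the client rather than in a separate, non-HA service.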
* add metrics to OTEL
* monitoring: update otel dashboard metrics
* create a separate group for Queue Length
* Update monitoring/definitions/otel_collector.go
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
* update auto gen doc
* update auto gen alerts doc
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
## Description
These are no longer published from anywhere in the code, so it's useless to keep the definitions.
I actually got here after seeing the linter complain about unused functions.
## Test plan
Not much to test; this is mostly code removal.
This alert has paged the Cloud team on several occasions, and doesn't
seem immediately actionable - see
https://github.com/sourcegraph/customer/issues/1988 and related
discussions. tl;dr it seems that this most often happens when a few
repos are too large or take too long to clone or work on, which isn't
actionable enough to warrant a critical alert (all the recent
occurrences indicate none of the relevant services are starved for
resources) - this PR proposes that we downgrade it to a warning.
## Test plan
CI
While investigating searcher metrics I noticed that our dashboard for it has pretty much no information compared to the wealth of metrics we record. As such, I spiked an hour this morning adding dashboards to Grafana. For context see these notes:
https://gist.github.com/keegancsmith/9e53a1df12b5b863249c59539c0410fd
Test Plan: Ran the monitoring stack against production Prometheus. Note: I wasted a lot of time trying to get this working on Linux, but it seems a bunch of our stuff is broken there for local dev + Grafana, so this only worked on my MacBook.
kubectl --namespace=prod port-forward svc/prometheus 9090:30090
sg run grafana monitoring-generator
This change extracts `monitoring` into a submodule for import in `sourcegraph/controller` (https://github.com/sourcegraph/controller/pull/195) so that we can generate dashboards for Cloud instances. These steps were required:
1. Initialize a `go.mod` in `monitoring`
2. Extract `dev/sg/internal/cliutil` into `lib` to avoid illegal imports from `monitoring`
3. Add local replaces to both `sourcegraph/sourcegraph` and `monitoring` (rough sketch after this list)
4. `go mod tidy` on all submodules
5. Update `go generate ./monitoring` commands to use `sg`, since the `go generate` command no longer works
6. Update `grafana/build.sh`, `prometheus/build.sh` to build the submodule
7. Amend linters to check for multiple `go.mod` files and ban imports of `github.com/sourcegraph/sourcegraph`
8. Update `sg generate go` to run in directories rather than from root
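For step 3, the local replace wiring looks roughly like this (an illustrative sketch; versions and the exact set of requires are elided):

```
// monitoring/go.mod (sketch)
module github.com/sourcegraph/sourcegraph/monitoring

require github.com/sourcegraph/sourcegraph/lib v0.0.0

// build against the sibling checkout instead of a published version
replace github.com/sourcegraph/sourcegraph/lib => ../lib
```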
The only caveat is that if you use VS Code, you will now need to open `monitoring` in a separate workspace or similar, like with `lib`.
Co-authored-by: Joe Chen <joe@sourcegraph.com>
* add initial dashboard for otel
* add failed sent dashboard
* extra panels
* use sum and rate for resource queries
* review comments
* add warning alerts
* Update monitoring/definitions/otel_collector.go
* review comments
* run go generate
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* review comments
* review feedback also drop two panels
* remove brackets in metrics
* update docs
* fix goimport
* gogenerate
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
Right now, we get a critical alert _before_ the janitor kicks in to enforce the default `SRC_REPOS_DESIRED_PERCENT_FREE`. A critical alert should only fire when the instance is in a critical state, but here the system may still recover by evicting deleted repositories, so we update the thresholds on `disk_space_remaining` such that:
1. warning fires when _approaching_ the default `SRC_REPOS_DESIRED_PERCENT_FREE`
2. critical fires if we surpass the default `SRC_REPOS_DESIRED_PERCENT_FREE` and gitserver is unable to recover in a short time span
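Concretely, the intent is roughly the following (a sketch; the warning band is an assumption, and the critical case in the real alert must also persist for a short time span):

```go
package gitserver

// diskAlertLevel sketches the intended relationship between the thresholds
// and the desired-percent-free target; the 1.5x warning band is an
// assumption, not the exact generated threshold.
func diskAlertLevel(percentFree, desiredPercentFree float64) string {
	switch {
	case percentFree < desiredPercentFree:
		// Janitor target breached; the real critical alert additionally
		// requires this to persist for a short time span.
		return "critical"
	case percentFree < desiredPercentFree*1.5:
		// Approaching the janitor target.
		return "warning"
	default:
		return "ok"
	}
}
```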