With this patch, the `errors.HasType` API behaves similar to `Is` and `As`,
where it checks the full error tree instead of just checking a linearized version
of it, as cockroachdb/errors's `HasType` implementation does not respect
multi-errors.
As a consequence, a bunch of relationships between HasType and Is/As that
you'd intuitively expect to hold are now true; see changes to `invariants_test.go`.
I think finding the right permissions confuses people pretty often when
first interacting with MSP. This adds a helper for annotating errors
returned from points where we might be able to help out @DaedalusG,
specifically for the situation in
https://sourcegraph.slack.com/archives/C05GJPTSZCZ/p1717629546727829😉
## Test plan
It's a little wordy but:
```
sg msp pg connect sams prod
❌ possible permissions error, ensure you have the prerequisite Entitle grants mentioned in https://sourcegraph.notion.site/3e59b9ac3d414a5f8fb5911eed1e418a: find IAM output: gcloud: failed to access secret "iam_operator_access_service_account" from "sams-prod-ywuz": rpc error: code = PermissionDenied desc = Permission 'secretmanager.versions.access' denied for resource 'projects/sams-prod-ywuz/secrets/iam_operator_access_service_account/versions/latest' (or it may not exist).
```
## Changelog
- `sg msp pg connect` will tell you about your service's generated
Notion page if you run into a permissions-looking error during command
setup, where there is guidance about the required Entitle requests.
1. The dashboard link still points to the old `go/msp-ops/...` which no
longer work (CORE-105)
2. Alerts defined on top of the MSP defaults are probably of more
interest, so let's sort these in front of the others
## Test plan
Unit/golden tests
Sandbox escapes be-gone
## Test plan
Tested in CI and locally with `bazel build //client/...` as well as a
lot of blood, sweat n tears tearing through failed sandboxes
## Changelog
During testing I found that sometimes some hooks would just hang and not
complete. In this PR we execute all hooks within a timeout context.
Ensuring we give _some_ time for hooks to execute but also making sure
we eventually exit if some hook is misbehaving.
Additional changes:
- Global timeout for all hook execution is 2 seconds
- We hard exit after 5 intterupts instead of 2
- Hooks are split into two groups: sequential and concurrent. As per
their names the hooks are executed differently depending how they were
registered.
## Test plan
Tested locally
```
^C⚠️ Interrupt received, executing hook groups for graceful shutdown...
⚠️ Executing 16 'cleanup' hooks for graceful shutdown...
[ repo-updater] INFO repo-updater.repo-updater.grpcserver grpcserver/grpcserver.go:76 Shutting down gRPC server
[ repo-updater] INFO sync_worker workerutil/worker.go:252 Shutting down dequeue loop {"name": "repo_sync_worker", "reason": ""}
worker stopped due to context error: context canceled
gitserver-1 stopped due to context error: context canceled
searcher stopped due to context error: context canceled
gitserver-0 stopped due to context error: context canceled
blobstore stopped due to context error: context canceled
symbols stopped due to context error: context canceled
caddy stopped due to context error: context canceled
repo-updater stopped due to context error: context canceled
embeddings stopped due to context error: context canceled
frontend stopped due to context error: context canceled
zoekt-index-0 stopped due to context error: context canceled
syntax-highlighter stopped due to context error: context canceled
zoekt-web-1 stopped due to context error: context canceled
web stopped due to context error: context canceled
zoekt-web-0 stopped due to context error: context canceled
⚠️ Executing 6 'general' hooks for for graceful shutdown...
❌ failed to run zoekt-index-1.
stderr:
INFO server zoekt-sourcegraph-indexserver/main.go:1017 removing tmp dir {"tmpRoot": "/Users/william/.sourcegraph/zoekt/index-1/.indexserver.tmp"}
2024/06/04 09:15:03 updating index 6 github.com/sourcegraph/sourcegraph@HEAD=e55003da894490122546f876452f651aae65bb55 reason=content-mismatch
INFO server zoekt-sourcegraph-indexserver/main.go:432 updated index {"repo": "github.com/sourcegraph/sourcegraph", "id": 6, "branches": ["HEAD=e55003da894490122546f876452f651aae65bb55"], "duration": "19.21403925s"}
```
## Changelog
- Hard exit sg when 5 intterupt hooks are received
- Respect the context while executing interrupt hooks to ensure we still
exit if some hook is misbehaving
The GCP monitoring alert configuration expects, for some reason, a
single-line PromQL query only, otherwise the threshold doesn't work. In
configuration, however, we may want to write a multi-line query, for
ease of readability. This change automatically flattens the PromQL query
into a single line and strips extra spaces.
Part of CORE-161
## Test plan
Unit tests
Bumps to rules_js (and friends) to 2.0 RCs.
This brings in performance improvements for analysis phase since npm package depsets and now much smaller. It also adds support for pnpm v9 and allows for linking js_library targets as 1p deps instead of npm_package targets. See https://github.com/aspect-build/rules_js/issues/1671 for more details.
## Test plan
CI
## Changelog
Deleting Notion pages takes a very long time, and is prone to breaking in the page deletion step, where we must delete blocks one at a time because Notion does not allow for bulk block deletions. The errors seem to generally just be random Notion internal errors. This is very bad because it leaves go/msp-ops pages in an unusable state.
To try and mitigate, we add several places to blindly retry:
1. At the Notion SDK level, where a config option is available for retrying 429 errors
2. At the "reset page" helper level, where a failure to reset a page will prompt a retry of the whole helper
3. At the "delete blocks" helper level, where individual block deletion failures will be retried
Attempt to mitigate https://linear.app/sourcegraph/issue/CORE-119
While here, I also made some other QOL tweaks:
- Fix timing of sub-tasks in CLI output
- Bump default concurrency to 5 (our retries will handle if this is too aggressive, hopefully)
- Fix a missing space in generated docs
## Test plan
```
sg msp ops generate-handbook-pages
```
Follow-ups for #62885:
- Better docstrings for `mql`, `promql`
- `duration` -> `durationMinutes` to align with other config
- `alertpolicy.ResponseCodeMetric` -> `spec.CustomAlertCondition`: they're effectively the same type
Test plan: CI
In a rushed POC of MSP jobs, I did some pretty bad copy-pasting (evidenced by all the service-specific docstrings I have removed in this PR) and made a bad configuration decision here, resulting in a few issues:
1. `schedule.deadline` is not actually applied to Cloud Run jobs, causing jobs to time out earlier than desired
2. `schedule.deadline` is not the right place to configure a deadline, because _all_ jobs need a configurable deadline, not just those with schedules. This change moves `schedule.deadline` to `deadlineSeconds`.
Closes CORE-145
## Test plan
```
$ sg msp generate gatekeeper prod
$ git diff
```
```diff
diff --git a/services/gatekeeper/service.yaml b/services/gatekeeper/service.yaml
index fd6a3812..ce4b02e3 100644
--- a/services/gatekeeper/service.yaml
+++ b/services/gatekeeper/service.yaml
@@ -48,4 +48,4 @@ environments:
- "primary"
schedule:
cron: 0 * * * *
- deadline: 1800 # 30 minutes
+ deadlineSeconds: 1800 # 30 minutes
diff --git a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
index 3c2c295e..f83b32b9 100644
--- a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
+++ b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
@@ -281,7 +281,7 @@
},
{
"name": "JOB_EXECUTION_DEADLINE",
- "value": "600s"
+ "value": "1800s"
}
],
"image": "us.gcr.io/sourcegraph-dev/abuse-ban-bot:${var.resolved_image_tag}",
@@ -302,7 +302,7 @@
}
],
"service_account": "${data.terraform_remote_state.cross-stack-reference-input-iam.outputs.cross-stack-output-google_service_accountiam-workload-accountemail}",
- "timeout": "300s",
+ "timeout": "1800s",
"volumes": [
],
"vpc_access": {
@@ -341,7 +341,7 @@
"uniqueId": "job_scheduler"
}
},
- "attempt_deadline": "600s",
+ "attempt_deadline": "1800s",
"depends_on": [
"google_cloud_run_v2_job_iam_member.cloudrun_scheduler_job_invoker"
],
```
## Changelog
- MSP jobs: `schedule.deadline` is deprecated, use the top-level `deadlineSeconds` instead. Configured deadlines are now correctly applied as the Cloud Run job execution timeout as well.
Minor QOL improvement, when you're in the Slack channel the chances are good that you might want the ops docs at some point.
## Test plan
n/a
## Changelog
- MSP-provisioned alerts Slack channels now include a link to the service's generated operational docs for a service (go/msp-ops) in the channel description.
Addresses problem noticed in https://github.com/sourcegraph/managed-services/pull/1486#issuecomment-2137887423
## Test plan
Unit tests
## Changelog
- Fixed an issue with output of `sg msp generate` for MSP jobs with particular schedules changing throughout the week
- MSP jobs schedules now must be between 15 minutes at the most frequent, and every week at the least frequent
Right now, developing SAMS clients involves raw cURL commands (see [operator cheat sheet](https://sourcegraph.notion.site/Sourcegraph-Accounts-infrastructure-operations-b90a571da30443a8b1e7c31ade3594fb)) (which is fine), but other steps like "testing auth" require using [accounts-clients-example](https://github.com/sourcegraph/sourcegraph-accounts/tree/main/cmd/accounts-client-example), which isn't very well documented and requires a bit of hand-wringing to get to and start using.
We previously talked about making a SAMS-specific CLI, but IMO that's a pretty big pain point if we want SAMS integration adoption when everything else lives in `sg`, and all the nice tooling lives here as well.
This PR migrates the next steps after using cURL to set up clients (`create-client-token` and `introspect-token`) from [accounts-clients-example](https://github.com/sourcegraph/sourcegraph-accounts/tree/main/cmd/accounts-client-example) to a new `sg sams` toolchain for better DX (docs, completions, flags)
## Test plan
```sh
export SG_SAMS_CLIENT_ID="..."
export SG_SAMS_CLIENT_SECRET="..."
sg sams create-client-token -s 'enterprise_portal::codyaccess::read'
```
---------
Co-authored-by: Joe Chen <joe@sourcegraph.com>
* sg: add cloud ephemeral dashboard command
`sg cloud ephemeral dashboard` will open the ops dashboard
`sg cloud ephemeral ops --name <instance>` will open the ops page for the given
instance
Currently, we provide single-file tools such as `ctags`, `gsutil` etc via an `sh_binary` wrapper, to have a single target to reference that automatically does platform selection of the underlying tool.
Due to some [unfortunate reason](https://github.com/bazelbuild/bazel/issues/11820), the underlying srcs (which is [a single file](https://bazel.build/reference/be/shell#sh_binary.srcs)) of an `sh_binary` are also exposed as outputs (rather than just as typical runfiles) alongside the script that wraps. This is _sometimes_ problematic when doing location expansion (e.g. `$(location ...)`) due to these only allowing a single output (dont ask why this works in some contexts but not others, I dont know).
To address this, we create a wrapper macro + rule to replicate what we want from `sh_binary` (automatic platform selection + tool naming), while only exposing a singular file.
See example of currently required approach to consuming a tool: [BUILD.bazel](https://github.com/sourcegraph/sourcegraph/pull/62801/files#diff-e2a562c2e13908933b2ee24f0ac596829b38a5325cc69a4aee05c383aaa2e494R8) & [main_test.go](https://github.com/sourcegraph/sourcegraph/pull/62801/files#diff-7a91cb5143064bfc8993ef97baf68b718ef49747ccc1d3c5e1150d4696b88305R66).
With this change, `rlocationpath` (singular) can be used instead (or any of the other singular nouns in different contexts), as well as no `strings.Split/strings.Fields` being required
## Test plan
`bazel cquery --output=files //dev/tools:dropdb` yields 1 vs 2 files.
Also updated the rule behind `//internal/database:generate_schemas` due to the workaround in it for the fact that the underlying srcs was also exposed. The correctness is verified by running said target (locally + CI)
This PR is a result/followup of the improvements we've made in the [SAMS repo](https://github.com/sourcegraph/sourcegraph-accounts/pull/199) that allows call sites to pass down a context (primarily to indicate deadline, and of course, cancellation if desired) and collects the error returned from `background.Routine`s `Stop` method.
Note that I did not adopt returning error from `Stop` method because I realize in monorepo, the more common (and arguably the desired) pattern is to hang on the call of `Start` method until `Stop` is called, so it is meaningless to collect errors from `Start` methods as return values anyway, and doing that would also complicate the design and semantics more than necessary.
All usages of the the `background.Routine` and `background.CombinedRoutines` are updated, I DID NOT try to interpret the code logic and make anything better other than fixing compile and test errors.
The only file that contains the core change is the [`lib/background/background.go`](https://github.com/sourcegraph/sourcegraph/pull/62136/files#diff-65c3228388620e91f8c22d91c18faac3f985fc67d64b08612df18fa7c04fafcd).