Commit Graph

3595 Commits

Author SHA1 Message Date
Noah S-C
bb178ba729
chore(tooling): bump Go version to 1.22.4 (#63124)
Bump for @evict 

## Test plan

CI passes with no complaints

## Changelog

- Bumped version of Go used to build to 1.22.4
2024-06-06 15:19:03 +00:00
Varun Gandhi
2955bb6cfb
chore: Change errors.HasType to respect multi-errors (#63024)
With this patch, the `errors.HasType` API behaves similarly to `Is` and `As`,
where it checks the full error tree instead of just checking a linearized version
of it, as cockroachdb/errors's `HasType` implementation does not respect
multi-errors.

As a consequence, a bunch of relationships between HasType and Is/As that
you'd intuitively expect to hold are now true; see changes to `invariants_test.go`.
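
The tree traversal described above can be sketched with the standard library alone. This is a minimal illustration and not the code from this patch: `hasType`, `notFoundError`, and `buildExample` are hypothetical names, and the sketch assumes Go 1.20+ multi-errors that expose `Unwrap() []error` (as `errors.Join` does).

```go
package main

import (
	"errors"
	"fmt"
)

// notFoundError is a hypothetical error type used only for this illustration.
type notFoundError struct{ name string }

func (e *notFoundError) Error() string { return e.name + " not found" }

// hasType reports whether any error in err's tree has concrete type T.
// It follows both single wrapping (Unwrap() error) and multi-error wrapping
// (Unwrap() []error) instead of only a linear chain.
func hasType[T error](err error) bool {
	if err == nil {
		return false
	}
	if _, ok := err.(T); ok {
		return true
	}
	switch e := err.(type) {
	case interface{ Unwrap() error }:
		return hasType[T](e.Unwrap())
	case interface{ Unwrap() []error }:
		for _, child := range e.Unwrap() {
			if hasType[T](child) {
				return true
			}
		}
	}
	return false
}

// buildExample wraps a multi-error whose second branch holds a *notFoundError.
func buildExample() error {
	joined := errors.Join(errors.New("disk full"), &notFoundError{name: "repo"})
	return fmt.Errorf("sync failed: %w", joined)
}

func main() {
	fmt.Println(hasType[*notFoundError](buildExample()))          // true
	fmt.Println(hasType[*notFoundError](errors.New("unrelated"))) // false
}
```

A check that only follows `Unwrap() error` would stop at the joined error and miss the `*notFoundError` in its second branch.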
2024-06-06 13:02:14 +00:00
Robert Lin
6302955caf
feat/sg-msp-pg: add suggestion to check msp-ops page on perms error (#63118)
I think finding the right permissions confuses people pretty often when
first interacting with MSP. This adds a helper for annotating errors
returned from points where we might be able to help out @DaedalusG,
specifically for the situation in
https://sourcegraph.slack.com/archives/C05GJPTSZCZ/p1717629546727829 😉

## Test plan

It's a little wordy but:

```
sg msp pg connect sams prod
 possible permissions error, ensure you have the prerequisite Entitle grants mentioned in https://sourcegraph.notion.site/3e59b9ac3d414a5f8fb5911eed1e418a: find IAM output: gcloud: failed to access secret "iam_operator_access_service_account" from "sams-prod-ywuz": rpc error: code = PermissionDenied desc = Permission 'secretmanager.versions.access' denied for resource 'projects/sams-prod-ywuz/secrets/iam_operator_access_service_account/versions/latest' (or it may not exist).
```

## Changelog

- `sg msp pg connect` will tell you about your service's generated
Notion page if you run into a permissions-looking error during command
setup, where there is guidance about the required Entitle requests.
2024-06-05 18:55:59 -07:00
James Cotter
bcc4367f86
msp/deploy: add 'author' and 'commit_message' annotations (#63108)
Add 'author' and 'commit_message' annotations on release

## Test plan
CI
2024-06-05 11:43:02 -07:00
Robert Lin
27211dea73
feat/msp: update handbook link in alerts dashboard, sort custom alerts first (#63089)
1. The dashboard link still points to the old `go/msp-ops/...` which no
longer works (CORE-105)
2. Alerts defined on top of the MSP defaults are probably of more
interest, so let's sort these in front of the others

## Test plan

Unit/golden tests
2024-06-05 09:09:22 -07:00
Noah S-C
4a93f29755
chore(bazel): enable rules_esbuild sandbox with object-inspect workaround (#61969)
Sandbox escapes be-gone

## Test plan

Tested in CI and locally with `bazel build //client/...` as well as a
lot of blood, sweat n tears tearing through failed sandboxes

## Changelog
2024-06-05 15:34:29 +01:00
William Bezuidenhout
605b2305eb
chore(sg): move registry list cmd to release list (#63094)
Follow up from https://github.com/sourcegraph/sourcegraph/pull/63079

## Test plan
Tested locally

## Changelog
2024-06-05 10:25:38 +02:00
William Bezuidenhout
e4eec6668a
feat(sg): respect the context when executing interrupt hooks (#63069)
During testing I found that sometimes some hooks would just hang and not
complete. In this PR we execute all hooks within a timeout context.
Ensuring we give _some_ time for hooks to execute but also making sure
we eventually exit if some hook is misbehaving.

Additional changes:
- Global timeout for all hook execution is 2 seconds
- We hard exit after 5 interrupts instead of 2
- Hooks are split into two groups: sequential and concurrent. As per
their names the hooks are executed differently depending how they were
registered.
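
As a rough sketch of the approach described above (not the actual `sg` code), the two hook groups can share one timeout context. The names `hook`, `runHooks`, `quick`, and `stuck` are hypothetical, and running the sequential group before the concurrent one is an assumption made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// hook is a hypothetical shutdown callback; well-behaved hooks return once ctx is done.
type hook func(ctx context.Context)

// runHooks runs the sequential hooks in order, then the concurrent hooks in
// parallel, all under one shared timeout. It reports whether every hook
// finished before the timeout expired.
func runHooks(timeout time.Duration, sequential, concurrent []hook) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan struct{})
	go func() {
		defer close(done)
		for _, h := range sequential {
			h(ctx)
		}
		var wg sync.WaitGroup
		for _, h := range concurrent {
			wg.Add(1)
			go func(h hook) {
				defer wg.Done()
				h(ctx)
			}(h)
		}
		wg.Wait()
	}()

	select {
	case <-done:
		return true
	case <-ctx.Done():
		// A misbehaving hook is abandoned so the process can still exit.
		return false
	}
}

// quick returns immediately; stuck ignores ctx on purpose to simulate a hang.
func quick(ctx context.Context) {}
func stuck(ctx context.Context) { time.Sleep(time.Second) }

func main() {
	fmt.Println(runHooks(100*time.Millisecond, []hook{quick}, []hook{quick, quick})) // true
	fmt.Println(runHooks(100*time.Millisecond, []hook{quick}, []hook{stuck}))        // false
}
```

The caller decides what to do on `false`; in a CLI that is typically the point at which the process exits anyway.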


## Test plan
Tested locally

```
^C⚠️ Interrupt received, executing hook groups for graceful shutdown...
⚠️ Executing 16 'cleanup' hooks for graceful shutdown...
[   repo-updater] INFO repo-updater.repo-updater.grpcserver grpcserver/grpcserver.go:76 Shutting down gRPC server
[   repo-updater] INFO sync_worker workerutil/worker.go:252 Shutting down dequeue loop {"name": "repo_sync_worker", "reason": ""}
worker stopped due to context error: context canceled
gitserver-1 stopped due to context error: context canceled
searcher stopped due to context error: context canceled
gitserver-0 stopped due to context error: context canceled
blobstore stopped due to context error: context canceled
symbols stopped due to context error: context canceled
caddy stopped due to context error: context canceled
repo-updater stopped due to context error: context canceled
embeddings stopped due to context error: context canceled
frontend stopped due to context error: context canceled
zoekt-index-0 stopped due to context error: context canceled
syntax-highlighter stopped due to context error: context canceled
zoekt-web-1 stopped due to context error: context canceled
web stopped due to context error: context canceled
zoekt-web-0 stopped due to context error: context canceled
⚠️ Executing 6 'general' hooks for for graceful shutdown...
 failed to run zoekt-index-1.
stderr:
INFO server zoekt-sourcegraph-indexserver/main.go:1017 removing tmp dir {"tmpRoot": "/Users/william/.sourcegraph/zoekt/index-1/.indexserver.tmp"}
2024/06/04 09:15:03 updating index 6 github.com/sourcegraph/sourcegraph@HEAD=e55003da894490122546f876452f651aae65bb55 reason=content-mismatch
INFO server zoekt-sourcegraph-indexserver/main.go:432 updated index {"repo": "github.com/sourcegraph/sourcegraph", "id": 6, "branches": ["HEAD=e55003da894490122546f876452f651aae65bb55"], "duration": "19.21403925s"}
```


## Changelog
- Hard exit sg when 5 interrupts are received
- Respect the context while executing interrupt hooks to ensure we still
exit if some hook is misbehaving
2024-06-05 10:06:58 +02:00
Robert Lin
a3fe573b59
fix/msp: flatten custom alert promQL query for GCP (#63084)
The GCP monitoring alert configuration expects, for some reason, a
single-line PromQL query only, otherwise the threshold doesn't work. In
configuration, however, we may want to write a multi-line query, for
ease of readability. This change automatically flattens the PromQL query
into a single line and strips extra spaces.
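
One simple way to do this kind of flattening in Go is shown below. It is a sketch and not necessarily how this change implements it; `flattenQuery` is a hypothetical name and the query is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenQuery collapses a multi-line query into one line, replacing every
// run of whitespace (newlines and indentation included) with a single space.
func flattenQuery(query string) string {
	return strings.Join(strings.Fields(query), " ")
}

func main() {
	query := `
sum by (job) (
  rate(http_requests_total{code=~"5.."}[5m])
) > 10
`
	fmt.Println(flattenQuery(query))
	// sum by (job) ( rate(http_requests_total{code=~"5.."}[5m]) ) > 10
}
```

Note that this naive version also collapses whitespace inside quoted label values, which may or may not matter for a given query.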

Part of CORE-161

## Test plan

Unit tests
2024-06-04 14:37:51 -07:00
William Bezuidenhout
8f3a9d5260
sg: add command to fetch versions from release registry (#63079)
Added a command to list versions from the release registry
2024-06-04 17:42:47 +02:00
William Bezuidenhout
9bbfd25fc4
feat(sg): add list-build subcommand to ci (#63071)
* sg: add `list-build` subcommand to ci

Add command to list builds in various states on a pipeline

* bazel

remove trailing '...' from commit printing
2024-06-04 13:41:44 +02:00
Greg Magolan
2d3d918ffa
chore(bazel): upgrade to rules_js 2.0 RC (#63022)
Bumps rules_js (and friends) to 2.0 RCs.

This brings in performance improvements for the analysis phase, since npm package depsets are now much smaller. It also adds support for pnpm v9 and allows for linking js_library targets as 1p deps instead of npm_package targets. See https://github.com/aspect-build/rules_js/issues/1671 for more details.

## Test plan

CI

## Changelog
2024-06-04 11:26:42 +00:00
William Bezuidenhout
1a7e1b9686
build-tracker: remove old links (#63065) 2024-06-04 12:03:58 +01:00
Robert Lin
908d7119ea
chore/msp: blindly retry Notion page deletion (#63052)
Deleting Notion pages takes a very long time, and is prone to breaking in the page deletion step, where we must delete blocks one at a time because Notion does not allow for bulk block deletions. The errors seem to generally just be random Notion internal errors. This is very bad because it leaves go/msp-ops pages in an unusable state.

To try and mitigate, we add several places to blindly retry:

1. At the Notion SDK level, where a config option is available for retrying 429 errors
2. At the "reset page" helper level, where a failure to reset a page will prompt a retry of the whole helper
3. At the "delete blocks" helper level, where individual block deletion failures will be retried
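
The "blind retry" idea at the helper levels can be sketched as follows. This is an illustration only: `retry`, `errTransient`, and the simulated `deleteBlock` call are hypothetical and do not come from the Notion SDK or from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient stands in for a random upstream internal error.
var errTransient = errors.New("transient internal error")

// retry calls fn up to attempts times (attempts must be at least 1). It
// returns nil on the first success, or the last error if every attempt
// fails. The error is never inspected, hence "blind".
func retry(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	calls := 0
	// deleteBlock simulates one block deletion that fails twice, then succeeds.
	deleteBlock := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}
	err := retry(5, deleteBlock)
	fmt.Println(err, calls) // <nil> 3
}
```

Wrapping a whole helper (such as a page reset) in the same `retry` call gives the coarser-grained retry level described in the list.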

Attempt to mitigate https://linear.app/sourcegraph/issue/CORE-119

While here, I also made some other QOL tweaks:

- Fix timing of sub-tasks in CLI output
- Bump default concurrency to 5 (our retries will handle if this is too aggressive, hopefully)
- Fix a missing space in generated docs

## Test plan

```
sg msp ops generate-handbook-pages   
```
2024-06-03 22:32:06 +00:00
Joe Chen
dd8ff6013f
worker: add SAMS notifications subscriber (#63051)
Part of CORE-92

This PR adds a new worker for subscribing to [SAMS notifications](https://www.notion.so/sourcegraph/SAMS-notifications-distribution-system-0d174480e0044b05b545d37d24263d5a). The current use case is to automatically (hard-)delete users on Sourcegraph.com when the corresponding user is deleted from SAMS.

This worker is only started when running in the Sourcegraph.com mode and the credentials file (`service_account.json`) is provided, which has been configured since https://github.com/sourcegraph/deploy-sourcegraph-cloud/pull/18591.

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-06-03 18:01:19 -04:00
Robert Lin
617d2f766c
chore/msp/spec: tidy up custom alerts spec (#63050)
Follow-ups for #62885:

- Better docstrings for `mql`, `promql`
- `duration` -> `durationMinutes` to align with other config
- `alertpolicy.ResponseCodeMetric` -> `spec.CustomAlertCondition`: they're effectively the same type

Test plan: CI
2024-06-03 13:53:01 -07:00
Bolaji Olajide
9e2b56119f
feat(release): allow creation of multiple patch release events (#63034)
* allow creation of multiple patch release events

* skip old month releases

* update config
2024-06-03 11:14:24 -04:00
Bolaji Olajide
bab01ccaac
feat(release): rename code freeze event to branch cut event (#63033)
rename code freeze event to branch cut
2024-06-03 05:13:32 -05:00
William Bezuidenhout
4cf94e9e8c
sg: speed up interrupt execution (#63032) 2024-06-03 09:54:51 +00:00
Greg Magolan
a3afa08161
chore(bazel): bump to aspect_bazel_lib 2.7.7 (#63012) 2024-05-31 23:08:52 +01:00
Robert Lin
012db75133
fix/msp: make deadlineSeconds job-level configuration, apply in timeout (#63017)
In a rushed POC of MSP jobs, I did some pretty bad copy-pasting (evidenced by all the service-specific docstrings I have removed in this PR) and made a bad configuration decision here, resulting in a few issues:

1. `schedule.deadline` is not actually applied to Cloud Run jobs, causing jobs to time out earlier than desired
2. `schedule.deadline` is not the right place to configure a deadline, because _all_ jobs need a configurable deadline, not just those with schedules. This change moves `schedule.deadline` to `deadlineSeconds`.

Closes CORE-145

## Test plan

```
$ sg msp generate gatekeeper prod
$ git diff
```

```diff                    
diff --git a/services/gatekeeper/service.yaml b/services/gatekeeper/service.yaml
index fd6a3812..ce4b02e3 100644
--- a/services/gatekeeper/service.yaml
+++ b/services/gatekeeper/service.yaml
@@ -48,4 +48,4 @@ environments:
           - "primary"
     schedule:
       cron: 0 * * * *
-      deadline: 1800 # 30 minutes
+    deadlineSeconds: 1800 # 30 minutes
diff --git a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
index 3c2c295e..f83b32b9 100644
--- a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
+++ b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
@@ -281,7 +281,7 @@
                   },
                   {
                     "name": "JOB_EXECUTION_DEADLINE",
-                    "value": "600s"
+                    "value": "1800s"
                   }
                 ],
                 "image": "us.gcr.io/sourcegraph-dev/abuse-ban-bot:${var.resolved_image_tag}",
@@ -302,7 +302,7 @@
               }
             ],
             "service_account": "${data.terraform_remote_state.cross-stack-reference-input-iam.outputs.cross-stack-output-google_service_accountiam-workload-accountemail}",
-            "timeout": "300s",
+            "timeout": "1800s",
             "volumes": [
             ],
             "vpc_access": {
@@ -341,7 +341,7 @@
             "uniqueId": "job_scheduler"
           }
         },
-        "attempt_deadline": "600s",
+        "attempt_deadline": "1800s",
         "depends_on": [
           "google_cloud_run_v2_job_iam_member.cloudrun_scheduler_job_invoker"
         ],
```

## Changelog

- MSP jobs: `schedule.deadline` is deprecated, use the top-level `deadlineSeconds` instead. Configured deadlines are now correctly applied as the Cloud Run job execution timeout as well.
2024-05-31 21:15:31 +00:00
Greg Magolan
bbae7a4954
build(bazel): bump to rules_esbuild 0.16.0 (#63005)
* build(bazel): pin bazel fetched esbuild version to 0.19.2

* build(bazel): bump to rules_esbuild 0.16.0

* Update WORKSPACE

Co-authored-by: Noah S-C <noah@sourcegraph.com>

---------

Co-authored-by: Noah S-C <noah@sourcegraph.com>
2024-05-31 11:20:23 -07:00
Robert Lin
7170d4bd2b
feat/msp: add link to ops page in Slack channel description (#63011)
Minor QOL improvement: when you're in the Slack channel, chances are good that you'll want the ops docs at some point.

## Test plan

n/a

## Changelog

- MSP-provisioned alerts Slack channels now include a link to the service's generated operational docs (go/msp-ops) in the channel description.
2024-05-31 17:59:22 +00:00
William Bezuidenhout
0fcffdd657
fix(sg): cloud eph - do not fail just because we cannot parse reason (#62989)
* do not fail just because we cannot parse reason

* fix tests

* whitespace
2024-05-31 14:18:56 +00:00
Noah S-C
e1974fe9f5
chore(bazel): update ownership tags to increase coverage (#63001)
Brings us up to 73%, a bit of buffer room


## Test plan

`./dev/check-test-ownership.sh` prints out 73


## Changelog
2024-05-31 14:10:29 +00:00
Noah S-C
79fce8c73e
feat(ci): add GHA to report when Bazel test ownership drops below 70% threshold (#62985)
This PR adds a non-blocking GHA check to report when a branch's Bazel test ownership drops below 70%. For an example of what the message looks like, see https://github.com/sourcegraph/sourcegraph/pull/62985#issuecomment-2139439084. The message will be updated if the threshold is reached/breached whenever the branch changes.

## Test plan

Extensive iteration in this PR, see below message https://github.com/sourcegraph/sourcegraph/pull/62985#issuecomment-2139439084


## Changelog
2024-05-31 14:46:01 +01:00
William Bezuidenhout
bc73643a5d
chore(sg): cloud ephemeral fix instance check (#62988)
* fix instance check

* fix name sanitization
2024-05-31 13:39:46 +00:00
James Cotter
2f4e3b9272
sg/msp: fix nil domain and EnvironmentDomainTypeNone in diagram gen (#62982) 2024-05-30 17:58:00 +01:00
Robert Lin
9e4a8e8033
feat/sg/msp: add 'sg msp validate' for validating service specifications (#62973) 2024-05-30 09:11:36 -07:00
Robert Lin
27f0d725ac
feat/msp/spec: require notionPageID if a production env is provisioned (#62972) 2024-05-30 09:01:42 -07:00
Robert Lin
cb62afa2c2
fix/msp: test for cron interval changes based on time, add more restrictions (#62969)
Addresses problem noticed in https://github.com/sourcegraph/managed-services/pull/1486#issuecomment-2137887423

## Test plan

Unit tests

## Changelog

- Fixed an issue with output of `sg msp generate` for MSP jobs with particular schedules changing throughout the week
- MSP job schedules must now run no more frequently than every 15 minutes and no less frequently than once a week
2024-05-29 18:24:39 -07:00
Anish Lakhwara
de920065ea
feat(sg): add version=auto for sg release cut (#62970) 2024-05-29 14:57:18 -07:00
Robert Lin
28324a3d95
feat/sg/enterprise-portal: use externalSecret to configure SAMS client secret (#62953) 2024-05-28 15:27:30 -07:00
Robert Lin
de9a31aa89
feat/sg: add 'sg sams' commands 'create-client-token' and 'introspect-token' (#62883)
Right now, developing SAMS clients involves raw cURL commands (see [operator cheat sheet](https://sourcegraph.notion.site/Sourcegraph-Accounts-infrastructure-operations-b90a571da30443a8b1e7c31ade3594fb)) (which is fine), but other steps like "testing auth" require using [accounts-clients-example](https://github.com/sourcegraph/sourcegraph-accounts/tree/main/cmd/accounts-client-example), which isn't very well documented and requires a bit of hand-wringing to get to and start using.

We previously talked about making a SAMS-specific CLI, but IMO that's a pretty big pain point if we want SAMS integration adoption when everything else lives in `sg`, and all the nice tooling lives here as well.

This PR migrates the next steps after using cURL to set up clients (`create-client-token` and `introspect-token`) from [accounts-clients-example](https://github.com/sourcegraph/sourcegraph-accounts/tree/main/cmd/accounts-client-example) to a new `sg sams` toolchain for better DX (docs, completions, flags)

## Test plan

```sh
export SG_SAMS_CLIENT_ID="..."
export SG_SAMS_CLIENT_SECRET="..."
sg sams create-client-token -s 'enterprise_portal::codyaccess::read'
```

---------

Co-authored-by: Joe Chen <joe@sourcegraph.com>
2024-05-28 21:08:42 +00:00
William Bezuidenhout
462ba3de0b
fix(sg): fix error condition for cloud eph deployment that already exists (#62947)
fix error condition for deploy already existing
2024-05-28 15:25:13 +00:00
William Bezuidenhout
acf051ad66
feat(local): add cloud ephemeral dashboard command (#62945)
* sg: add cloud ephemeral dashboard command

`sg cloud ephemeral dashboard` will open the ops dashboard
`sg cloud ephemeral ops --name <instance>` will open the ops page for the given
instance
2024-05-28 16:12:14 +02:00
Jean-Hadrien Chabran
21b2918ef2
chore(local): catch bazel-do issues before push (#62943) 2024-05-28 15:16:13 +02:00
Will Dollman
d1b71a0a8a
bazel: Cleanup oci_deps.bzl (#62769)
* security: Update dind base image to patch multiple CVEs

Patches CVE-2023-45288 CVE-2024-2511 CVE-2024-32002 CVE-2024-32004 CVE-2024-32020 CVE-2024-32021 CVE-2024-32465

* ci: Tweak automated security update PR title

* Remove unused image hashes from oci_deps

* Tweak oci_deps comment

* Fixup old @wolfi_base references

* Add wolfi_base load

* use the correct base image

* Remove unneeded wolfi_base call
2024-05-28 10:00:31 +01:00
Noah S-C
7009f1dfe4
bazel: add utility macro for wrapping single-file tools (#62930)
Currently, we provide single-file tools such as `ctags`, `gsutil` etc via an `sh_binary` wrapper, to have a single target to reference that automatically does platform selection of the underlying tool. 
Due to some [unfortunate reason](https://github.com/bazelbuild/bazel/issues/11820), the underlying srcs (which is [a single file](https://bazel.build/reference/be/shell#sh_binary.srcs)) of an `sh_binary` are also exposed as outputs (rather than just as typical runfiles) alongside the script that wraps it. This is _sometimes_ problematic when doing location expansion (e.g. `$(location ...)`) due to these only allowing a single output (don't ask why this works in some contexts but not others, I don't know).
To address this, we create a wrapper macro + rule to replicate what we want from `sh_binary` (automatic platform selection + tool naming), while only exposing a singular file.

See example of currently required approach to consuming a tool: [BUILD.bazel](https://github.com/sourcegraph/sourcegraph/pull/62801/files#diff-e2a562c2e13908933b2ee24f0ac596829b38a5325cc69a4aee05c383aaa2e494R8) & [main_test.go](https://github.com/sourcegraph/sourcegraph/pull/62801/files#diff-7a91cb5143064bfc8993ef97baf68b718ef49747ccc1d3c5e1150d4696b88305R66).

With this change, `rlocationpath` (singular) can be used instead (or any of the other singular nouns in different contexts), and `strings.Split`/`strings.Fields` is no longer required.

## Test plan

`bazel cquery --output=files //dev/tools:dropdb` yields 1 vs 2 files.
Also updated the rule behind `//internal/database:generate_schemas` due to the workaround in it for the fact that the underlying srcs was also exposed. The correctness is verified by running said target (locally + CI)
2024-05-27 16:53:51 +00:00
William Bezuidenhout
57824e6374
sg: cloud ephemeral - handle multiple job reasons (#62929)
* sg: cloud ephemeral handle multiple job reasons

* update cloud printers to show overall job status

* nogo
2024-05-27 18:50:40 +02:00
Jean-Hadrien Chabran
75bd631412
fix(local): panic in sg ci preview (#62928) 2024-05-27 15:06:25 +02:00
James McNamara
69b1bfb4d0
feat(ci): docker-images runtype (#62708)
---------

Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
2024-05-27 14:45:01 +02:00
William Bezuidenhout
1cc764ebb8
chore(sg): cloud ephemeral - account for conclusion field (#62925) 2024-05-27 13:22:14 +02:00
James Cotter
cb71a2d529
sg/msp: support for super-simple alerts on custom metrics (#62885)
---------

Co-authored-by: Joe Chen <joe@sourcegraph.com>
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-05-24 20:47:19 +01:00
James Cotter
d4a6b27403
sg/msp: fix init prompts breaking when encountering whitespace (#62898) 2024-05-24 15:08:35 +01:00
Joe Chen
2589fef13e
lib/background: upgrade Routine interface with context and errors (#62136)
This PR is a result/followup of the improvements we've made in the [SAMS repo](https://github.com/sourcegraph/sourcegraph-accounts/pull/199) that allows call sites to pass down a context (primarily to indicate deadline, and of course, cancellation if desired) and collects the error returned from `background.Routine`s `Stop` method.

Note that I did not adopt returning an error from the `Start` method, because I realized that in the monorepo the more common (and arguably the desired) pattern is to block on the call to `Start` until `Stop` is called. Collecting errors from `Start` as return values would therefore be meaningless, and would also complicate the design and semantics more than necessary.

All usages of `background.Routine` and `background.CombinedRoutines` are updated. I DID NOT try to interpret the code logic or improve anything beyond fixing compile and test errors.

The only file that contains the core change is the [`lib/background/background.go`](https://github.com/sourcegraph/sourcegraph/pull/62136/files#diff-65c3228388620e91f8c22d91c18faac3f985fc67d64b08612df18fa7c04fafcd).
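
For readers unfamiliar with the pattern, the interface shape described above might look roughly like the sketch below, where `Start` blocks until `Stop` is called, and `Stop` takes a context and returns an error. This is an illustration inferred from the description, not a copy of `lib/background/background.go`; `worker`, `newWorker`, `stopAll`, and `demo` are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Routine sketches the described shape: Start blocks until Stop is called,
// and Stop receives a context (deadline/cancellation) and returns an error.
type Routine interface {
	Start()
	Stop(ctx context.Context) error
}

// worker is a hypothetical routine whose Stop returns a preset error.
type worker struct {
	quit    chan struct{}
	stopErr error
}

func newWorker(stopErr error) *worker {
	return &worker{quit: make(chan struct{}), stopErr: stopErr}
}

func (w *worker) Start() { <-w.quit }

func (w *worker) Stop(ctx context.Context) error {
	close(w.quit)
	if err := ctx.Err(); err != nil {
		return err // the deadline passed or the caller cancelled
	}
	return w.stopErr
}

// stopAll stops every routine and collects the returned errors into one.
func stopAll(ctx context.Context, routines ...Routine) error {
	var errs []error
	for _, r := range routines {
		errs = append(errs, r.Stop(ctx))
	}
	return errors.Join(errs...) // nil if every Stop succeeded
}

// demo starts two workers, stops them with a deadline, and returns the result.
func demo() error {
	a, b := newWorker(nil), newWorker(errors.New("b: flush failed"))
	var wg sync.WaitGroup
	for _, r := range []Routine{a, b} {
		wg.Add(1)
		go func(r Routine) {
			defer wg.Done()
			r.Start()
		}(r)
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := stopAll(ctx, a, b)
	wg.Wait()
	return err
}

func main() {
	fmt.Println(demo()) // b: flush failed
}
```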
2024-05-24 10:04:55 -04:00
William Bezuidenhout
d485d76ee9
sg: cloud use new status reason format (#62881) 2024-05-23 21:04:10 +02:00
Jean-Hadrien Chabran
7c15db348d
chore(rel): fix tests not waiting for push prod (#62089) 2024-05-23 16:15:33 +02:00
William Bezuidenhout
0732f33c2d
sg: cloud eph - api now requires env during list (#62875) 2024-05-23 10:04:53 +00:00
William Bezuidenhout
4f72c222bf
sg: remove debugging printlns (#62854)
remove debugging printlns
2024-05-23 09:46:32 +00:00