Commit Graph

77 Commits

Author SHA1 Message Date
Robert Lin
879646a20e
feat/sg/msp: helpful error on cloudsqlproxy port conflict (#63830)
Ported from https://github.com/sourcegraph/controller/pull/1622 :) 

## Test plan

n/a
2024-07-15 11:32:37 -07:00
James Cotter
117fe09829
sg/msp: generate github action subscription matrix dynamically (#63526)
Currently the matrix is hardcoded in the msp repo. 
Service operators can forget to add or remove their service from the
list.

GitHub supports dynamically generating the matrix from a previous job's
output
([example](https://josh-ops.com/posts/github-actions-dynamic-matrix/)).
This PR adds an `sg msp subscription-matrix` command which will generate
the matrix we need.
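
As a rough illustration of the idea (a Python sketch, not the actual `sg` implementation; the `Service` type and `subscription_matrix` function are hypothetical names), the matrix can be built from the known service environments and emitted as compact JSON in the shape shown in the test plan:

```python
import json
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical stand-in for an entry derived from a service spec.
    id: str
    env: str
    category: str


def subscription_matrix(services):
    """Render a GitHub Actions matrix as compact JSON."""
    matrix = {
        "service": [
            {"id": s.id, "env": s.env, "category": s.category} for s in services
        ]
    }
    # Compact separators keep the output on one line, suitable for a job output.
    return json.dumps(matrix, separators=(",", ":"))


if __name__ == "__main__":
    print(subscription_matrix([
        Service("cloud-ops", "prod", "internal"),
        Service("gatekeeper", "prod", "internal"),
    ]))
```

A workflow would typically write this string to a job output and load it with `fromJSON` in the `matrix` of a later job.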

Part of CORE-202

## Test plan
Output
```
{"service":[{"id":"cloud-ops","env":"prod","category":"internal"},{"id":"gatekeeper","env":"prod","category":"internal"},{"id":"linearhooks","env":"prod","category":"internal"}]}
```
2024-06-27 19:52:01 +01:00
Robert Lin
6302955caf
feat/sg-msp-pg: add suggestion to check msp-ops page on perms error (#63118)
I think finding the right permissions confuses people pretty often when
first interacting with MSP. This adds a helper for annotating errors
returned from points where we might be able to help out @DaedalusG,
specifically for the situation in
https://sourcegraph.slack.com/archives/C05GJPTSZCZ/p1717629546727829 😉
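
A minimal sketch of what such an annotation helper could look like, in Python rather than the language `sg` is written in. The function name and the marker strings are assumptions for illustration (the markers are taken from error output quoted in this log); the real detection logic may differ:

```python
# Hypothetical markers - the real detection logic may differ.
PERMISSION_MARKERS = ("PermissionDenied", "PERMISSION_DENIED", "status code 403")


def annotate_permissions_error(err: Exception, docs_url: str) -> Exception:
    """Prefix errors that look permissions-related with a pointer to the docs."""
    message = str(err)
    if any(marker in message for marker in PERMISSION_MARKERS):
        return RuntimeError(
            "possible permissions error, ensure you have the prerequisite "
            f"Entitle grants mentioned in {docs_url}: {message}"
        )
    # Anything else is returned untouched.
    return err
```

Call sites that might hit a permissions problem would pass their error through the helper before returning it, so unrelated errors stay unchanged.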

## Test plan

It's a little wordy but:

```
sg msp pg connect sams prod
 possible permissions error, ensure you have the prerequisite Entitle grants mentioned in https://sourcegraph.notion.site/3e59b9ac3d414a5f8fb5911eed1e418a: find IAM output: gcloud: failed to access secret "iam_operator_access_service_account" from "sams-prod-ywuz": rpc error: code = PermissionDenied desc = Permission 'secretmanager.versions.access' denied for resource 'projects/sams-prod-ywuz/secrets/iam_operator_access_service_account/versions/latest' (or it may not exist).
```

## Changelog

- `sg msp pg connect` will tell you about your service's generated
Notion page if you run into a permissions-looking error during command
setup, where there is guidance about the required Entitle requests.
2024-06-05 18:55:59 -07:00
Robert Lin
908d7119ea
chore/msp: blindly retry Notion page deletion (#63052)
Deleting Notion pages takes a very long time, and is prone to breaking in the page deletion step, where we must delete blocks one at a time because Notion does not allow for bulk block deletions. The errors seem to generally just be random Notion internal errors. This is very bad because it leaves go/msp-ops pages in an unusable state.

To try and mitigate, we add several places to blindly retry:

1. At the Notion SDK level, where a config option is available for retrying 429 errors
2. At the "reset page" helper level, where a failure to reset a page will prompt a retry of the whole helper
3. At the "delete blocks" helper level, where individual block deletion failures will be retried
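
Levels 2 and 3 can be sketched roughly as follows. This is an illustrative Python sketch with hypothetical names (`retry`, `delete_blocks`, `reset_page`), not the actual helpers; level 1 is an SDK configuration option and is not shown:

```python
import time


def retry(fn, attempts=3, delay=0.0):
    """Call fn until it succeeds, blindly retrying on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            time.sleep(delay)


def delete_blocks(block_ids, delete_block):
    # Blocks are deleted one at a time; each deletion gets its own retries.
    for block_id in block_ids:
        retry(lambda: delete_block(block_id))


def reset_page(list_blocks, delete_block):
    # If block deletion still fails, retry the whole helper, re-listing
    # whatever blocks remain on the page at that point.
    retry(lambda: delete_blocks(list_blocks(), delete_block))
```

Re-listing the remaining blocks on each outer attempt means a retry of the whole helper does not try to delete blocks that are already gone.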

Attempt to mitigate https://linear.app/sourcegraph/issue/CORE-119

While here, I also made some other QOL tweaks:

- Fix timing of sub-tasks in CLI output
- Bump default concurrency to 5 (our retries will handle if this is too aggressive, hopefully)
- Fix a missing space in generated docs

## Test plan

```
sg msp ops generate-handbook-pages   
```
2024-06-03 22:32:06 +00:00
Robert Lin
9e4a8e8033
feat/sg/msp: add 'sg msp validate' for validating service specifications (#62973) 2024-05-30 09:11:36 -07:00
Robert Lin
fdb14b09d7
feat/sg-msp: add more stats to 'fleet' command (#62746)
Nice-to-have summaries for adoption of various features:

1. Services vs Jobs
2. Rollout pipelines
3. Deploy types
4. Resource dependencies
2024-05-22 11:17:53 -07:00
Robert Lin
6c9e620913
sg msp: only generate skaffold assets if last stage of rollouts (#62736)
#62704 introduced a regression due to the changing of the semantics of `rollouts` configuration in code: previously, only the final stage would get it, but with #62704 this became available on all environments, and to infer the final stage a nil-safe helper `rollout.IsFinalStage()` was introduced.

This change fixes a missed check migration that causes additional assets to be incorrectly generated for non-final environments.
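
To illustrate the kind of check involved, here is a Python sketch under assumed names. The `Rollout` shape is hypothetical, and `is_final_stage` only mirrors the idea of the nil-safe `rollout.IsFinalStage()` helper, not its actual implementation:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Rollout:
    # Hypothetical shape: ordered stage environment IDs plus the current environment.
    stages: List[str] = field(default_factory=list)
    environment: str = ""


def is_final_stage(rollout: Optional[Rollout]) -> bool:
    """None-safe: environments without a rollout are never the final stage."""
    if rollout is None or not rollout.stages:
        return False
    return rollout.stages[-1] == rollout.environment


def should_generate_skaffold_assets(rollout: Optional[Rollout]) -> bool:
    # A plain `rollout is not None` check (presumably the missed migration)
    # would be true for every stage, not just the last one.
    return is_final_stage(rollout)
```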

## Test plan

`sg msp generate -all`
2024-05-16 10:39:04 -07:00
Noah S-C
9b6ba7741e
bazel: transcribe test ownership to bazel tags (#62664) 2024-05-16 15:51:16 +01:00
Robert Lin
456315b54d
msp/rollouts: use new in-terraform custom target provisioning (#62644)
Closes CORE-23 - this change removes the manual `gcloud deploy apply` step previously required to enable MSP rollouts, thanks to a recent release of the Google Terraform provider.

## Test plan

https://github.com/sourcegraph/managed-services/pull/1403
2024-05-14 18:51:33 -07:00
Robert Lin
2d3b2c29f5
sg msp: add category flag for 'tfc sync' (#62675)
Enables staged rollouts of workspace updates.

## Test plan

```
sg msp tfc sync -all -category=test
```
2024-05-14 18:49:56 -07:00
Robert Lin
fdf0bf9a02
msp/operationdocs: add incident response starter guide, Notion-specific formatting (#62607)
Closes CORE-20: adds a small per-service "incident response" section near the alerts reference section of each service, providing some simple starter context and linking to other relevant guidance.

This change also makes some Notion-oriented formatting tweaks: putting all paragraphs on a single line (because of https://github.com/sourcegraph/notionreposync/issues/9) and also rendering callouts with appropriate background colors (https://github.com/sourcegraph/notionreposync/pull/11).

## Test plan

Golden tests, roll out to Notion:

```sh
GITHUB_ACTIONS=true sg msp ops generate-handbook-pages
```

Incident response:

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/d07e0071-870f-4acb-b4a4-2246b40850a3)

Callouts:

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/6ec7dbea-cafd-40e0-b50c-780c4e9cbd22)
2024-05-10 23:56:41 +00:00
Robert Lin
7b6dd9080e
msp: centralize and expose locations configuration (#62604)
This change adds a `locations: { gcpRegion: "...", gcpLocation: "..." }` configuration to centralize all location-related options. `gcpRegion` specifies regional preferences, while `gcpLocation` specifies multi-regional preferences (for resources that support it - only BigQuery in most cases).

Closes CORE-24 - see issue for some context.

## Test plan

```
sg msp generate -all # no diff
```

```
sg msp schema -output='../managed-services/schema/service.schema.json'
```
2024-05-10 15:50:07 -07:00
Robert Lin
022b4ad95f
msp/terraformcloud: add option to respect existing run mode (#62580)
When using https://github.com/sourcegraph/sourcegraph/pull/62565, we override test environments that are in CLI mode, which can cause infra to be rolled out by surprise via VCS mode on switch - this change adds an option to respect the existing run mode configuration via `-workspace-run-mode=ignore`.

Thread: https://sourcegraph.slack.com/archives/C06JENN2QBF/p1715256898022469?thread_ts=1715251558.736709&cid=C06JENN2QBF

## Test plan

```
sg msp tfc sync -all
👉 Syncing all environments for all services, including setting ALL workspaces to use run mode "vcs" (use '-workspace-run-mode=ignore' to respect the existing run mode) - are you sure? (y/N)  N
 aborting
Projects/sourcegraph/managed-services 1 » sg msp tfc sync -all -workspace-run-mode=ignore
👉 Syncing all environments for all services - are you sure? (y/N)  y
// ...
```
2024-05-09 14:57:40 -07:00
Robert Lin
a4b128f84b
sg msp tfc sync: support applying to all services (#62565) 2024-05-09 10:01:42 -07:00
Robert Lin
e1fa3393b5
msp: update msp-ops links (#62435)
With https://github.com/sourcegraph/sourcegraph/pull/62325 landed, this updates a few references:

1. Service golinks don't work anymore: CORE-105
2. Alerts docs now point to the service Notion page
2024-05-06 10:36:15 -07:00
Robert Lin
fa16c47b9b
msp: prefer env token before setting up GSM client (#62436)
This makes integration easier in CI - see failure in https://github.com/sourcegraph/managed-services/actions/runs/8945233608/job/24573806590
2024-05-06 10:21:07 -07:00
Robert Lin
f444774570
msp/operationdocs: write to Notion instead of Markdown (#62325)
Closes https://github.com/sourcegraph/managed-services/issues/1076 aka closes CORE-28 - https://github.com/sourcegraph/managed-services/pull/1332 also automated the updates.

Notes:

1. Notion anchors are ID-based, so we strip all anchor links because we cannot generate them in one pass. `notionreposync` may implement a mechanism for this in the future (filed https://github.com/sourcegraph/notionreposync/issues/8), but for now we don't have a great way around this, especially because of the next point
2. In-place updates are hard because of the block structure, so we destroy page contents and recreate the page every time. This causes a "flicker" as a viewer may see the page disappear slowly before their eyes (we can only delete things 1 block at a time). `notionreposync` may implement an improved mechanism for this in the future (filed https://github.com/sourcegraph/notionreposync/issues/7)
3. There's something funky going wrong with line breaks, filed https://github.com/sourcegraph/notionreposync/issues/9 - it hurts readability but it's not unmanageable.
4. Sadly Notion does not allow API file uploads (😡 https://developers.notion.com/docs/working-with-files-and-media#uploading-files-and-media-via-the-notion-api), so we generate them into the managed-services repo (https://github.com/sourcegraph/managed-services/pull/1332) and then just link to those diagrams from the generated page. We use a Markdown file that renders the SVG because the native SVG viewer sucks.
5. Made misc changes to operationdocs output where Notion's version is noticeably worse, or difficult to support (tables, lists-in-admonitions, etc)

Depends on various improvements upstream in https://github.com/sourcegraph/notionreposync:

- https://github.com/sourcegraph/notionreposync/pull/4
- https://github.com/sourcegraph/notionreposync/pull/5
- https://github.com/sourcegraph/notionreposync/pull/6

Follow-up improvements:

- CORE-105
- CORE-106

## Test plan

```
sg msp ops generate-handbook-pages
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/9d314a97-5370-4123-9534-f9f897002110)

https://sourcegraph.notion.site/Build-Tracker-infrastructure-operations-bd66bf25d65d41b4875874a6f4d350cc

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/69e1eb48-2fa9-421b-b2fd-969e25f37fee)

https://github.com/sourcegraph/managed-services/pull/1332

---------

Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
2024-05-03 21:17:16 +00:00
James Cotter
6d7082d26e
sg/msp: architecture diagrams (#62213) 2024-05-01 13:57:34 +01:00
Robert Lin
47298e8312
sg/msp: add service lists to help text, improve completions (#62315)
Despite recent efforts to surface service options in `sg msp` commands better, such as https://github.com/sourcegraph/sourcegraph/pull/61620, it seems "what is a service argument" remains a point of confusion. It doesn't help that some of the command help texts are not very informative.

This change:

1. Adds explicit lists of available services in help text when we can get it
2. Improves service ID completion so that it works in any subdirectory in the managed-services repo

## Test plan

```sh
Projects/sourcegraph/managed-services » cd services
sourcegraph/managed-services/services » sg msp ops -h
NAME:
   sg managed-services-platform operations - Generate operational reference for a service

USAGE:
   sg msp ops [command options] <service ID>

DESCRIPTION:
   Directly view operational reference documentation for a service - also available in go/msp-ops.

   Available services:
   - build-tracker
   - cloud-ops
   - cloud-relay
   - cody-analytics
   - entitler
   - gatekeeper
   - msp-testbed
   - pings
   - releaseregistry
   - sams
   - sourcegraph-accounts
   - support-integration
   - telemetry-gateway

COMMANDS:
   help, h  Shows a list of commands or help for one command

OPTIONS:
   --pretty    Render syntax-highlighed Markdown (default: true)
   --help, -h  show help
sourcegraph/managed-services/services » sg msp ops # <tab>
help  h  -- Shows a list of commands or help for one command                                                                            
build-tracker         cody-analytics        msp-testbed           sams                  telemetry-gateway                         
cloud-ops             entitler              pings                 sourcegraph-accounts                                          
cloud-relay           gatekeeper            releaseregistry       support-integration      
```
2024-04-30 17:14:24 -07:00
Robert Lin
0fb6806f3b
sg/msp: warn user if sg-msp lockfile differs from current version (#62134)
In https://github.com/sourcegraph/managed-services/pull/1288 I'm introducing granular, per-category version locks for https://github.com/sourcegraph/managed-services/issues/599. This change adds warnings that are shown to the user on version mismatches.

This may cause some friction with users, but for now, let's just deal with this on a case-by-case basis - our primary goal right now is to build a process for https://github.com/sourcegraph/managed-services/issues/599 by introducing more granular version locks.

Requires https://github.com/sourcegraph/sourcegraph/pull/62176 so that we can access the same version format as the one used in our lockfiles.

Details on our locking strategy are here: https://sourcegraph.notion.site/Deploying-new-versions-of-MSP-1808e7e45bd54f419dd93af542d99238#58dabe4992754ca18ed39bc212ccbbba

## Test plan

```
sg msp generate -all
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/d5c19821-7a26-4a91-a55e-1de03de2912b)
2024-04-29 17:50:24 -07:00
Robert Lin
3623ecb2cf
sg/msp: filter generated environments by category (#62131)
Part of https://github.com/sourcegraph/managed-services/issues/599

See https://github.com/sourcegraph/managed-services/pull/1288 for how this mechanism will be used.

## Test plan

```sh
sg msp generate -all -category=test
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/d27c5df5-6d13-43fa-96c2-5523da24693a)
2024-04-24 09:44:16 -07:00
James Cotter
738a37c7a9
sg/msp: add alert policy documentation to generated ops pages (#61939) 2024-04-18 19:46:19 +00:00
Robert Lin
7432727f70
sg msp: add interactive version of 'sg msp init' and 'sg msp init-env' (#61697)
As titled - we can now prompt you through service setup. This is a small QOL improvement that makes providing the required parameters a bit easier, as the setup flags experience has been a point of feedback from several MSP adopters.

As a follow-up, we can probably remove the flag-based setup entirely, as it should generally be a human-operator-only setup. Then we can expand the setup to include e.g. resource setup (postgres), and maybe an initial generate step as well.

## Test plan

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/62810f24-bcce-4c8d-8bed-2f6c762ab45a)
2024-04-09 01:32:58 +00:00
Robert Lin
514506de4d
sg/msp: retrieve service/env more consistently, provide possible values in errors (#61620)
This change migrates `generate` and `tfc sync` to use our service/env argument getters so that we return more consistent error messages. Errors around non-existent or missing service/env arguments now also provide relevant lists of possible values, such as all available services or all available environments for a valid service argument (see test plan examples).

Hopefully this makes errors easier to understand, as the possible values should give a better hint as to what arguments the command expects.
2024-04-05 22:05:35 +09:00
Noah S-C
d882ad23ba
msp: add yaml langserver magic comment for schema (#61428)
If you open a generated MSP service file in your editor as part of a different workspace/repo, VS Code doesn't pick up which schema file to use for it, resulting in no IntelliSense. The YAML language server supports a magic comment to point to a relative file, which works in this case 🙂 

## Test plan

Validated locally with local sg build and `sg msp init ...`
2024-03-27 17:33:15 +00:00
Robert Lin
08bc88e776
msp/rollouts: define custom target type in Terraform (#61366)
MSP rollouts (#59956) currently requires an additional manual step to provision via a `gcloud deploy apply` using a generated configuration YAML file. This is required because at the time, the following were not available via Terraform:

1. Cloud Deploy custom target _types_: define entities in Cloud Deploy describing the existence of custom targets using custom Skaffold scripts.
2. Cloud Deploy targets, _using_ custom target types: the Terraform resource only supports native target types, not custom targets.

In a recent GCP Terraform provider release, support for 1 was added, and this change migrates the definition currently in the generated Cloud Deploy YAML file. However, 2 is not yet supported, so we can't yet remove the manual `gcloud deploy apply` step - this is tracked in https://github.com/sourcegraph/managed-services/issues/940. This PR also improves the docstrings to better indicate what we expect to change in the future.

Closes https://github.com/sourcegraph/managed-services/issues/932

## Test plan

https://github.com/sourcegraph/managed-services/pull/939
2024-03-26 11:34:24 +09:00
Robert Lin
1704ea2bd7
sg msp pg: UX improvements (#61358)
More feedback from recent `sg msp pg` usage, starting with https://sourcegraph.slack.com/archives/C05GJPTSZCZ/p1710932987694719?thread_ts=1709911173.644899&cid=C05GJPTSZCZ:

1. **operationdocs**: Stronger wording on first-time `managed-services` repo and tooling setup, in particular saying you're going to need to clone the repo.
2. **operationdocs**: Note that write-access Entitle is required even for read-only database connection (both cases require IAM impersonation, which _can_ grant write access, so it's gated behind the write-access request)
3. **sg msp**: Throw special error when additional args are provided in commands that don't expect it, reminding users that flags need to be placed before args.
4. **sg msp**: Render warning with link to generated docs if permissions-related error is detected in `cloud-sql-proxy` output.
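
Point 3 can be sketched as a simple positional-argument count check. This is a Python illustration with a hypothetical `check_no_extra_args` helper, assuming that flags placed after the first positional argument end up being treated as plain arguments:

```python
def check_no_extra_args(args, expected):
    """Raise if more positional arguments were given than the command expects."""
    extra = args[expected:]
    if extra:
        joined = " ".join(extra)
        raise ValueError(
            f'got unexpected additional arguments "{joined}" - note that '
            "flags must be placed BEFORE arguments, i.e. '<flags> <arguments>'"
        )
```

A command expecting `<service ID> <env ID>` would call this with `expected=2` before doing any other work.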

## Test plan

```
sg msp pg connect sourcegraph-accounts prod --session.timeout foobar
 got unexpected additional arguments "--session.timeout foobar" - note that flags must be placed BEFORE arguments, i.e. '<flags> <arguments>'
```

```
  [cloud-sql-proxy] 2024/03/25 08:06:36 [sourcegraph-accounts-prod-csvc:us-central1:postgresql-e6bc] failed to connect to instance: failed to get instance: Refresh error: failed to get instance metadata (connection name = "sourcegraph-accounts-prod-csvc:us-central1:postgresql-e6bc"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/sourcegraph-accounts-prod-csvc/instances/postgresql-e6bc/connectSettings?alt=json&prettyPrint=false": impersonate: status code 403: {
  [cloud-sql-proxy]   "error": {
  [cloud-sql-proxy]     "code": 403,
  [cloud-sql-proxy]     "message": "Permission 'iam.serviceAccounts.getAccessToken' denied on resource (or it may not exist).",
  [cloud-sql-proxy]     "status": "PERMISSION_DENIED",
  [cloud-sql-proxy]     "details": [
  [cloud-sql-proxy]       {
  [cloud-sql-proxy]         "@type": "type.googleapis.com/google.rpc.ErrorInfo",
  [cloud-sql-proxy]         "reason": "IAM_PERMISSION_DENIED",
  [cloud-sql-proxy]         "domain": "iam.googleapis.com",
  [cloud-sql-proxy]         "metadata": {
  [cloud-sql-proxy]           "permission": "iam.serviceAccounts.getAccessToken"
  [cloud-sql-proxy]         }
  [cloud-sql-proxy]       }
  [cloud-sql-proxy]     ]
  [cloud-sql-proxy]   }
  [cloud-sql-proxy] }
⚠️ Permissions error detected - do you have the prerequisite Entitle permissions grant? See go/msp-ops/sourcegraph-accounts#prod for more details.
```

https://github.com/sourcegraph/handbook/pull/8767 updates the handbook with the new output
2024-03-25 20:50:33 +09:00
James Cotter
ae09144e2f
sg/msp: update CloudDeploy helper comment (#61164) 2024-03-14 16:00:03 -04:00
Robert Lin
68c817a05b
sg msp: improve cloudsqlproxy installation UX (#60984)
Previously, we'd ask users to run the command again with the `-download` flag. This is kind of annoying especially because of flag positioning quirks. Instead, let's just ask the user if they'd actually like us to install it for them, in case we don't find the binary in the cache.

## Test plan

```sh
$ sg msp pg connect sams dev  
⚠️ cloud-sql-proxy binary not found at "/Users/robert@sourcegraph.com/Library/Caches/sourcegraph/bin/cloud-sql-proxy/2.8.1/cloud-sql-proxy"
👉 Would you like me to install cloud-sql-proxy for you? n
 failed to find cloud-sql-proxy: stat /Users/robert@sourcegraph.com/Library/Caches/sourcegraph/bin/cloud-sql-proxy/2.8.1/cloud-sql-proxy: no such file or directory

$ sg msp pg connect sams dev
⚠️ cloud-sql-proxy binary not found at "/Users/robert@sourcegraph.com/Library/Caches/sourcegraph/bin/cloud-sql-proxy/2.8.1/cloud-sql-proxy"
👉 Would you like me to install cloud-sql-proxy for you? y
 cloud-sql-proxy binary saved to "/Users/robert@sourcegraph.com/Library/Caches/sourcegraph/bin/cloud-sql-proxy/2.8.1/cloud-sql-proxy"
💡 Preparing a connection with read-only access - for write access, use the '-write-access' flag.
👉 Use this command to connect to database "accounts":
                                                                                                                   
psql -U operatoraccess-a55c85@sams-dev-bfec.iam -d accounts -h 127.0.0.1 -p 5433                                   

👉 Use this command to connect to database "cody_management":
                                                                                                                   
psql -U operatoraccess-a55c85@sams-dev-bfec.iam -d cody_management -h 127.0.0.1 -p 5433                            

⚠️ The current session will terminate in 300 seconds. Use '-session.timeout' to increase the session duration.
  [cloud-sql-proxy] 2024/03/11 03:35:04 Impersonating service account with Application Default Credentials
  [cloud-sql-proxy] 2024/03/11 03:35:05 [sams-dev-bfec:us-central1:postgresql-26ca] Listening on 127.0.0.1:5433
  [cloud-sql-proxy] 2024/03/11 03:35:05 The proxy has started successfully and is ready for new connections!
^C  [cloud-sql-proxy] 2024/03/11 03:35:06 SIGINT signal received. Shutting down...
```

---------

Co-authored-by: James Cotter <35706755+jac@users.noreply.github.com>
2024-03-11 15:06:32 +00:00
Robert Lin
46e107a3fb
msp: deployment rollout strategies (#59956)
Allows services to define a `rollout` spec that ensures new image releases go through a specified sequence and flow. We do this using Cloud Deploy with custom targets that update the Cloud Run service image, and by configuring Terraform to ignore image changes.

> [!NOTE]
> We use a custom target (as opposed to using the native Cloud Deploy + Cloud Run integration, which wants the entire spec in YAML for releases - see https://github.com/sourcegraph/managed-services/issues/186#issuecomment-1915196511) because everything else we have is generated in Terraform, and the core Cloud Run configuration extensively references Terraform values. It would be an extensive undertaking to change how this works. For the most part, this is to deploy a new version of the service code, and it can be beneficial to tie that to the service repository's CI to make it clear where a piece of code goes - building the custom target to _only_ roll out images allows us to do that.

Custom targets are not yet supported by the GCP Terraform provider, which is unfortunate - instead we have to render some YAML that can be applied with a `gcloud` command. For the most part, this should be a one-time operation. There is generated guidance on what to do with the generated output, and also how to create releases.

Closes https://github.com/sourcegraph/managed-services/issues/186

Kinda rambly, high-level Loom overview: https://www.loom.com/share/55bfa34d173c40a9b78708de2029f34f?sid=6f1b062d-ba02-4bb9-8abe-c9f8f8f9a8fe

### Configuring rollouts

In the top-level service spec:

```yaml
rollout:
  stages:
    - environment: test
    - environment: robert
```

And in each relevant environment:

```yaml
- id: robert
  projectID: msp-testbed-robert-7be9
  category: test
  deploy:
    type: rollout
```

`sg msp generate` will render resources for the "last" stage to house Cloud Deploy infrastructure.

### Creating releases

Creating a release triggers a rollout, which progresses through the specified stages, like so:

<img width="1347" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/9df0e510-08eb-4fd4-bbd4-1d58c6817bba">

Creating releases is intended to be run using `gcloud` commands for now - we could introduce a `sg msp` command for this later. The command creates a release targeting the Cloud Deploy pipeline that exists in the final-stage project. Example command (one is also generated in the pipeline YAML file docstrings):

```sh
gcloud deploy releases create manual-test-04-2024-01-31 \
    --project=msp-testbed-robert-7be9 \
    --region=us-central1 \
    --delivery-pipeline=msp-testbed-us-central1-rollout \
    --source='gs://msp-testbed-robert-7be9-cloudrun-skaffold/source.tar.gz' \
    --labels="commit=abc123,author=foo" \
    --deploy-parameters="customTarget/tag=dd34d1be076e_2024-01-31"
```

Promotions can happen at any time - not every release needs to be promoted to the subsequent stage - and currently must happen manually for each stage except the first.

A secret/output is provisioned with a "release creator" SA, which can be used to create a [workload identity pool](https://sourcegraph.sourcegraph.com/github.com/sourcegraph/infrastructure/-/blob/managed-services/continuous-deployment-pipeline/main.tf?L5-20) for running the `gcloud deploy releases create` command in CI.

After the first apply, which now assumes an `insiders` tag, Terraform no longer touches the image via a [lifecycle ignore](https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle).

### Rollout execution

Rollouts happen via a new `clouddeploy-executor` SA in the last stage, which is granted sufficient IAM roles to deploy Cloud Run revisions.

The "render" step in Skaffold prepares a release - in our case, generating a `deploy.sh` with the prerequisite arguments. A record is available in the relevant "rollout" page:

<img width="1694" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/53b70923-c0ce-4661-8e6f-d3444cf256e1">

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/1a82b2ce-50cc-4411-92a3-02cf9779465e)

The "deploy" step just downloads the artifact and executes it.

### Tracing a release

You can include arbitrary labels on releases - these show up in the release entity in the GCP console, but we don't yet propagate anything very well down to the Cloud Run revision. In particular, it seems like we don't get the tag information in the revision UI, but if you click "edit" you see the correct tag populated:

<img width="500" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/856ccd92-9e0d-41d9-b84c-1846b30a3f79">  <img width="500" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/f917be4b-f714-4fef-8871-2006fbf83901">


See `skaffold.yaml` - we're mostly just executing commands with `gcloud`, and reporting expected outputs. We can extend this with more detailed outputs and additional tagging or scripting if we want - examples I've seen often build a custom binary/image to execute more advanced use cases. Also see https://cloud.google.com/deploy/docs/custom-targets

### Rollbacks

[Cloud Deploy has a concept of rollbacks](https://cloud.google.com/deploy/docs/roll-back), which you can apply via UI - it seems this just runs the previous configuration:

<img width="868" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/6bdc8459-61b7-4ce6-9397-c2f9b3a29e8b">
<img width="1426" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/778241a7-3a97-45f9-b4a6-31bf81f5a8d5">

## Test plan

See https://github.com/sourcegraph/managed-services/pull/454 and https://console.cloud.google.com/deploy/delivery-pipelines/us-central1/msp-testbed-us-central1-rollout?project=msp-testbed-robert-7be9 . I also specifically tested that deploying a particular image, and then deploying a change in Terraform, does not overwrite the image, and we do not have infinite drift on the Terraform when releases deploy images.

Also https://github.com/sourcegraph/managed-services/actions/runs/7744296405
2024-02-14 11:12:11 +04:00
Robert Lin
a2af8d47b1
sg msp: add 'fleet' command for summaries (#59878)
Saves some time for #progress posts so that I'm not constantly counting these by hand
2024-01-25 15:53:49 -08:00
Robert Lin
7fe86c6137
msp: provision Opsgenie service for sync to Incident.io catalog, add required descriptions (#59569) 2024-01-17 15:30:30 -08:00
Robert Lin
c0f8d5b7dd
sg msp: use stable generation by default (#59612)
There are a few issues with unstable output by default:

1. Rolling out TF changes can inadvertently deploy new revisions
2. Some services have private images that the operator might not have access to - we don't want image update to block `sg msp generate` by default
3. Updating images via subscription should primarily be done via automation, or manually and explicitly

This change toggles `-stable=true` by default. I will update our image update workflow to use `-stable=false` explicitly: https://github.com/sourcegraph/managed-services/pull/392
2024-01-16 10:46:54 -08:00
Robert Lin
0060df720e
msp/monitoring: add external uptime check and alert, rework health probes configuration (#59461)
The new configuration is mostly based on Cody Gateway - if a service has an external domain, we create an uptime check and alert on failures.

The uptime check uses MSP standards, which depend on whether or not service health probes are configured. Since we use this in several places now, I've also reworked the health probes configuration to make it easier to reason with:

1. `healthProbes` now configures all healthchecks. `startupProbe` and `livenessProbe` have been removed
2. `disabled` is now `healthzProbes` - this configures if MSP healthchecks should be used, instead of default `/` ones.
3. By default, if no config is provided, MSP healthchecks are not used
4. If config is provided, MSP healthchecks must be explicitly disabled
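
Rules 3 and 4 can be read as the following decision, sketched in Python. The function name is hypothetical, and treating `healthzProbes` as a boolean field of `healthProbes` is an assumption:

```python
from typing import Optional


def uses_msp_healthchecks(health_probes: Optional[dict]) -> bool:
    """Decide whether MSP healthchecks apply to a service."""
    if health_probes is None:
        # No config provided: MSP healthchecks are not used.
        return False
    # Config provided: enabled unless explicitly disabled.
    # Assumes `healthzProbes` is a boolean field.
    return health_probes.get("healthzProbes", True)
```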

Closes https://github.com/sourcegraph/managed-services/issues/350

This is required for our upcoming vendor evaluations as well.

This PR also includes a variety of internal improvements to alert policies.
2024-01-15 16:50:41 -08:00
Robert Lin
d23abb55d1
sg msp: improve error messaging (#59552)
Addresses some feedback from [this thread](https://sourcegraph.slack.com/archives/C05GJPTSZCZ/p1705074439174099):

1. `init-env` might be easily confused for `init`, this change adds an up-front check that all arguments are present and returns an error message suggesting `init` just in case you haven't created a service yet
2. If a service spec can't be opened, we now return an error message `service does not exist`
3. All callsites of `spec.Open` now wrap the error with the ID of the service they are expecting to open
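
Points 2 and 3 describe layered error wrapping, which might look roughly like this in Python. `open_spec` and `load_service` are hypothetical stand-ins for `spec.Open` and its callsites, and the exact OS error text will differ from the output in the test plan:

```python
class ServiceNotFoundError(Exception):
    """Raised when a service spec file cannot be found."""


def open_spec(path):
    # Stand-in for `spec.Open`: translate a missing file into a clearer error.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError as err:
        raise ServiceNotFoundError(f"service does not exist: {err}") from err


def load_service(service_id, root="services"):
    # Callsite: wrap any error with the ID of the service we expected to open.
    path = f"{root}/{service_id}/service.yaml"
    try:
        return open_spec(path)
    except Exception as err:
        raise RuntimeError(f'load service "{service_id}": {err}') from err
```

Each layer adds context to the front of the message, so the final error reads from the most general context to the most specific cause.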

## Test plan

```
$ sg msp init-env dev               
 exactly 2 arguments required, '<service ID>' and '<env ID>' -  this command is for adding an environment to an existing service, did you mean to use 'sg msp init' instead?
$ sg msp init-env asdfasdf dev
 load service "asdfasdf": service does not exist: open services/asdfasdf/service.yaml: no such file or directory
```
2024-01-12 18:21:27 +00:00
Robert Lin
7823536328
sg msp: add 'sg msp operations generate-handbook-pages' (#59496)
Expand the new `sg msp ops` command to create an entire directory tree in https://github.com/sourcegraph/handbook. For now, we assume someone will update this by hand from time to time - environments should generally be fairly static.

## Test plan

```
sg msp operations generate-handbook-pages
```

https://github.com/sourcegraph/handbook/pull/8429

---------

Co-authored-by: James Cotter <35706755+jac@users.noreply.github.com>
2024-01-11 19:37:29 +00:00
Robert Lin
74c341af3e
sg msp: use static list of stack names, remove --tfc=false (#59483)
This change adds a static list of all workspaces we have. This is unlikely to change much more in the future. This static list can be used for:

1. Syncing Terraform Cloud workspaces (we no longer need to render the stack to do so)
2. CLI completions where appropriate

To make sure the static list holds, I've also removed the option to not use TFC as the backend.
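
As a rough illustration of how a static list can serve both uses, consider the Go sketch below. Everything in it is an assumption: the stack names other than `cloudrun` are placeholders, and the `msp-<service>-<env>-<stack>` workspace naming scheme is hypothetical rather than taken from this change.

```go
package main

import (
	"fmt"
	"strings"
)

// stackNames is a placeholder static list. Only "cloudrun" is mentioned in
// this log; the other entries are illustrative.
var stackNames = []string{"project", "iam", "cloudrun", "monitoring"}

// workspaceNames lists the workspaces to sync for one environment without
// rendering any stacks. The naming scheme here is an assumption.
func workspaceNames(service, env string) []string {
	names := make([]string, 0, len(stackNames))
	for _, stack := range stackNames {
		names = append(names, fmt.Sprintf("msp-%s-%s-%s", service, env, stack))
	}
	return names
}

// completeStack returns the stack names matching a partially typed
// argument, as a CLI completion hook might.
func completeStack(prefix string) []string {
	var matches []string
	for _, stack := range stackNames {
		if strings.HasPrefix(stack, prefix) {
			matches = append(matches, stack)
		}
	}
	return matches
}

func main() {
	fmt.Println(workspaceNames("sams", "dev"))
	fmt.Println(completeStack("c")) // [cloudrun]
}
```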
2024-01-11 10:11:30 -08:00
Robert Lin
1ee4b0393b
sg msp: expand destroy protection to more features (#59462)
2024-01-10 09:32:13 -08:00
Robert Lin
26577b3386
sg/msp: add 'sg msp tfc graph' for resources (#59401)
We likely need architecture diagrams eventually, especially for SOC2, and I thought it might be good to explore what we can generate. Since MSP infra architecture is conditional on the service specification, documenting it by hand can prove rather difficult outside of `sg msp operations`.

I tried walking the `cdktf.TerraformStack` graph, but couldn't figure out how to get dependencies correctly. In the end @michaellzc pointed me to `terraform graph`, which uses TF plans to prepare a graph. It includes stuff that doesn't feel very important, so I added a bunch of crude filtering to make the graph a bit more usable.

The layout is not _great_ - I tried the various [dot layout engines](https://graphviz.org/docs/layouts/) and `unflatten` but none of them worked very well for our use case - but the information is actually kind of useful, and does illustrate a realistic graph of the various pieces involved.

A default rendering of the graph is available with `sg msp tfc graph`, and you can get the dot-format configuration with `sg msp tfc graph -dot`. I think the grouping under the `tfc` commands makes sense because you do need TFC to generate this.
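
The kind of crude filtering mentioned above could be sketched as a line-based pass over the DOT text that `terraform graph` produces. This is not the filter added in this PR: `filterDot` is a hypothetical helper, and the substrings dropped in the example are illustrative only.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// filterDot drops every line of a DOT document that mentions one of the
// given substrings, keeping all other lines in their original order.
func filterDot(dot string, drop []string) string {
	var out strings.Builder
	sc := bufio.NewScanner(strings.NewReader(dot))
	for sc.Scan() {
		line := sc.Text()
		skip := false
		for _, d := range drop {
			if strings.Contains(line, d) {
				skip = true
				break
			}
		}
		if !skip {
			out.WriteString(line)
			out.WriteByte('\n')
		}
	}
	return out.String()
}

func main() {
	// A tiny hand-written graph for illustration, not real output.
	dot := `digraph {
	"service" -> "database"
	"service" -> "provider[google]"
	"database" -> "var.region"
}`
	fmt.Print(filterDot(dot, []string{"provider[", "var."}))
}
```

Filtering whole lines is blunt, since it removes an edge whenever either endpoint matches, but it keeps the output valid DOT as long as the braces themselves never match a filter.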

Part of https://github.com/sourcegraph/managed-services/issues/361 and https://github.com/sourcegraph/managed-services/issues/328

## Test plan

```
sg msp tfc graph sams dev cloudrun
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/9c4e4441-9405-4610-b773-5b960fbb0c23)
2024-01-08 16:58:55 -08:00
Robert Lin
2345ed8fc3
sg/msp: add prototype 'sg msp docs' (#59348)
This is a start to https://github.com/sourcegraph/managed-services/issues/328 and not meant to be complete - it creates some Markdown that can be read in terminal for now. Eventually we may want to persist this somewhere.
2024-01-05 05:07:43 +00:00
Robert Lin
676defe4b3
sg msp: use env GSM only, fix 'sg msp logs' for jobs (#59346)
Previously, commands like `sg msp pg connect` required TFC access to run. Now, we just use the outputs exported to GSM directly (#59341), so that these commands can run if you have access to the GCP project only.
2024-01-04 18:07:42 -08:00
Robert Lin
8bdd7e404f
msp: restructure to export stacks implementations (#59345)
This lifts stacks from `dev/managedservicesplatform/internal/stack` to `dev/managedservicesplatform/stacks`, making it easier to share consts (namely outputs) directly with MSP tooling.
2024-01-05 00:56:12 +00:00
Robert Lin
4ca528adc0
sg/msp: improve test coverage on examples, fix project ID generation (#59343)
2024-01-04 16:13:11 -08:00
Robert Lin
8b32ce6b87
msp: emit StackLocals as GSM secrets (#59341)
Right now, commands like `sg msp db connect` need to access TFC workspace outputs. This is clunky because it requires another Entitle roundtrip to get credentials and access to TFC.

Now that we configure project ID up-front, `sg msp` can just reach out to the service environment's project ID for secrets - by adding all local variables/TF outputs to GSM as well, we can now get access to everything with just one Entitle request on the environment project or folder.

This change only emits StackLocals as GSM secrets - I'll make the actual tooling changes in a follow-up.
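
To illustrate why a statically known project ID is enough to locate these secrets, here is a minimal Go sketch that builds a Secret Manager version name from a service spec. The `Spec` and `Environment` types and the `secretVersionName` helper are hypothetical, and the example values come from an error message quoted elsewhere in this log; the actual lookup would go through the Secret Manager API, which is omitted here.

```go
package main

import "fmt"

// Environment and Spec are hypothetical, trimmed-down stand-ins for the
// service spec, keeping only `environments[].projectID`.
type Environment struct {
	ID        string
	ProjectID string
}

type Spec struct {
	Environments []Environment
}

// secretVersionName returns the resource name of the latest version of a
// secret in the given environment's project.
func secretVersionName(spec Spec, envID, secretID string) (string, error) {
	for _, env := range spec.Environments {
		if env.ID == envID {
			return fmt.Sprintf("projects/%s/secrets/%s/versions/latest",
				env.ProjectID, secretID), nil
		}
	}
	return "", fmt.Errorf("environment %q not found", envID)
}

func main() {
	spec := Spec{Environments: []Environment{{ID: "prod", ProjectID: "sams-prod-ywuz"}}}
	name, err := secretVersionName(spec, "prod", "iam_operator_access_service_account")
	if err != nil {
		fmt.Println(err)
		return
	}
	// Prints: projects/sams-prod-ywuz/secrets/iam_operator_access_service_account/versions/latest
	fmt.Println(name)
}
```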
2024-01-04 15:30:19 -08:00
Robert Lin
ffe3e450c0
sg msp: 'init-env' fixes (#59254)
Fix path generation and arguments from https://github.com/sourcegraph/sourcegraph/pull/59220 for the `sg msp init-env` command.

Also updates the initial example service spec to match the one generated from writing the YAML back to disk - there's not too much control over the indentation offered by the library.
2023-12-29 16:34:40 -08:00
Robert Lin
4a640b5e96
msp: generate projectID up-front and persist in spec (#59220)
This is a big diff, but the changes all tie together, so hear me out:

The only way to get the project ID right now is to query the appropriate Terraform Cloud workspace outputs. However, to do that, you need access to `sourcegraph-secrets`, to get the appropriate TFC access token.

This is awkward because as an operator, you would follow the instructions to request `mspServiceEditor` on your desired project - but now, to use various MSP tooling like `sg msp pg connect`, you must _also_ request access to `sourcegraph-secrets`, so that we can get a TFC token to find the project ID and other stuff. Because we might have a large number of services it's not feasible to manually set up Entitle bundles (they cannot be programmatically created).

The approach I want to take is to copy the MSP team TFC token from `sourcegraph-secrets` into each individual MSP environment project. Then, we can get the MSP team TFC token from the _environment_ project instead, access for which will be granted by the `mspServiceEditor` role. To do this however, we must know the project ID up front. So this PR makes the following changes:

1. Makes it so that the randomized project ID isn't managed by Terraform, but generated statically, and configured in `environments[].projectID`.
2. This requires changes to `sg msp init` to create a project ID the same way we create it in-Terraform today. In addition to service initialization, we must now also have tooling to start configuring a new environment, so that we can generate a project ID for the operator. This is done via a new command, `sg msp init-env`, which inserts a new environment into a service spec.
   - MSP service specs are intended to be operator-written and hopefully include lots of docstrings on configuration, so we take special care to preserve formatting and comments by manipulating `yaml.Node` directly.
3. In order to use `yaml.Node`, however, we must switch over to `gopkg.in/yaml.v3` - previously, we used the K8S YAML library, mostly as a carry-over from what is used in Cloud. In order to use `gopkg.in/yaml.v3`, we need to:
   - Replace all `json` struct tags with `yaml`, as the YAML library does not support JSON tags
   - Upgrade `github.com/invopop/jsonschema` so that we can point the JSON schema generator to use the `yaml` tags as well
4. Now that we have `projectID` statically available, we can remove code that queries TFC workspaces for the project ID and replace it with references to the spec instead.
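
A static project ID generator could look something like the Go sketch below. The `<service>-<env>-<4 random lowercase letters>` shape is an assumption inferred from an example ID (`sams-prod-ywuz`) that appears elsewhere in this log, and `newProjectID` is a hypothetical name; the actual generator in this PR may truncate, validate, or randomize differently. The 30-character cap reflects GCP's project ID length limit.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// suffixAlphabet is the set of characters used for the random suffix.
const suffixAlphabet = "abcdefghijklmnopqrstuvwxyz"

// maxProjectIDLength is GCP's upper bound on project ID length.
const maxProjectIDLength = 30

// newProjectID builds an ID of the assumed form <service>-<env>-<suffix>,
// where the suffix is 4 random lowercase letters.
func newProjectID(serviceID, envID string) (string, error) {
	suffix := make([]byte, 4)
	for i := range suffix {
		n, err := rand.Int(rand.Reader, big.NewInt(int64(len(suffixAlphabet))))
		if err != nil {
			return "", err
		}
		suffix[i] = suffixAlphabet[n.Int64()]
	}
	id := fmt.Sprintf("%s-%s-%s", serviceID, envID, suffix)
	if len(id) > maxProjectIDLength {
		return "", fmt.Errorf("project ID %q exceeds %d characters", id, maxProjectIDLength)
	}
	return id, nil
}

func main() {
	id, err := newProjectID("sams", "prod")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(id) // random suffix, so the output differs on every run
}
```

Generating the ID once and persisting it in `environments[].projectID` is what lets tooling read it from the spec rather than from Terraform state.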

## Test plan

1. Unit tests on `sg msp init`'s generated output
2. Unit tests on inserting environment
3. Unit tests on project ID generator
4. https://github.com/sourcegraph/managed-services/pull/295
2023-12-22 17:25:40 -08:00
Robert Lin
114a883473
msp: add 'sg msp logs' (#59190)
Jump to a tidy view of service logs (assuming `sourcegraph/log`) in browser quickly.
2023-12-22 18:04:10 +00:00
Robert Lin
bb12ca2eb3
msp: fix tfc workspace sync on services with private images (#59196)
We generate all stacks to get a list of stacks for which we need TFC workspaces - we need to do this in "stable mode" to avoid image access on subscription types (same thing we do for CI)
2023-12-22 09:50:20 -08:00
Noah S-C
0358a79f24
bazel: add no-localhost-guard lint to nogo (#59144)
Also fixes all the violations that snuck past due to https://github.com/sourcegraph/sourcegraph/pull/46202 confusingly breaking the `git grep` call 🤔 

## Test plan

Tested with various combinations of [the following line](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@main/-/blob/internal/conf/computed.go?L300)
2023-12-21 16:30:33 +00:00
Robert Lin
6c7695b51b
sg msp: add 'sg msp tfc view' (#59142)
Convenience helper for opening workspaces in browser. I'm hoping to expand sg msp with more commands to view resources in GCP console and stuff.
2023-12-20 16:20:53 -08:00