Commit Graph

183 Commits

Author SHA1 Message Date
James Cotter
4c040347ec
sg/msp: enable alerting by default for production projects (#63912)
Opsgenie alert notifications for critical alerts should be enabled by
default for production projects or where `env.alerting.opsgenie` is set
to true.
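
The rule above, together with the explicit `false` override exercised in the test plan, can be sketched as a tri-state lookup. This is only an illustration: the `alertingSpec` type, the `opsgenieEnabled` helper, and the `"prod"` category string are hypothetical and not taken from the MSP codebase.

```go
package main

import "fmt"

// alertingSpec is a hypothetical stand-in for the `env.alerting` block; only
// the `opsgenie` field name comes from the description above.
type alertingSpec struct {
	// Opsgenie is a pointer so that "unset" can be told apart from "false".
	Opsgenie *bool
}

// opsgenieEnabled reports whether critical alerts should notify Opsgenie.
// An explicit setting always wins; otherwise only production environments
// are enabled. The "prod" category string is an assumption.
func opsgenieEnabled(category string, alerting *alertingSpec) bool {
	if alerting != nil && alerting.Opsgenie != nil {
		return *alerting.Opsgenie
	}
	return category == "prod"
}

func main() {
	off := false
	fmt.Println(opsgenieEnabled("prod", nil))
	fmt.Println(opsgenieEnabled("prod", &alertingSpec{Opsgenie: &off}))
	fmt.Println(opsgenieEnabled("test", nil))
}
```

Using a pointer rather than a plain `bool` is what lets a `prod` environment opt out while `test` environments stay off unless they opt in.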

Closes CORE-223
## Test plan
Tested locally by running `sg msp gen` for a `prod` env which doesn't
have an alerting config and verifying that notification suppression was
disabled.

Setting `env.alerting.opsgenie` to false enabled suppression again.

No changes to `test` environments unless `env.alerting.opsgenie` is set
to true.
2024-07-18 20:57:38 +01:00
Robert Lin
28348e7c80
feat/msp: allow enablement of logical replication features for Datastream (#63092)
Adds a new `postgreSQL.logicalReplication` configuration to allow MSP to
generate prerequisite setup for integration with Datastream:
https://cloud.google.com/datastream/docs/sources-postgresql. Integration
with Datastream allows the Data Analytics team to self-serve data
enrichment needs for the Telemetry V2 pipeline.

Enabling this feature entails downtime (Cloud SQL instance restart), so
enabling the logical replication feature at the Cloud SQL level
(`cloudsql.logical_decoding`) is gated behind
`postgreSQL.logicalReplication: {}`.

Setting up the required stuff in Postgres is a bit complicated,
requiring 3 Postgres provider instances:

1. The default admin one, authenticated with our admin user
2. New: a workload identity provider, using
https://github.com/cyrilgdn/terraform-provider-postgresql/pull/448 /
https://github.com/sourcegraph/managed-services-platform-cdktf/pull/11.
This is required for creating a publication on selected tables, which
requires being owner of said table. Because tables are created by the
application using e.g. auto-migrate, the workload identity is always the
table owner, so we need to impersonate the IAM user.
3. New: a "replication user" which is created with the replication
permission. Replication seems to not be a propagated permission so we
need a role/user that has replication enabled.

A bit more context scattered here and there in the docstrings.

Beyond the Postgres configuration we also introduce some additional
resources to enable easy Datastream configuration:

1. Datastream Private Connection, which peers to the service private
network
2. Cloud SQL Proxy VM, which only allows connections to `:5432` from the
range specified in 1, allowing a connection to the Cloud SQL instance
3. Datastream Connection Profile attached to 1

From there, the data team can click-ops or manage the Datastream Stream and
BigQuery destination on their own.

Closes CORE-165
Closes CORE-212

Sample config:

```yaml
  resources:
    postgreSQL:
      databases:
        - "primary"
      logicalReplication:
        publications:
          - name: testing
            database: primary
            tables:
              - users
```

## Test plan

https://github.com/sourcegraph/managed-services/pull/1569

## Changelog

- MSP services can now configure `postgreSQL.logicalReplication` to
enable Data Analytics team to replicate selected database tables into
BigQuery.
2024-07-05 18:24:44 +00:00
Robert Lin
2958abc326
fix/msp/postgresqlroles: wait for databases to be provisioned (#63362)
Wait for databases to be provisioned before granting database-specific
roles to the operator access user.

## Test plan

Re-apply fixed
https://sourcegraph.slack.com/archives/C05E2LHPQLX/p1718850688397579,
indicating a race condition on database creation. Diff looks good:

```diff
@@ -1447,10 +1472,15 @@
             "path": "cloudrun/cloudrun-postgresqlroles-msp_iam-operator_access_service_account_table_grant",
             "uniqueId": "cloudrun-postgresqlroles-msp_iam-operator_access_service_account_table_grant"
           }
         },
         "database": "msp_iam",
+        "depends_on": [
+          "google_sql_database.postgresql-database-enterprise-portal",
+          "google_sql_database.postgresql-database-enterprise_portal",
+          "google_sql_database.postgresql-database-msp_iam"
+        ],
         "object_type": "table",
         "objects": [
         ],
         "privileges": [
           "SELECT"
```

## Changelog

- MSP Cloud SQL: Fix race condition between database creation and role
grants for the read-only operator access user
2024-06-20 07:43:14 -07:00
Robert Lin
1aeb9c93f1
chore/msp: document gRPC notes in spec docstrings (#63140)
Lessons learned from
https://sourcegraph.slack.com/archives/C05E2LHPQLX/p1717703306405529

## Test plan

n/a
2024-06-06 14:20:50 -07:00
Robert Lin
27211dea73
feat/msp: update handbook link in alerts dashboard, sort custom alerts first (#63089)
1. The dashboard link still points to the old `go/msp-ops/...`, which no
longer works (CORE-105)
2. Alerts defined on top of the MSP defaults are probably of more
interest, so let's sort these in front of the others
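
The ordering in point 2 can be sketched with a stable sort. The `alert` type and its `Custom` flag are hypothetical simplifications, not the actual alert policy types.

```go
package main

import (
	"fmt"
	"sort"
)

// alert is a hypothetical, simplified alert policy; Custom marks alerts
// defined in the service spec rather than provided by the MSP defaults.
type alert struct {
	Name   string
	Custom bool
}

// sortCustomFirst moves custom alerts ahead of default ones while keeping
// the existing relative order within each group (stable sort).
func sortCustomFirst(alerts []alert) {
	sort.SliceStable(alerts, func(i, j int) bool {
		return alerts[i].Custom && !alerts[j].Custom
	})
}

func main() {
	alerts := []alert{{"cpu", false}, {"queue-depth", true}, {"memory", false}, {"error-rate", true}}
	sortCustomFirst(alerts)
	for _, a := range alerts {
		fmt.Println(a.Name)
	}
	// Prints queue-depth, error-rate, cpu, memory (one per line).
}
```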

## Test plan

Unit/golden tests
2024-06-05 09:09:22 -07:00
Robert Lin
a3fe573b59
fix/msp: flatten custom alert promQL query for GCP (#63084)
The GCP monitoring alert configuration expects, for some reason, a
single-line PromQL query only, otherwise the threshold doesn't work. In
configuration, however, we may want to write a multi-line query, for
ease of readability. This change automatically flattens the PromQL query
into a single line and strips extra spaces.
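
A minimal sketch of this kind of flattening, using only the standard library; the `flattenPromQL` name and the sample query are illustrative and not taken from the change itself.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenPromQL collapses every run of whitespace (including newlines)
// into a single space and trims both ends.
func flattenPromQL(query string) string {
	return strings.Join(strings.Fields(query), " ")
}

func main() {
	query := `
sum by (status) (
  rate(http_requests_total[5m])
) > 10
`
	fmt.Println(flattenPromQL(query))
	// Prints: sum by (status) ( rate(http_requests_total[5m]) ) > 10
}
```

One caveat of this simple approach is that it also collapses whitespace inside quoted label values, which would matter for matchers that contain repeated spaces.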

Part of CORE-161

## Test plan

Unit tests
2024-06-04 14:37:51 -07:00
Robert Lin
908d7119ea
chore/msp: blindly retry Notion page deletion (#63052)
Deleting Notion pages takes a very long time, and is prone to breaking in the page deletion step, where we must delete blocks one at a time because Notion does not allow for bulk block deletions. The errors seem to generally just be random Notion internal errors. This is very bad because it leaves go/msp-ops pages in an unusable state.

To try and mitigate, we add several places to blindly retry:

1. At the Notion SDK level, where a config option is available for retrying 429 errors
2. At the "reset page" helper level, where a failure to reset a page will prompt a retry of the whole helper
3. At the "delete blocks" helper level, where individual block deletion failures will be retried

Attempt to mitigate https://linear.app/sourcegraph/issue/CORE-119

While here, I also made some other QOL tweaks:

- Fix timing of sub-tasks in CLI output
- Bump default concurrency to 5 (our retries will handle if this is too aggressive, hopefully)
- Fix a missing space in generated docs

## Test plan

```
sg msp ops generate-handbook-pages
```
2024-06-03 22:32:06 +00:00
Robert Lin
617d2f766c
chore/msp/spec: tidy up custom alerts spec (#63050)
Follow-ups for #62885:

- Better docstrings for `mql`, `promql`
- `duration` -> `durationMinutes` to align with other config
- `alertpolicy.ResponseCodeMetric` -> `spec.CustomAlertCondition`: they're effectively the same type

Test plan: CI
2024-06-03 13:53:01 -07:00
Robert Lin
012db75133
fix/msp: make deadlineSeconds job-level configuration, apply in timeout (#63017)
In a rushed POC of MSP jobs, I did some pretty bad copy-pasting (evidenced by all the service-specific docstrings I have removed in this PR) and made a bad configuration decision here, resulting in a few issues:

1. `schedule.deadline` is not actually applied to Cloud Run jobs, causing jobs to time out earlier than desired
2. `schedule.deadline` is not the right place to configure a deadline, because _all_ jobs need a configurable deadline, not just those with schedules. This change moves `schedule.deadline` to `deadlineSeconds`.

Closes CORE-145

## Test plan

```
$ sg msp generate gatekeeper prod
$ git diff
```

```diff
diff --git a/services/gatekeeper/service.yaml b/services/gatekeeper/service.yaml
index fd6a3812..ce4b02e3 100644
--- a/services/gatekeeper/service.yaml
+++ b/services/gatekeeper/service.yaml
@@ -48,4 +48,4 @@ environments:
           - "primary"
     schedule:
       cron: 0 * * * *
-      deadline: 1800 # 30 minutes
+    deadlineSeconds: 1800 # 30 minutes
diff --git a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
index 3c2c295e..f83b32b9 100644
--- a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
+++ b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
@@ -281,7 +281,7 @@
                   },
                   {
                     "name": "JOB_EXECUTION_DEADLINE",
-                    "value": "600s"
+                    "value": "1800s"
                   }
                 ],
                 "image": "us.gcr.io/sourcegraph-dev/abuse-ban-bot:${var.resolved_image_tag}",
@@ -302,7 +302,7 @@
               }
             ],
             "service_account": "${data.terraform_remote_state.cross-stack-reference-input-iam.outputs.cross-stack-output-google_service_accountiam-workload-accountemail}",
-            "timeout": "300s",
+            "timeout": "1800s",
             "volumes": [
             ],
             "vpc_access": {
@@ -341,7 +341,7 @@
             "uniqueId": "job_scheduler"
           }
         },
-        "attempt_deadline": "600s",
+        "attempt_deadline": "1800s",
         "depends_on": [
           "google_cloud_run_v2_job_iam_member.cloudrun_scheduler_job_invoker"
         ],
```

## Changelog

- MSP jobs: `schedule.deadline` is deprecated, use the top-level `deadlineSeconds` instead. Configured deadlines are now correctly applied as the Cloud Run job execution timeout as well.
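
As a rough sketch of how a generator might resolve the deadline during the deprecation period: the types, the `resolveDeadline` helper, the fallback to `schedule.deadline`, and the default value are all assumptions for illustration; only the two field names come from the change above.

```go
package main

import "fmt"

// scheduleSpec and jobSpec are hypothetical, trimmed-down spec types.
type scheduleSpec struct {
	Cron     string
	Deadline *int // deprecated: schedule.deadline
}

type jobSpec struct {
	Schedule        *scheduleSpec
	DeadlineSeconds *int // preferred: top-level deadlineSeconds
}

// resolveDeadline returns the deadline as a seconds string such as "1800s".
// The top-level field wins over the deprecated one; fallbackSeconds is an
// assumed default, not a documented MSP value.
func resolveDeadline(job jobSpec, fallbackSeconds int) string {
	seconds := fallbackSeconds
	switch {
	case job.DeadlineSeconds != nil:
		seconds = *job.DeadlineSeconds
	case job.Schedule != nil && job.Schedule.Deadline != nil:
		seconds = *job.Schedule.Deadline
	}
	return fmt.Sprintf("%ds", seconds)
}

func main() {
	deadline := 1800
	job := jobSpec{
		Schedule:        &scheduleSpec{Cron: "0 * * * *"},
		DeadlineSeconds: &deadline,
	}
	// The same value would feed the job timeout, the scheduler's attempt
	// deadline, and the JOB_EXECUTION_DEADLINE environment variable.
	fmt.Println(resolveDeadline(job, 600)) // prints "1800s"
}
```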
2024-05-31 21:15:31 +00:00
Robert Lin
7170d4bd2b
feat/msp: add link to ops page in Slack channel description (#63011)
Minor QOL improvement: when you're in the Slack channel, chances are good that you'll want the ops docs at some point.

## Test plan

n/a

## Changelog

- MSP-provisioned alerts Slack channels now include a link to the service's generated operational docs (go/msp-ops) in the channel description.
2024-05-31 17:59:22 +00:00
James Cotter
2f4e3b9272
sg/msp: fix nil domain and EnvironmentDomainTypeNone in diagram gen (#62982)
2024-05-30 17:58:00 +01:00
Robert Lin
9e4a8e8033
feat/sg/msp: add 'sg msp validate' for validating service specifications (#62973)
2024-05-30 09:11:36 -07:00
Robert Lin
27f0d725ac
feat/msp/spec: require notionPageID if a production env is provisioned (#62972)
2024-05-30 09:01:42 -07:00
Robert Lin
cb62afa2c2
fix/msp: test for cron interval changes based on time, add more restrictions (#62969)
Addresses problem noticed in https://github.com/sourcegraph/managed-services/pull/1486#issuecomment-2137887423

## Test plan

Unit tests

## Changelog

- Fixed an issue with output of `sg msp generate` for MSP jobs with particular schedules changing throughout the week
- MSP job schedules must now have an interval between 15 minutes (most frequent) and one week (least frequent)
2024-05-29 18:24:39 -07:00
James Cotter
cb71a2d529
sg/msp: support for super-simple alerts on custom metrics (#62885)
---------

Co-authored-by: Joe Chen <joe@sourcegraph.com>
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-05-24 20:47:19 +01:00
Robert Lin
e0a7c0d3a6
fix/msp/spec: validate against LivenessInterval that is too high (#62872)
Guards against another one of those "only fails at apply time" things (https://github.com/sourcegraph/managed-services/pull/1459)
2024-05-22 22:09:37 +00:00
Robert Lin
6c59b02534
feat/msp: do not use tfvars file outside of deploy-type 'subscription' (#62704)
Closes CORE-121

The dependency on the generated `tfvars` file is frustrating for first-time MSP setup because it currently requires `-stable=false` to update, and doesn't actually serve any purpose for deploy types other than `subscription` (which uses it to isolate image changes that happen via GitHub Actions). This makes it so that we don't generate, or depend on, the dynamic `tfvars` file unless you are using `subscription`.

I've also added a rollout spec configuration, `initialImageTag`, to make the initial tag we provision environments with configurable (as some services might not publish `insiders` images) - see the docstring.

## Test plan

Inspect output of `sg msp generate -all`
2024-05-16 09:43:47 -07:00
Noah S-C
9b6ba7741e
bazel: transcribe test ownership to bazel tags (#62664)
2024-05-16 15:51:16 +01:00
James Cotter
d1404951eb
sg/msp: fix CustomTargetType reference in Target definition (#62727)
2024-05-16 13:32:37 +01:00
James Cotter
75356f8606
sg/msp: clarify repository annotation meaning in delivery pipeline (#62703)
PR feedback from: https://github.com/sourcegraph/sourcegraph/pull/62702
2024-05-15 21:46:39 +01:00
James Cotter
3b394e7954
sg/msp: add repo annotation to delivery pipeline (#62702)
2024-05-15 12:35:00 -07:00
Robert Lin
cb15cea2b0
msp/cloudrun: use GA launch stage (#62685)
VPC direct egress is now GA: see example in https://registry.terraform.io/providers/hashicorp/google/5.29.0/docs/resources/cloud_run_v2_service#example-usage---cloudrunv2-service-directvpc and https://cloud.google.com/run/docs/configuring/vpc-direct-vpc

This also fixes the infinite `GA` -> `BETA` drift we have in TFC
2024-05-15 17:32:54 +01:00
Robert Lin
cc6cfd8499
msp/rollouts: remove Cloud Deploy target import (#62687)
Now that #62644 (CORE-23) is rolled out, this import block is no longer needed (and may even be disruptive when provisioning new rollout pipelines). The change was rolled out in:

- https://github.com/sourcegraph/managed-services/pull/1416
- https://github.com/sourcegraph/managed-services/pull/1417
- https://github.com/sourcegraph/managed-services/pull/1403

## Test plan

n/a
2024-05-15 17:32:54 +01:00
Robert Lin
456315b54d
msp/rollouts: use new in-terraform custom target provisioning (#62644)
Closes CORE-23 - this change removes the manual `gcloud deploy apply` step previously required to enable MSP rollouts, thanks to a recent release of the Google Terraform provider.

## Test plan

https://github.com/sourcegraph/managed-services/pull/1403
2024-05-14 18:51:33 -07:00
Robert Lin
7308d16db9
msp/terraform: upgrade to 1.7.5 (#62650)
According to https://developer.hashicorp.com/terraform/language/v1.7.x/upgrade-guides this should be compatible with our current version, 1.3.10

We need to upgrade to use `import` blocks (TF 1.5), which will make https://github.com/sourcegraph/sourcegraph/pull/62644 and CORE-23 capable of a smooth rollout (otherwise we encounter conflict with the previously hand-deployed resources).

This also requires our CDKTF modules to be regenerated with the new Terraform version: https://github.com/sourcegraph/managed-services-platform-cdktf/pull/10

## Test plan

n/a - will do a staged rollout per https://www.notion.so/sourcegraph/MSP-infrastructure-upgrades-1808e7e45bd54f419dd93af542d99238?pvs=4
2024-05-14 12:33:06 -07:00
James Cotter
1d2076fc87
sg/msp: fix typo in exernal_health_check description (#62659)
2024-05-14 12:31:02 -07:00
Robert Lin
71555cc0b1
msp/operationdocs: fix bad formatting (#62641)
Noticed some leftover awkward formatting from https://github.com/sourcegraph/sourcegraph/pull/62607

## Test plan

Golden tests
2024-05-13 14:31:01 -07:00
James Cotter
cf9bcb3d80
sg/msp: upgrade sentry (#62636)
2024-05-13 10:38:18 -07:00
Robert Lin
fdf0bf9a02
msp/operationdocs: add incident response starter guide, Notion-specific formatting (#62607)
Closes CORE-20: adds a small per-service "incident response" section near the alerts reference section of each service, providing some simple starter context and linking to other relevant guidance.

This change also makes some Notion-oriented formatting tweaks: putting all paragraphs on a single line (because of https://github.com/sourcegraph/notionreposync/issues/9) and also rendering callouts with appropriate background colors (https://github.com/sourcegraph/notionreposync/pull/11).

## Test plan

Golden tests, roll out to Notion:

```sh
GITHUB_ACTIONS=true sg msp ops generate-handbook-pages
```

Incident response:

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/d07e0071-870f-4acb-b4a4-2246b40850a3)

Callouts:

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/6ec7dbea-cafd-40e0-b50c-780c4e9cbd22)
2024-05-10 23:56:41 +00:00
Robert Lin
7b6dd9080e
msp: centralize and expose locations configuration (#62604)
This change adds a `locations: { gcpRegion: "...", gcpLocation: "..." }` configuration to centralize all location-related options. `gcpRegion` specifies regional preferences, while `gcpLocation` specifies multi-regional preferences (for resources that support it - only BigQuery in most cases).

Closes CORE-24 - see issue for some context.

## Test plan

```
sg msp generate -all # no diff
```

```
sg msp schema -output='../managed-services/schema/service.schema.json'
```
2024-05-10 15:50:07 -07:00
James Cotter
2d5ed2e735
sg/msp: add cloud deploy pubsub notifications (#62596)
---------

Co-authored-by: Joe Chen <joe@sourcegraph.com>
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-05-10 22:51:47 +01:00
Robert Lin
022b4ad95f
msp/terraformcloud: add option to respect existing run mode (#62580)
When using https://github.com/sourcegraph/sourcegraph/pull/62565, we override test environments that are in CLI mode, which can cause infra to be rolled out by surprise via VCS mode on switch - this change adds an option to respect the existing run mode configuration via `-workspace-run-mode=ignore`.

Thread: https://sourcegraph.slack.com/archives/C06JENN2QBF/p1715256898022469?thread_ts=1715251558.736709&cid=C06JENN2QBF

## Test plan

```
sg msp tfc sync -all
👉 Syncing all environments for all services, including setting ALL workspaces to use run mode "vcs" (use '-workspace-run-mode=ignore' to respect the existing run mode) - are you sure? (y/N)  N
 aborting
Projects/sourcegraph/managed-services 1 » sg msp tfc sync -all -workspace-run-mode=ignore
👉 Syncing all environments for all services - are you sure? (y/N)  y
// ...
```
2024-05-09 14:57:40 -07:00
Robert Lin
4d6455996c
msp: add infra and runtime support for job checkins (#62508)
Closes CORE-21 - allows jobs to register check-ins using Sentry when they are configured as cron jobs: https://docs.sentry.io/product/crons/, for a nice view of "is my job running or nah" without using GCP's less-than-beautiful console views

1. Adds the configured schedule and deadline as environment variables for MSP jobs
2. Adds a contract mechanism for checking in, for example:
```go
	func work(ctx context.Context) (err error) {
		done, err := c.Diagnostics.JobExecutionCheckIn(ctx)
		if err != nil { /* failed to register check-in */ }
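		// Use a closure so done receives the final value of err; arguments
		// to a plain 'defer done(err)' are evaluated immediately.
		defer func() { done(err) }()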

		// ... do work
	}
```

## Test plan

```sh
TestJobExecutionCheckIn_SENTRY_DSN='...' go test -v ./runtime/contract
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/8998af89-e74a-44a5-939a-92c8b63ea262)

In Slack:

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/0677e2db-5a33-4751-ae86-d43e5b1e159f)

It appears the message is not currently customizable: https://develop.sentry.dev/sdk/check-ins/

---------

Co-authored-by: Joe Chen <joe@sourcegraph.com>
2024-05-09 10:11:48 -07:00
Robert Lin
1463f6724f
msp/terraformcloud: grant 'sso' team read access to MSP workspaces (#62559)
2024-05-09 09:53:20 -07:00
Robert Lin
e1fa3393b5
msp: update msp-ops links (#62435)
With https://github.com/sourcegraph/sourcegraph/pull/62325 landed, this updates a few references:

1. Service golinks don't work anymore: CORE-105
2. Alerts docs now point to the service Notion page
2024-05-06 10:36:15 -07:00
Robert Lin
f444774570
msp/operationdocs: write to Notion instead of Markdown (#62325)
Closes https://github.com/sourcegraph/managed-services/issues/1076 aka closes CORE-28 - https://github.com/sourcegraph/managed-services/pull/1332 also automated the updates.

Notes:

1. Notion anchors are ID-based, so we strip all anchor links because we cannot generate them in one pass. `notionreposync` may implement a mechanism for this in the future (filed https://github.com/sourcegraph/notionreposync/issues/8), but for now we don't have a great way around this, especially because of the next point
2. In-place updates are hard because of the block structure, so we destroy page contents and recreate the page every time. This causes a "flicker" as a viewer may see the page disappear slowly before their eyes (we can only delete things 1 block at a time). `notionreposync` may implement an improved mechanism for this in the future (filed https://github.com/sourcegraph/notionreposync/issues/7)
3. There's something funky going wrong with line breaks, filed https://github.com/sourcegraph/notionreposync/issues/9 - it hurts readability but it's not unmanageable.
4. Sadly Notion does not allow API file uploads (😡 https://developers.notion.com/docs/working-with-files-and-media#uploading-files-and-media-via-the-notion-api), so we generate them into the managed-services repo (https://github.com/sourcegraph/managed-services/pull/1332) and then just link to those diagrams from the generated page. We use a Markdown file that renders the SVG because the native SVG viewer sucks.
5. Made misc changes to operationdocs output where Notion's version is noticeably worse, or difficult to support (tables, lists-in-admonitions, etc)
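
For note 1, stripping anchor links from generated Markdown might look like the sketch below. It assumes that "stripping" means keeping the link text and dropping the `#anchor` target; the regular expression and the `stripAnchorLinks` helper are illustrative, not the actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
)

// anchorLink matches Markdown links whose target is an in-page anchor,
// such as [Alerts](#alerts), and captures the link text.
var anchorLink = regexp.MustCompile(`\[([^\]]+)\]\(#[^)]*\)`)

// stripAnchorLinks replaces in-page anchor links with their plain text and
// leaves links to other targets untouched.
func stripAnchorLinks(markdown string) string {
	return anchorLink.ReplaceAllString(markdown, "$1")
}

func main() {
	in := "See [Alerts](#alerts) or the [handbook](https://example.com/handbook#alerts)."
	fmt.Println(stripAnchorLinks(in))
	// Prints: See Alerts or the [handbook](https://example.com/handbook#alerts).
}
```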

Depends on various improvements upstream in https://github.com/sourcegraph/notionreposync:

- https://github.com/sourcegraph/notionreposync/pull/4
- https://github.com/sourcegraph/notionreposync/pull/5
- https://github.com/sourcegraph/notionreposync/pull/6

Follow-up improvements:

- CORE-105
- CORE-106

## Test plan

```
sg msp ops generate-handbook-pages
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/9d314a97-5370-4123-9534-f9f897002110)

https://sourcegraph.notion.site/Build-Tracker-infrastructure-operations-bd66bf25d65d41b4875874a6f4d350cc

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/69e1eb48-2fa9-421b-b2fd-969e25f37fee)

https://github.com/sourcegraph/managed-services/pull/1332

---------

Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
2024-05-03 21:17:16 +00:00
James Cotter
6d7082d26e
sg/msp: architecture diagrams (#62213)
2024-05-01 13:57:34 +01:00
Robert Lin
fd2b746b02
msp/redis: disable HA by default (#62210)
https://github.com/sourcegraph/managed-services/pull/1307 "pinned" all services that should have HA Redis - this change flips the default to the lower-cost alternative.

## Test plan

https://github.com/sourcegraph/managed-services/actions/runs/8884108494
2024-04-30 11:26:48 -07:00
James Cotter
4909533715
sg/msp: upgrade google/google_beta to v5.26.0 (#62251)
2024-04-29 19:34:28 +00:00
Robert Lin
54245a7a0d
msp/spec: exclude populated README from YAML (#62215)
2024-04-26 20:47:09 +00:00
Robert Lin
8986f9cd99
msp/redis: make tier changes graceful (#62211)
Makes the downgrade option introduced in https://github.com/sourcegraph/sourcegraph/pull/62137 usable, since a tier change in Redis is a force-recreate situation (unlike Cloud SQL)

## Test plan

- [x] Applied downgrade directly in msp-testbed-robert https://app.terraform.io/app/sourcegraph/workspaces/msp-msp-testbed-robert-cloudrun/runs
- [x] Switched robert back to VCS mode, monitored upgrade https://app.terraform.io/app/sourcegraph/workspaces/msp-msp-testbed-robert-cloudrun/runs/run-PFTNBXLBdZGWTm1Z

No alerts fired during either of the above
2024-04-26 12:22:50 -07:00
Robert Lin
8069fabdc3
managedservicesplatform: add generalized 'HA' toggle for Redis, Cloud SQL (#62137)
This change adds an explicit 'HA' toggle for Redis and Cloud SQL, as part of investigation into https://github.com/sourcegraph/managed-services/issues/311:

- `redis.highAvailability`: sets the standard HA mode. Right now, we do this by default to preserve existing behaviour - a follow-up to this PR would be to explicitly set the HA mode on production services, then make this default `false`.
- `postgreSQL.highAvailability`: enables regional mode, and also point-in-time-recovery as required. This could be quite expensive - we have ~$200/mo of Cloud SQL expenses, mostly in CPU before discounts, so this is projected to ~double that bill, but would be the simplest and most reliable way to maintain uptime in the event of a zonal failover.

The plan is not necessarily to immediately make use of `postgreSQL.highAvailability` but have the option open, as the configuration is trivial - we could still develop a manual failover process to adhere to requirements outlined in https://github.com/sourcegraph/managed-services/issues/311

Preparing a summary here: https://www.notion.so/sourcegraph/MSP-service-availability-655e89d164b24727803f5e5a603226d8?pvs=4

## Test plan

`sg msp generate -all` has no diff
2024-04-25 20:48:23 +00:00
Robert Lin
3623ecb2cf
sg/msp: filter generated environments by category (#62131)
Part of https://github.com/sourcegraph/managed-services/issues/599

See https://github.com/sourcegraph/managed-services/pull/1288 for how this mechanism will be used.

## Test plan

```sh
sg msp generate -all -category=test
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/d27c5df5-6d13-43fa-96c2-5523da24693a)
2024-04-24 09:44:16 -07:00
James Cotter
cd44cf24db
sg/msp: remove flaky projectid test (#62155)
2024-04-24 15:04:51 +00:00
James Cotter
738a37c7a9
sg/msp: add alert policy documentation to generated ops pages (#61939)
2024-04-18 19:46:19 +00:00
Robert Lin
2189b2991f
msp/cloudrun: use VPC direct egress (#60466)
Adopts [Cloud Run VPC direct egress](https://cloud.google.com/run/docs/configuring/vpc-direct-vpc) for private networks. Private networks are used by MSP services that connect to Cloud SQL, Memorystore (Redis), or other MSP services via VPC-SC perimeters. On paper, this should give us:

- Likely smaller bill, as we no longer pay for serverless VPC connector VMs
- Reduced latency on traffic through private network
- Reduced latency on traffic spikes as serverless VPC connector no longer needs to scale out

There are some caveats we are discussing in Slack: https://sourcegraph.slack.com/archives/C05E2LHPQLX/p1713324250520539

Closes https://github.com/sourcegraph/managed-services/issues/317

## Test plan

Rolled this out without downtime to `msp-testbed: robert`. The VPC private networking test used in https://github.com/sourcegraph/managed-services/pull/1024 still works.
2024-04-17 17:13:50 +00:00
James Cotter
c9a53faea1
msp: GCP Monitoring Dashboard (#61761)
2024-04-17 16:51:36 +01:00
Robert Lin
c02266057c
msp/project: enable accesscontextmanager APIs (#61679)
Prep work for https://github.com/sourcegraph/managed-services/issues/660 - required to manage some VPC SC perimeter APIs

## Test plan

n/a
2024-04-08 18:12:41 -04:00
James Cotter
d26be1349b
msp: update ops with deployment info (#61610)
* msp: update ops with deployment info

* Update dev/managedservicesplatform/operationdocs/operationdocs.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* add rollout section

* sentence case

* Update dev/managedservicesplatform/operationdocs/operationdocs.go

Co-authored-by: Robert Lin <robert@bobheadxi.dev>

* update test files

---------

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
2024-04-08 13:18:58 +01:00
Robert Lin
3dcfd4c53d
msp/privatenetwork: allow PrivateIpGoogleAccess from subnet (#61648)
Minor change to allow option 2 as outlined in https://github.com/sourcegraph/managed-services/issues/660#issuecomment-2027394872, but also could be beneficial in the future to investigate using private google access for e.g. Telemetry Gateway publishing to Cloud Run, which is pretty high-volume, and even for simple things like traces/metric export (tracked in https://github.com/sourcegraph/managed-services/issues/1093)

## Test plan

Deployed to msp-testbed-robert
2024-04-06 00:57:31 +09:00