1. The dashboard link still points to the old `go/msp-ops/...`, which no
longer works (CORE-105)
2. Alerts defined on top of the MSP defaults are probably of more
interest, so let's sort these in front of the others
## Test plan
Unit/golden tests
The GCP monitoring alert configuration expects, for some reason, a
single-line PromQL query; otherwise the threshold doesn't work. In
configuration, however, we may want to write a multi-line query for
readability. This change automatically flattens the PromQL query
into a single line and strips extra spaces.
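As a rough illustration of the flattening described above, here is a minimal sketch. The function name `flattenPromQL` and the sample query are assumptions for illustration and may not match the actual implementation in this change.

```go
package main

import (
	"fmt"
	"strings"
)

// flattenPromQL is a hypothetical helper: it collapses a multi-line PromQL
// query into a single line. strings.Fields splits on any run of whitespace
// (spaces, tabs, newlines), so re-joining with a single space also strips
// the extra spaces. Caveat: this naive approach also collapses whitespace
// inside quoted label values.
func flattenPromQL(query string) string {
	return strings.Join(strings.Fields(query), " ")
}

func main() {
	// Sample query, written across several lines for readability.
	query := `
sum by (job) (
  rate(http_requests_total{code=~"5.."}[5m])
) > 0.1
`
	fmt.Println(flattenPromQL(query))
}
```

Splitting on whitespace and re-joining handles both steps (removing newlines and stripping repeated spaces) in one pass, at the cost of not preserving whitespace inside string literals.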
Part of CORE-161
## Test plan
Unit tests
Deleting Notion pages takes a very long time, and is prone to breaking in the page deletion step, where we must delete blocks one at a time because Notion does not allow for bulk block deletions. The errors seem to generally just be random Notion internal errors. This is very bad because it leaves go/msp-ops pages in an unusable state.
To try and mitigate, we add several places to blindly retry:
1. At the Notion SDK level, where a config option is available for retrying 429 errors
2. At the "reset page" helper level, where a failure to reset a page will prompt a retry of the whole helper
3. At the "delete blocks" helper level, where individual block deletion failures will be retried
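The nested "blind retry" idea at the helper levels could look roughly like the sketch below. The `retry` helper, `errTransient`, and the simulated flaky deletion are all hypothetical names for illustration. They are not the Notion SDK's API, and the real helpers may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient stands in for a random Notion internal error (hypothetical).
var errTransient = errors.New("notion: internal error")

// retry calls fn up to attempts times, sleeping for delay between failed
// attempts. It returns nil on the first success, or the last error wrapped
// with the attempt count.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	// Simulated single-block deletion that fails twice before succeeding.
	deleteBlock := func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	}

	// Outer retry: the whole "reset page" helper.
	// Inner retry: an individual block deletion.
	err := retry(2, 10*time.Millisecond, func() error {
		return retry(3, 10*time.Millisecond, deleteBlock)
	})
	fmt.Println("error:", err, "calls:", calls)
}
```

Retrying at more than one level multiplies the worst-case number of attempts, which is a reasonable trade-off here because the failures appear to be transient.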
Attempt to mitigate https://linear.app/sourcegraph/issue/CORE-119
While here, I also made some other QOL tweaks:
- Fix timing of sub-tasks in CLI output
- Bump default concurrency to 5 (hopefully our retries will handle it if this is too aggressive)
- Fix a missing space in generated docs
## Test plan
```
sg msp ops generate-handbook-pages
```
Follow-ups for #62885:
- Better docstrings for `mql`, `promql`
- `duration` -> `durationMinutes` to align with other config
- `alertpolicy.ResponseCodeMetric` -> `spec.CustomAlertCondition`: they're effectively the same type
Test plan: CI
In a rushed POC of MSP jobs, I did some pretty bad copy-pasting (evidenced by all the service-specific docstrings I have removed in this PR) and made a bad configuration decision here, resulting in a few issues:
1. `schedule.deadline` is not actually applied to Cloud Run jobs, causing jobs to time out earlier than desired
2. `schedule.deadline` is not the right place to configure a deadline, because _all_ jobs need a configurable deadline, not just those with schedules. This change moves `schedule.deadline` to `deadlineSeconds`.
Closes CORE-145
## Test plan
```
$ sg msp generate gatekeeper prod
$ git diff
```
```diff
diff --git a/services/gatekeeper/service.yaml b/services/gatekeeper/service.yaml
index fd6a3812..ce4b02e3 100644
--- a/services/gatekeeper/service.yaml
+++ b/services/gatekeeper/service.yaml
@@ -48,4 +48,4 @@ environments:
- "primary"
schedule:
cron: 0 * * * *
- deadline: 1800 # 30 minutes
+ deadlineSeconds: 1800 # 30 minutes
diff --git a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
index 3c2c295e..f83b32b9 100644
--- a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
+++ b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
@@ -281,7 +281,7 @@
},
{
"name": "JOB_EXECUTION_DEADLINE",
- "value": "600s"
+ "value": "1800s"
}
],
"image": "us.gcr.io/sourcegraph-dev/abuse-ban-bot:${var.resolved_image_tag}",
@@ -302,7 +302,7 @@
}
],
"service_account": "${data.terraform_remote_state.cross-stack-reference-input-iam.outputs.cross-stack-output-google_service_accountiam-workload-accountemail}",
- "timeout": "300s",
+ "timeout": "1800s",
"volumes": [
],
"vpc_access": {
@@ -341,7 +341,7 @@
"uniqueId": "job_scheduler"
}
},
- "attempt_deadline": "600s",
+ "attempt_deadline": "1800s",
"depends_on": [
"google_cloud_run_v2_job_iam_member.cloudrun_scheduler_job_invoker"
],
```
## Changelog
- MSP jobs: `schedule.deadline` is deprecated, use the top-level `deadlineSeconds` instead. Configured deadlines are now correctly applied as the Cloud Run job execution timeout as well.
Minor QOL improvement: when you're in the Slack channel, chances are good that you'll want the ops docs at some point.
## Test plan
n/a
## Changelog
- MSP-provisioned alerts Slack channels now include a link to the service's generated operational docs (go/msp-ops) in the channel description.
Addresses problem noticed in https://github.com/sourcegraph/managed-services/pull/1486#issuecomment-2137887423
## Test plan
Unit tests
## Changelog
- Fixed an issue where the output of `sg msp generate` for MSP jobs with particular schedules would change throughout the week
- MSP job schedules must now run no more often than every 15 minutes and no less often than once a week
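
The schedule bounds in the changelog above amount to a range check on the interval between runs. A minimal sketch follows, assuming the interval has already been derived from the cron expression. The names `minInterval`, `maxInterval`, and `validateInterval`, and the error wording, are hypothetical and not taken from the actual change.

```go
package main

import (
	"fmt"
	"time"
)

// Bounds from the changelog: at most every 15 minutes, at least weekly.
const (
	minInterval = 15 * time.Minute
	maxInterval = 7 * 24 * time.Hour
)

// validateInterval is a hypothetical check that the time between two
// consecutive scheduled runs falls within the allowed bounds (inclusive).
func validateInterval(interval time.Duration) error {
	if interval < minInterval {
		return fmt.Errorf("schedule runs every %s, but the shortest allowed interval is %s",
			interval, minInterval)
	}
	if interval > maxInterval {
		return fmt.Errorf("schedule runs every %s, but the longest allowed interval is %s",
			interval, maxInterval)
	}
	return nil
}

func main() {
	for _, d := range []time.Duration{5 * time.Minute, time.Hour, 30 * 24 * time.Hour} {
		fmt.Printf("%s: %v\n", d, validateInterval(d))
	}
}
```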
Closes CORE-121
The dependency on the generated `tfvars` file is frustrating for first-time MSP setup because it currently requires `-stable=false` to update, and doesn't actually serve any purpose for deploy types other than `subscription` (which uses it to isolate image changes that happen via GitHub Actions). This makes it so that we don't generate, or depend on, the dynamic `tfvars` file unless you are using `subscription`.
I've also added a rollout spec configuration, `initialImageTag`, to make the initial tag we provision environments with configurable (as some services might not publish `insiders` images) - see the docstring.
## Test plan
Inspect output of `sg msp generate -all`
Closes CORE-23 - this change removes the manual `gcloud deploy apply` step previously required to enable MSP rollouts, thanks to a recent release of the Google Terraform provider.
## Test plan
https://github.com/sourcegraph/managed-services/pull/1403
This change adds a `locations: { gcpRegion: "...", gcpLocation: "..." }` configuration to centralize all location-related options. `gcpRegion` specifies regional preferences, while `gcpLocation` specifies multi-regional preferences (for resources that support it - only BigQuery in most cases).
Closes CORE-24 - see issue for some context.
## Test plan
```
sg msp generate -all # no diff
```
```
sg msp schema -output='../managed-services/schema/service.schema.json'
```
When using https://github.com/sourcegraph/sourcegraph/pull/62565, we override test environments that are in CLI mode, which can cause infra to be rolled out by surprise via VCS mode on switch - this change adds an option to respect the existing run mode configuration via `-workspace-run-mode=ignore`.
Thread: https://sourcegraph.slack.com/archives/C06JENN2QBF/p1715256898022469?thread_ts=1715251558.736709&cid=C06JENN2QBF
## Test plan
```
sg msp tfc sync -all
👉 Syncing all environments for all services, including setting ALL workspaces to use run mode "vcs" (use '-workspace-run-mode=ignore' to respect the existing run mode) - are you sure? (y/N) N
❌ aborting
Projects/sourcegraph/managed-services 1 » sg msp tfc sync -all -workspace-run-mode=ignore
👉 Syncing all environments for all services - are you sure? (y/N) y
// ...
```
This change adds an explicit 'HA' toggle for Redis and Cloud SQL, as part of investigation into https://github.com/sourcegraph/managed-services/issues/311:
- `redis.highAvailability`: sets the standard HA mode. Right now, we do this by default to preserve existing behaviour - a follow-up to this PR would be to explicitly set the HA mode on production services, then make this default `false`.
- `postgreSQL.highAvailability`: enables regional mode, and also point-in-time-recovery as required. This could be quite expensive - we have ~$200/mo of Cloud SQL expenses, mostly in CPU before discounts, so this is projected to ~double that bill, but would be the simplest and most reliable way to maintain uptime in the event of a zonal failover.
The plan is not necessarily to immediately make use of `postgreSQL.highAvailability` but have the option open, as the configuration is trivial - we could still develop a manual failover process to adhere to requirements outlined in https://github.com/sourcegraph/managed-services/issues/311
Preparing a summary here: https://www.notion.so/sourcegraph/MSP-service-availability-655e89d164b24727803f5e5a603226d8?pvs=4
## Test plan
`sg msp generate -all` has no diff
* msp: update ops with deployment info
* Update dev/managedservicesplatform/operationdocs/operationdocs.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* add rollout section
* sentence case
* Update dev/managedservicesplatform/operationdocs/operationdocs.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* update test files
---------
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
This change migrates `generate` and `tfc sync` to use our service/env argument getters so that we return more consistent error messages. Errors around non-existent or missing service/env arguments now also provide relevant lists of possible values, such as all available services or all available environments for a valid service argument (see test plan examples).
Hopefully this makes errors easier to understand, as the possible values should give a better hint as to what arguments the command expects.
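
To illustrate the kind of error described above, here is a minimal sketch of a lookup that lists the possible values when an argument doesn't match. The `services` map, the `resolveServiceEnv` function, and the error wording are hypothetical and do not reflect the actual argument getters or their messages.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// services is hypothetical sample data: service ID -> environment IDs.
var services = map[string][]string{
	"gatekeeper":      {"dev", "prod"},
	"example-service": {"test"},
}

// resolveServiceEnv returns nil if the service and environment exist.
// Otherwise it returns an error listing the values that would have been
// valid: all services, or all environments of a valid service.
func resolveServiceEnv(serviceID, envID string) error {
	envs, ok := services[serviceID]
	if !ok {
		ids := make([]string, 0, len(services))
		for id := range services {
			ids = append(ids, id)
		}
		sort.Strings(ids) // map iteration order is random; sort for stable output
		return fmt.Errorf("service %q not found, available services: [%s]",
			serviceID, strings.Join(ids, ", "))
	}
	for _, e := range envs {
		if e == envID {
			return nil
		}
	}
	return fmt.Errorf("environment %q not found for service %q, available environments: [%s]",
		envID, serviceID, strings.Join(envs, ", "))
}

func main() {
	fmt.Println(resolveServiceEnv("gatekeepr", "prod"))
	fmt.Println(resolveServiceEnv("gatekeeper", "staging"))
}
```

Sorting the service IDs keeps the message deterministic, which matters if the output is compared in golden tests.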