Opsgenie alert notifications for critical alerts should be enabled by
default for production projects or where `env.alerting.opsgenie` is set
to true.
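For illustration, a non-production environment opting in might look like this (a minimal sketch; surrounding spec fields are omitted and the exact nesting is assumed):
```yaml
environments:
  - id: dev
    # assumption: non-prod environments must opt in to Opsgenie explicitly
    alerting:
      opsgenie: true
```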
Closes CORE-223
## Test plan
Tested locally by running `sg msp gen` for a `prod` env which doesn't
have an alerting config, and verifying that notification suppression was
disabled.
Setting `env.alerting.opsgenie` to false enabled suppression again.
No changes to `test` environments unless `env.alerting.opsgenie` is set
to true.
Adds a new `postgreSQL.logicalReplication` configuration to allow MSP to
generate prerequisite setup for integration with Datastream:
https://cloud.google.com/datastream/docs/sources-postgresql. Integration
with Datastream allows the Data Analytics team to self-serve data
enrichment needs for the Telemetry V2 pipeline.
Enabling this feature entails downtime (a Cloud SQL instance restart), so
turning on logical replication at the Cloud SQL level
(`cloudsql.logical_decoding`) is gated behind setting
`postgreSQL.logicalReplication: {}`.
Setting up the required configuration in Postgres is a bit involved,
requiring 3 Postgres provider instances:
1. The default admin provider, authenticated as our admin user
2. New: a workload identity provider, using
https://github.com/cyrilgdn/terraform-provider-postgresql/pull/448 /
https://github.com/sourcegraph/managed-services-platform-cdktf/pull/11.
This is required for creating a publication on selected tables, which
requires being the owner of said tables. Because tables are created by
the application (e.g. via auto-migrate), the workload identity is always
the table owner, so we need to impersonate the IAM user
3. New: a "replication user", created with the replication permission.
Replication does not seem to be a propagated permission, so we need a
role/user that has replication enabled.
A bit more context is scattered throughout the docstrings.
Beyond the Postgres configuration, we also introduce some additional
resources to enable easy Datastream configuration:
1. Datastream Private Connection, which peers with the service's private
network
2. Cloud SQL Proxy VM, which accepts connections to `:5432` only from the
range specified in 1, providing a path to the Cloud SQL instance
3. Datastream Connection Profile, attached to the Private Connection in 1
From there, the data team can click-ops or otherwise manage the
Datastream Stream and BigQuery destination on their own.
Closes CORE-165
Closes CORE-212
Sample config:
```yaml
resources:
  postgreSQL:
    databases:
      - "primary"
    logicalReplication:
      publications:
        - name: testing
          database: primary
          tables:
            - users
```
## Test plan
https://github.com/sourcegraph/managed-services/pull/1569
## Changelog
- MSP services can now configure `postgreSQL.logicalReplication` to
enable the Data Analytics team to replicate selected database tables into
BigQuery.
1. The dashboard link still points to the old `go/msp-ops/...` pages,
which no longer work (CORE-105)
2. Alerts defined on top of the MSP defaults are probably of more
interest, so let's sort them ahead of the others
## Test plan
Unit/golden tests
The GCP monitoring alert configuration expects, for some reason, a
single-line PromQL query only; otherwise, the threshold doesn't work. In
configuration, however, we may want to write a multi-line query for ease
of readability. This change automatically flattens the PromQL query into
a single line and strips extra spaces.
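For example, a query written across multiple lines for readability (the field names here are illustrative, not the exact spec):
```yaml
condition:
  promql: |
    sum(rate(http_requests_total{code=~"5.."}[5m]))
      /
    sum(rate(http_requests_total[5m]))
```
would be flattened to a single line, `sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`, before being applied.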
Part of CORE-161
## Test plan
Unit tests
Deleting Notion pages takes a very long time and is prone to breaking in the page deletion step, where we must delete blocks one at a time because Notion does not allow bulk block deletions. The errors generally seem to just be random Notion internal errors. This is very bad because it leaves go/msp-ops pages in an unusable state.
To try to mitigate this, we add blind retries in several places:
1. At the Notion SDK level, where a config option is available for retrying 429 errors
2. At the "reset page" helper level, where a failure to reset a page will prompt a retry of the whole helper
3. At the "delete blocks" helper level, where individual block deletion failures will be retried
Attempt to mitigate https://linear.app/sourcegraph/issue/CORE-119
While here, I also made some other QOL tweaks:
- Fix timing of sub-tasks in CLI output
- Bump default concurrency to 5 (hopefully our retries will compensate if this proves too aggressive)
- Fix a missing space in generated docs
## Test plan
```
sg msp ops generate-handbook-pages
```
Follow-ups for #62885:
- Better docstrings for `mql`, `promql`
- `duration` -> `durationMinutes` to align with other config (see the sketch after this list)
- `alertpolicy.ResponseCodeMetric` -> `spec.CustomAlertCondition`: they're effectively the same type
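A sketch of the rename in a custom alert condition (the nesting here is assumed, not verified against the final spec):
```yaml
customAlerts:
  - condition:
      promql: "up == 0"
      # previously `duration`; renamed to align with other config
      durationMinutes: 5
```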
Test plan: CI
In a rushed POC of MSP jobs, I did some pretty bad copy-pasting (evidenced by all the service-specific docstrings I have removed in this PR) and made a bad configuration decision here, resulting in a few issues:
1. `schedule.deadline` is not actually applied to Cloud Run jobs, causing jobs to time out earlier than desired
2. `schedule.deadline` is not the right place to configure a deadline, because _all_ jobs need a configurable deadline, not just those with schedules. This change moves `schedule.deadline` to `deadlineSeconds`.
Closes CORE-145
## Test plan
```
$ sg msp generate gatekeeper prod
$ git diff
```
```diff
diff --git a/services/gatekeeper/service.yaml b/services/gatekeeper/service.yaml
index fd6a3812..ce4b02e3 100644
--- a/services/gatekeeper/service.yaml
+++ b/services/gatekeeper/service.yaml
@@ -48,4 +48,4 @@ environments:
- "primary"
schedule:
cron: 0 * * * *
- deadline: 1800 # 30 minutes
+ deadlineSeconds: 1800 # 30 minutes
diff --git a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
index 3c2c295e..f83b32b9 100644
--- a/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
+++ b/services/gatekeeper/terraform/prod/stacks/cloudrun/cdk.tf.json
@@ -281,7 +281,7 @@
},
{
"name": "JOB_EXECUTION_DEADLINE",
- "value": "600s"
+ "value": "1800s"
}
],
"image": "us.gcr.io/sourcegraph-dev/abuse-ban-bot:${var.resolved_image_tag}",
@@ -302,7 +302,7 @@
}
],
"service_account": "${data.terraform_remote_state.cross-stack-reference-input-iam.outputs.cross-stack-output-google_service_accountiam-workload-accountemail}",
- "timeout": "300s",
+ "timeout": "1800s",
"volumes": [
],
"vpc_access": {
@@ -341,7 +341,7 @@
"uniqueId": "job_scheduler"
}
},
- "attempt_deadline": "600s",
+ "attempt_deadline": "1800s",
"depends_on": [
"google_cloud_run_v2_job_iam_member.cloudrun_scheduler_job_invoker"
],
```
## Changelog
- MSP jobs: `schedule.deadline` is deprecated, use the top-level `deadlineSeconds` instead. Configured deadlines are now correctly applied as the Cloud Run job execution timeout as well.
Minor QOL improvement: when you're in the Slack channel, chances are good that you'll want the ops docs at some point.
## Test plan
n/a
## Changelog
- MSP-provisioned alerts Slack channels now include a link to the service's generated operational docs (go/msp-ops) in the channel description.
Addresses problem noticed in https://github.com/sourcegraph/managed-services/pull/1486#issuecomment-2137887423
## Test plan
Unit tests
## Changelog
- Fixed an issue where the output of `sg msp generate` for MSP jobs with particular schedules would change throughout the week
- MSP job schedules must now be between every 15 minutes (most frequent) and every week (least frequent) - see the example below
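For example, both of the following fall within the new bounds (the `schedule.cron` shape matches existing service specs; the cron values themselves are illustrative):
```yaml
schedule:
  cron: "*/15 * * * *" # most frequent allowed: every 15 minutes
  # cron: "0 6 * * 1"  # least frequent allowed: once a week
```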
Closes CORE-121
The dependency on the generated `tfvars` file is frustrating for first-time MSP setup because it currently requires `-stable=false` to update, and it doesn't actually serve any purpose for deploy types other than `subscription` (which uses it to isolate image changes that happen via GitHub Actions). This change makes it so that we don't generate, or depend on, the dynamic `tfvars` file unless you are using `subscription`.
I've also added a rollout spec configuration, `initialImageTag`, to make the initial tag we provision environments with configurable (as some services might not publish `insiders` images) - see the docstring.
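For instance, a sketch of the new option (its exact placement within the rollout spec is assumed here - see the docstring for the source of truth):
```yaml
rollout:
  # pin the tag used when first provisioning environments, for
  # services that do not publish `insiders` images (illustrative value)
  initialImageTag: "v5.4.0"
```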
## Test plan
Inspect output of `sg msp generate -all`
Closes CORE-23 - this change removes the manual `gcloud deploy apply` step previously required to enable MSP rollouts, thanks to a recent release of the Google Terraform provider.
## Test plan
https://github.com/sourcegraph/managed-services/pull/1403
This change adds a `locations: { gcpRegion: "...", gcpLocation: "..." }` configuration to centralize all location-related options. `gcpRegion` specifies regional preferences, while `gcpLocation` specifies multi-regional preferences (for resources that support it - only BigQuery in most cases).
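For example (the values are illustrative):
```yaml
locations:
  gcpRegion: "us-central1" # regional preference
  gcpLocation: "US"        # multi-regional preference, e.g. for BigQuery
```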
Closes CORE-24 - see issue for some context.
## Test plan
```
sg msp generate -all # no diff
```
```
sg msp schema -output='../managed-services/schema/service.schema.json'
```
When using https://github.com/sourcegraph/sourcegraph/pull/62565, we override test environments that are in CLI mode, which can cause infra to be rolled out by surprise via VCS mode after the switch - this change adds an option to respect the existing run mode configuration via `-workspace-run-mode=ignore`.
Thread: https://sourcegraph.slack.com/archives/C06JENN2QBF/p1715256898022469?thread_ts=1715251558.736709&cid=C06JENN2QBF
## Test plan
```
sg msp tfc sync -all
👉 Syncing all environments for all services, including setting ALL workspaces to use run mode "vcs" (use '-workspace-run-mode=ignore' to respect the existing run mode) - are you sure? (y/N) N
❌ aborting
Projects/sourcegraph/managed-services 1 » sg msp tfc sync -all -workspace-run-mode=ignore
👉 Syncing all environments for all services - are you sure? (y/N) y
// ...
```
This change adds an explicit 'HA' toggle for Redis and Cloud SQL, as part of investigation into https://github.com/sourcegraph/managed-services/issues/311:
- `redis.highAvailability`: sets the standard HA mode. Right now, we do this by default to preserve existing behaviour - a follow-up to this PR would be to explicitly set the HA mode on production services, then make this default `false`.
- `postgreSQL.highAvailability`: enables regional mode, and also point-in-time recovery as required. This could be quite expensive - we have ~$200/mo of Cloud SQL expenses, mostly in CPU before discounts, so this is projected to roughly double that bill - but it would be the simplest and most reliable way to maintain uptime in the event of a zonal failover. See the sketch below.
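A minimal sketch of both toggles, assuming `highAvailability` nests under each resource in `resources`:
```yaml
resources:
  redis:
    highAvailability: true # standard HA mode
  postgreSQL:
    highAvailability: true # regional mode + point-in-time recovery
```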
The plan is not necessarily to make use of `postgreSQL.highAvailability` immediately, but to keep the option open, as the configuration is trivial - we could still develop a manual failover process to adhere to the requirements outlined in https://github.com/sourcegraph/managed-services/issues/311
Preparing a summary here: https://www.notion.so/sourcegraph/MSP-service-availability-655e89d164b24727803f5e5a603226d8?pvs=4
## Test plan
`sg msp generate -all` has no diff
* msp: update ops with deployment info
* add rollout section
* sentence case
* update test files
Co-authored-by: Robert Lin <robert@bobheadxi.dev>