doc/dev: reorganize flakes documentation (#30213)

Co-authored-by: Taylor Sperry <taylor.sperry@sourcegraph.com>
2026-02-06 15:51:43 +00:00 · 2022-01-27 15:06:12 -08:00 · 2022-01-27 15:06:12 -08:00 · cf6ed755a6
commit cf6ed755a6
parent 3b6547a18f
4 changed files with 92 additions and 73 deletions
--- a/.github/ISSUE_TEMPLATE/flaky_test.md
+++ b/.github/ISSUE_TEMPLATE/flaky_test.md
@@ -1,18 +0,0 @@
----
-name: Flaky Test
-about: Capture information about a flaky test that has been disabled.
-title: 'Flake: $TEST_NAME disabled'
-labels:
-  - 'testing'
-  - 'flake'
-assignees: ''
-
----
-
-- **Name of test:** <!-- Name of the test that was disabled -->
-- **Example failure:** <!-- Buildkite link to an example faiure -->
-- **PR**: <!-- Link to PR that disabled the test. E.g., #1234 >
-
-#### Additional details
-
-<!-- Notes and/or screenshot describing the problem -->
--- a/.github/ISSUE_TEMPLATE/flaky_test.yaml
+++ b/.github/ISSUE_TEMPLATE/flaky_test.yaml
@@ -0,0 +1,36 @@
+name: Flaky test
+description: Capture information about a flaky test that has been disabled.
+title: 'ci/flake: $TEST_NAME disabled'
+labels:
+- 'dx'
+- 'ci/flake'
+- 'testing'
+body:
+- type: input
+  id: test
+  attributes:
+    label: Test
+    description: Name of test or link to test that was disabled
+  validations:
+    required: true
+- type: input
+  id: example
+  attributes:
+    label: Example failure
+    description: Buildkite link to an example failure
+  validations:
+    required: true
+- type: input
+  id: pr
+  attributes:
+    label: Disabling PR
+    description: Link to PR that disables the test
+  validations:
+    required: true
+- type: textarea
+  id: details
+  attributes:
+    label: Additional details
+    description: Notes and/or screenshot describing the problem, links to log queries indicating repeat occurrences, etc.
+  validations:
+    required: false
--- a/doc/dev/background-information/continuous_integration.md
+++ b/doc/dev/background-information/continuous_integration.md
@@ -1,6 +1,6 @@
 # Continuous integration <span class="badge badge-note">SOC2/GN-105</span> <span class="badge badge-note">SOC2/GN-106</span>

-Sourcegraph uses a continuous integration and delivery tool, [Buildkite](#buildkite-pipelines), to help ensure a consistent build, test and deploy process. Software changes are systematically required to complete all steps within the continuous integration tool workflow prior to production deployment, in addition to being [peer reviewed](pull_request_reviews.md).
+Sourcegraph uses a continuous integration and delivery tool, [Buildkite](#buildkite-pipelines), to help ensure a [consistent](#pipeline-health) build, test and deploy process. Software changes are systematically required to complete all steps within the continuous integration tool workflow prior to production deployment, in addition to being [peer reviewed](pull_request_reviews.md).

 Sourcegraph also maintains a variety of tooling on [GitHub Actions](#github-actions) for continuous integration and repository maintainence purposes.

@@ -17,7 +17,13 @@ To see what checks will get run against your current branch, use [`sg`](../setup
 sg ci preview
 ```

-### Soft failures
+To learn about making changes to our Buildkite pipelines, see [Pipeline development](#pipeline-development).
+
+### Pipeline steps
+
+A complete reference of all available pipeline steps is not yet available ([#30203](https://github.com/sourcegraph/sourcegraph/issues/30203)). This section contains high-level documentation about what runs in our pipeline.
+
+#### Soft failures

 <span class="badge badge-note">SOC2/GN-106</span>

@@ -42,7 +48,7 @@ You can find all usages of soft failures [with the following queries](https://so

 All other failures are hard failures.

-### Image vulnerability scanning
+#### Image vulnerability scanning

 Our CI pipeline scans uses [Trivy](https://aquasecurity.github.io/trivy/) to scan our Docker images for security vulnerabilities.

@@ -65,27 +71,44 @@ We also run [separate vulnerability scans for our infrastructure](https://handbo

 Maintaining [Buildkite pipeline](#buildkite-pipelines) health is a critical part of ensuring we ship a stable product - changes that make it to the `main` branch may be deployed to various Sourcegraph instances, and having a reliable and predictable pipeline is crucial to ensuring bugs do not make it to production environments.

-To enable this, we want to [address flakes as they arise](#flakes) and have tooling to mitigate the impacts of pipeline instability, such as [`buildchecker`](#buildchecker).
+To enable this, we [address flakes as they arise](#flakes) and mitigate the impacts of pipeline instability with [branch locks](#branch-locks).

 > NOTE: Sourcegraph teammates should refer to the [CI incidents playbook](https://handbook.sourcegraph.com/departments/product-engineering/engineering/process/incidents/playbooks/ci#scenarios) for help managing issues with pipeline health.

+#### Branch locks
+
+> WARNING: **A red `main` build is not okay and must be fixed.** Learn more about our `main` branch policy in [Testing principles: Failures on the `main` branch](testing_principles.md#failures-on-the-main-branch).
+
+[`buildchecker`](#buildchecker) is a tool that responds to periods of consecutive build failures on the `main` branch of the Sourcegraph Buildkite pipeline. If it detects a series of failures on the `main` branch, merges to `main` will be restricted to members of the Sourcegraph team who authored the failing commits until the issue is resolved - this is referred to as a "branch lock". When a build passes on `main` again, `buildchecker` will automatically unlock the branch.
+
+**Authors of the most recent failed builds are responsible for investigating failures.** Please refer to the [Continuous integration playbook](https://handbook.sourcegraph.com/departments/product-engineering/engineering/process/incidents/playbooks/ci#build-has-failed-on-the-main-branch) for step-by-step guides on what to do in various scenarios.
+
 #### Flakes

-A flake is generally characterized as one-off or rare issues that can be resolved by retrying the failed job or task. In other words: something that sometimes fails, but if you retry it enough times, it passes, *eventually*.
+A *flake* is defined as a test or script that is unreliable or non-deterministic, i.e. it exhibits both a passing and a failing result with the same code. In other words: something that sometimes fails, but if you retry it enough times, it passes, *eventually*.

-Tests are not the only thing that are flaky - flakes can also encompass sporadic infrastructure issues and other unreliable steps.
+Tests are not the only things that can be flaky - flakes can also encompass [sporadic infrastructure issues](#flaky-infrastructure) and [unreliable steps](#flaky-steps).

 ##### Flaky tests

-Learn more about our flaky test policy in [Testing principles: Flaky tests](testing_principles.md#flaky-tests).
+> WARNING: **We do not tolerate flaky tests of any kind.** Learn more about our flaky test policy in [Testing principles: Flaky tests](testing_principles.md#flaky-tests).

-Use language specific functionality to skip a test. Create an issue and ping an owner about the skipping (normally on the PR skipping it).
+Typical reasons why a test may be flaky:
+
+- Race conditions or timing issues
+- Caching or inconsistent state between tests
+- Unreliable test infrastructure (such as CI)
+- Reliance on third-party services that are inconsistent
+
+If a flaky test is discovered, immediately use language-specific functionality to skip the test, and open a PR that disables it:

 - Go: [`testing.T.Skip`](https://pkg.go.dev/testing#hdr-Skipping)
 - Typescript: [`.skip()`](https://mochajs.org/#inclusive-tests)

 If the language or framework allows for a skip reason, include a link to the issue track re-enabling the test, or leave a docstring with a link.

+Then open an issue to investigate the flaky test (use the [flaky test issue template](https://github.com/sourcegraph/sourcegraph/issues/new/choose)), and assign it to the most likely owner.
+
 ##### Flaky steps

 If a step is flaky we need to get the build back to reliable as soon as possible. If there is not already a discussion in `#buildkite-main` create one and link what step you take. Here are the recommended approaches in order:
@@ -107,11 +130,11 @@ An example use of `Skip`:
 }
 ```

-#### `buildchecker`
+##### Flaky infrastructure

-[`buildchecker`](https://github.com/sourcegraph/sourcegraph/actions/workflows/buildchecker.yml) is a tool responding to periods of consecutive build failures on the `main` branch Sourcegraph Buildkite pipeline. If it detects a series of failures on the `main` branch, merges to `main` will be restricted to certain members of the Sourcegraph team until the issue is resolved.
+If the [build or test infrastructure itself is flaky](https://handbook.sourcegraph.com/departments/product-engineering/engineering/enablement/dev-experience#build-pipeline-support), then [open an issue with the `team/devx` label](https://github.com/sourcegraph/sourcegraph/issues/new?labels=team/devx) and notify the [Developer Experience team](https://handbook.sourcegraph.com/departments/product-engineering/engineering/enablement/dev-experience#contact).

-To learn more, refer to the [`buildchecker` source code and documentation](https://github.com/sourcegraph/sourcegraph/tree/main/dev/buildchecker).
+Also see [Buildkite infrastructure](#buildkite-infrastructure).

 ### Pipeline development

@@ -123,7 +146,7 @@ To test the rendering of the entire pipeline, you can run `env BUILDKITE_BRANCH=

 #### Pipeline operations

-Pipeline steps are defined as [`Operation`s](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/enterprise/dev/ci/internal/ci/operations/operations.go) that apply changes to the given pipeline, such as adding steps and components.
+[Pipeline steps](#pipeline-steps) are defined as [`Operation`s](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/enterprise/dev/ci/internal/ci/operations/operations.go) that apply changes to the given pipeline, such as adding steps and components.

 ```sgquery
 (:[_] *bk.Pipeline) patternType:structural repo:^github\.com/sourcegraph/sourcegraph$ file:^enterprise/dev/ci/internal/ci/operations\.go
@@ -137,11 +160,11 @@ Within an `Operation` you will typically create one or more steps on a pipeline

 Operations are then added to a pipeline from [`GeneratePipeline`](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:%5Eenterprise/dev/ci/internal/ci/pipeline%5C.go+GeneratePipeline&patternType=literal).

-For most basic PR checks, see [Creating PR checks](#creating-pr-checks) for how to create your own steps!
+For most basic PR checks, see [Developing PR checks](#developing-pr-checks) for how to create your own steps!

-For more advanced usage for specific run types, see [Run types](#run-types).
+For more advanced usage for specific run types, see [Developing run types](#developing-run-types).

-#### Creating PR checks
+#### Developing PR checks

 To create a new check that can run on pull requests on relevant files, check the [`changed.Files`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/enterprise/dev/ci/internal/ci/changed/changed.go) type to see if a relevant `affectsXyz` check already exists.

@@ -156,7 +179,7 @@ Make sure to follow the best practices outlined in docstring.

 For more advanced pipelines, see [Run types](#run-types).

-#### Run types
+#### Developing run types

 There are a variety of run types available based on branch prefixes. These generate special-purpose pipelines. For example, the `main-dry-run/` prefix is used to generate a pipeline similar to the default `main` branch. See [`RunType`](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/enterprise/dev/ci/internal/ci/runtype.go) for the various run types available, and examples for how to add more.

@@ -170,6 +193,8 @@ For simple PR checks, see [Creating PR checks](#creating-pr-checks).

 #### Buildkite infrastructure

+Also see [Flaky infrastructure](#flaky-infrastructure), [Continuous integration infrastructure](https://handbook.sourcegraph.com/departments/product-engineering/engineering/tools/infrastructure/ci), and the [Continuous integration changelog](https://handbook.sourcegraph.com/departments/product-engineering/engineering/tools/infrastructure/ci/changelog).
+
 ##### Pipeline setup

 To set up Buildkite to use the rendered pipeline, add the following step in the [pipeline settings](https://buildkite.com/sourcegraph/sourcegraph/settings):
@@ -188,6 +213,12 @@ The term _secret_ refers to authentication credentials like passwords, API keys,

 ## GitHub Actions

+### `buildchecker`
+
+[`buildchecker`](https://github.com/sourcegraph/sourcegraph/actions/workflows/buildchecker.yml), our [branch lock management tool](#branch-locks), runs in GitHub Actions - see the [workflow specification](https://github.com/sourcegraph/sourcegraph/blob/main/.github/workflows/buildchecker.yml).
+
+To learn more about `buildchecker`, refer to the [`buildchecker` source code and documentation](https://github.com/sourcegraph/sourcegraph/tree/main/dev/buildchecker).
+
 ### Third-party licenses

 We use the [`license_finder`](https://github.com/pivotal/LicenseFinder) tool to check third-party dependencies for their licenses. It runs as a [GitHub Action on pull requests](https://github.com/sourcegraph/sourcegraph/actions?query=workflow%3A%22Licenses+Check%22), which will fail if one of the following occur:
@@ -213,4 +244,3 @@ LICENSE_CHECK=true ./dev/licenses.sh
 The `./dev/licenses.sh` script will also output some `license_finder` configuration for debugging purposes - this configuration is based on the `doc/dependency_decisions.yml` file, which tracks decisions made about licenses and dependencies.

 For more details, refer to the [`license_finder` documentation](https://github.com/pivotal/LicenseFinder#usage).
-
--- a/doc/dev/background-information/testing_principles.md
+++ b/doc/dev/background-information/testing_principles.md
@@ -17,7 +17,7 @@ A good automated test suite increases the velocity of our team because it allows

 Engineers should budget an appropriate amount of time for writing tests when making iteration plans.

-## Testing code
+### Types of tests

 <span class="badge badge-note">SOC2/GN-105</span>

@@ -26,53 +26,24 @@ In order to ensure we are true to our [philosphy](#philosophy), we have various
 This includes, but is not limited to:

 - Image vulnerability scanning
-- Infrascture as code
+- Infrastructure as code static analyses
 - Unit, integration and end-to-end tests as outlined in the [testing-pyrmid](#testing-pyramid)

 Our goal is to ensure that our product and code work, and that all reasonable effort has been taken to reduce the risk of a security-related incident associated to Sourcegraph.

-Also see [continuous integration](continuous_integration.md).
+Also see [continuous integration](continuous_integration.md) and [internal infrastructure testing](https://handbook.sourcegraph.com/departments/product-engineering/engineering/tools/infrastructure/dev).
+
+## Failures on the `main` branch
+
+**A red `main` build is not okay and must be fixed.** Consecutive failed builds on the `main` branch mean that [the releasability contract is broken](https://handbook.sourcegraph.com/engineering/continuous_releasability#continuous-releasability-contract), and that we cannot confidently ship that revision to our customers nor have it deployed in the Cloud environment.

 ## Flaky tests

-A *flaky* test is defined as a test that is unreliable or non-deterministic, i.e. it exhibits both a passing and a failing result with the same code.
-
-Typical reasons why a test may be flaky:
-
-- Race conditions or timing issues
-- Caching or inconsistent state between tests
-- Unreliable test infrastructure (such as CI)
-- Reliance on third-party services that are inconsistent
-
-**We do not tolerate flaky tests of any kind.** Any engineer that sees a flaky test in [continuous integration](./continuous_integration.md) should immediately:
-
-1. Open a PR to disable the flaky test.
-1. Open an issue to re-enable the flaky test (use the [Flaky Test template](https://github.com/sourcegraph/sourcegraph/issues/new?assignees=&labels=&template=flaky_test.md&title=Flake%3A+%24TEST_NAME+disabled)), and assign it to the most likely owner, and add it to the current release milestone.
-
-If the build or test infrastructure itself is flaky, then [open an issue](https://github.com/sourcegraph/sourcegraph/issues/new?labels=team/distribution) and notify the [distribution team](https://handbook.sourcegraph.com/engineering/distribution#contact).
+**We do not tolerate flaky tests of any kind.** Any engineer that sees a flaky test in [continuous integration](./continuous_integration.md) should immediately [disable the flaky test](continuous_integration.md#flaky-tests).

 Why are flaky tests undesirable? Because these tests stop being an informative signal that the engineering team can rely on, and if we keep them around then we eventually train ourselves to ignore them and become blind to their results. This can hide real problems under the cover of flakiness.

-## Broken builds on the `main` branch
-
-A red `main` build is not okay and must be fixed. Consecutive failed builds on the `main` branch means that [the releasability contract is broken](https://handbook.sourcegraph.com/engineering/continuous_releasability#continuous-releasability-contract), and that we cannot confidently ship that revision to our customers nor have it deployed in the Cloud environment.
-
-### Process
-
-> In essence: Someone must have eyes on the build failure. Unsure about what's happening? Get help on #buildkite-main.
-
-- When a PR breaks the build, the author is responsible for investigating why, and asking for help if necessary:
-  - The failure will appear on [#buildkite-main](https://sourcegraph.slack.com/archives/C02FLQDD3TQ).
-  - If you've done ~30 mins of investigation and the cause is still unclear, ask for help!
-  - Handing the issue over to someone else (for any reason) is totally okay, but it has to happen.
-  - If there's no action being taken after a reasonable amount of time, the offending PR can be reverted by anyone blocked by it.
-- If there is reasonable suspicion of a [flake](#flaky-tests) (e.g. can't reproduce the problem locally) or if it’s clear that the cause is not related to the PR:
-  - Rebuild the job.
-  - Notify the team in charge of the concerned test or disable it.
-  - It's a CI flake? Pass ownership to the DX team.
-- If there is no immediate fix in sight (or rebuilding didn't fix it):
-  - [Mark the faulty test as skipped or revert the changes](#flaky-tests) to restore the main branch to green and avoid blocking others.
-  - if reverting won't fix because it depends on external resources, just comment out that test and open a ticket mentioning the owners.
+Other kinds of flakes include [flaky steps](continuous_integration.md#flaky-steps) and [flaky infrastructure](continuous_integration.md#flaky-infrastructure).

 ## Testing pyramid