This change adds:
- telemetry export background jobs, gated behind `TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR` (default empty, i.e. disabled; see the sketch after this list)
- telemetry redaction: configured in package `internal/telemetry/sensitivemetadataallowlist`
- telemetry-gateway service receiving events and forwarding them to a pub/sub topic (or just logging them, as configured in local dev)
- utilities for easily creating an event recorder: `internal/telemetry/telemetryrecorder`
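As a rough illustration of the env-based gating mentioned above, here is a minimal Go sketch; the `exporterAddr` helper is illustrative only, not the actual worker code:
```go
package main

import (
	"fmt"
	"os"
)

// exporterAddr returns the configured Telemetry Gateway address and whether
// the export jobs should run at all; an empty value disables them.
// (Illustrative only; the real wiring lives in the worker's exporter setup.)
func exporterAddr() (string, bool) {
	addr := os.Getenv("TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR")
	return addr, addr != ""
}

func main() {
	if addr, enabled := exporterAddr(); enabled {
		fmt.Printf("telemetry export enabled, target: %s\n", addr)
	} else {
		fmt.Println("telemetry export disabled")
	}
}
```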
Notes:
- all changes are feature-flagged to some degree and off by default, so the merge should be fairly low-risk.
- we decided that transmitting the full license key continues to be the way to go: we transmit it once per stream and attach it to all events in the telemetry-gateway. There is no auth mechanism at the moment (see the sketch after this list).
- GraphQL return type `EventLog.Source` is now a plain string instead of a string enum. This should not be a breaking change for our clients, but is required so that our generated V2 events do not break querying of event logs.
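To make the per-stream license key handling concrete, here is a hedged Go sketch of a client-side export; the `RecordEventsRequest`, `Metadata`, and `Event` types and the stream interface are hypothetical stand-ins, not the actual generated protobuf/gRPC types:
```go
// Illustrative sketch only: the types below are hypothetical stand-ins for
// the generated telemetry-gateway protobuf/gRPC types.
package telemetryexportsketch

import "context"

type Metadata struct{ LicenseKey string }

type Event struct{ Feature, Action string }

// RecordEventsRequest mimics a streaming request that carries either
// stream-level metadata (first message) or a batch of events.
type RecordEventsRequest struct {
	Metadata *Metadata
	Events   []*Event
}

// recordEventsStream is a stand-in for the client side of a gRPC stream.
type recordEventsStream interface {
	Send(*RecordEventsRequest) error
}

// exportEvents sends the license key exactly once, as the first message on
// the stream; the gateway is then expected to attach it to every event it
// forwards. There is no additional auth mechanism at the moment.
func exportEvents(ctx context.Context, stream recordEventsStream, licenseKey string, batches [][]*Event) error {
	if err := stream.Send(&RecordEventsRequest{Metadata: &Metadata{LicenseKey: licenseKey}}); err != nil {
		return err
	}
	for _, batch := range batches {
		if err := stream.Send(&RecordEventsRequest{Events: batch}); err != nil {
			return err
		}
	}
	return nil
}
```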
Stacked on https://github.com/sourcegraph/sourcegraph/pull/56520
Closes https://github.com/sourcegraph/sourcegraph/issues/56289
Closes https://github.com/sourcegraph/sourcegraph/issues/56287
## Test plan
Add an override to make the export super frequent:
```yaml
env:
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "10s"
  TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
```
Start sourcegraph:
```
sg start
```
Enable the `telemetry-export` feature flag (from https://github.com/sourcegraph/sourcegraph/pull/56520)
Emit some events in GraphQL:
```gql
mutation {
  telemetry {
    recordEvents(
      events: [
        {
          feature: "foobar"
          action: "view"
          source: { client: "WEB" }
          parameters: { version: 0 }
        }
      ]
    ) {
      alwaysNil
    }
  }
}
```
See a series of log events:
```
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/telemetrygatewayexporter.go:61 Telemetry Gateway export enabled - initializing background routines
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:99 exporting events {"maxBatchSize": 10000, "count": 1}
[telemetry-g...y] INFO telemetry-gateway.pubsub pubsub/topic.go:115 Publish {"TraceId": "7852903434f0d2f647d397ee83b4009b", "SpanId": "8d945234bccf319b", "message": "{\"event\":{\"id\":\"dc96ae84-4ac4-4760-968f-0a0307b8bb3d\",\"timestamp\":\"2023-09-19T01:57:13.590266Z\",\"feature\":\"foobar\", ....
```
Build and push the telemetry-gateway image:
```
export VERSION="insiders"
bazel run //cmd/telemetry-gateway:candidate_push --config darwin-docker --stamp --workspace_status_command=./dev/bazel_stamp_vars.sh -- --tag $VERSION --repository us.gcr.io/sourcegraph-dev/telemetry-gateway
```
Deploy: https://github.com/sourcegraph/managed-services/pull/7
Add an override pointing the exporter at the deployed telemetry-gateway:
```yaml
env:
  # Port required. TODO: What's the best way to provide gRPC addresses, such that a
  # localhost address is also possible?
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR: "https://telemetry-gateway.sgdev.org:443"
```
Repeat the above (`sg start` and emit some events):
```
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:94 exporting events {"maxBatchSize": 10000, "count": 6}
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:113 events exported {"maxBatchSize": 10000, "succeeded": 6}
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:94 exporting events {"maxBatchSize": 10000, "count": 1}
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:113 events exported {"maxBatchSize": 10000, "succeeded": 1}
```
This service is being replaced by a `redsync.Mutex` that lives directly in the GitHub client.
With this change we:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service to talk to GitHub.com successfully
Other repos referencing this service will be updated once this has shipped to dotcom and proven to work over the course of a couple of days. A hedged sketch of the new locking approach follows.
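Here is a minimal sketch of guarding GitHub.com access with a `redsync.Mutex`, using the open-source go-redsync/redsync and go-redis libraries; the lock name `github.com:api` and the `withGitHubLock` wrapper are illustrative, not the actual client code:
```go
package main

import (
	"context"
	"fmt"

	"github.com/go-redsync/redsync/v4"
	redsyncgoredis "github.com/go-redsync/redsync/v4/redis/goredis/v9"
	"github.com/redis/go-redis/v9"
)

// withGitHubLock runs fn while holding a Redis-backed distributed lock, so
// that concurrent services serialize their access to GitHub.com.
// The lock name is illustrative only.
func withGitHubLock(ctx context.Context, rs *redsync.Redsync, fn func(context.Context) error) error {
	mu := rs.NewMutex("github.com:api", redsync.WithTries(3))
	if err := mu.LockContext(ctx); err != nil {
		return fmt.Errorf("acquire GitHub lock: %w", err)
	}
	defer mu.UnlockContext(ctx)
	return fn(ctx)
}

func main() {
	client := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"})
	rs := redsync.New(redsyncgoredis.NewPool(client))

	_ = withGitHubLock(context.Background(), rs, func(ctx context.Context) error {
		// Talk to GitHub.com here.
		return nil
	})
}
```
Because the mutex lives in the GitHub client itself, every service that talks to GitHub.com serializes on the same Redis key instead of going through a separate proxy service.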
Our first usage of [the recently stabilized OpenTelemetry
metrics](https://opentelemetry.io/docs/specs/otel/metrics/) 😁 Currently
this is Cody-Gateway-specific; nothing is added for Sourcegraph as a
whole.
We add the following:
- If a GCP project is configured, we set up a GCP exporter that pushes
metrics periodically and on shutdown. It's important that this is
push-based, as Cloud Run instances are ephemeral.
- Otherwise, we set up a Prometheus exporter that works the same as
using the Prometheus SDK: metrics are exposed at `/metrics` (set up by
debugserver) and Prometheus scrapes them periodically.
To start off, I've added a simple gauge that records concurrent
in-flight requests to upstream Cody Gateway services; see the test plan
below and the sketch that follows.
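For illustration, a minimal sketch of the exporter selection and an in-flight-requests instrument using the OpenTelemetry Go metrics SDK; the `GOOGLE_CLOUD_PROJECT` check, the `codygateway` meter name, and the `upstream.concurrent_requests` instrument name are assumptions, not the actual Cody Gateway wiring:
```go
package main

import (
	"context"
	"log"
	"os"

	gcpmetric "github.com/GoogleCloudPlatform/opentelemetry-operations-go/exporter/metric"
	"go.opentelemetry.io/otel"
	otelprometheus "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func newMeterProvider() (*sdkmetric.MeterProvider, error) {
	// If a GCP project is configured, push metrics periodically (Cloud Run
	// instances are ephemeral, so pull-based scraping is not reliable).
	if project := os.Getenv("GOOGLE_CLOUD_PROJECT"); project != "" {
		exp, err := gcpmetric.New(gcpmetric.WithProjectID(project))
		if err != nil {
			return nil, err
		}
		return sdkmetric.NewMeterProvider(
			sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp)),
		), nil
	}
	// Otherwise expose metrics for Prometheus to scrape (served on /metrics).
	exp, err := otelprometheus.New()
	if err != nil {
		return nil, err
	}
	return sdkmetric.NewMeterProvider(sdkmetric.WithReader(exp)), nil
}

func main() {
	ctx := context.Background()
	mp, err := newMeterProvider()
	if err != nil {
		log.Fatal(err)
	}
	defer mp.Shutdown(ctx) // flushes pending metrics on shutdown
	otel.SetMeterProvider(mp)

	// A gauge-like instrument for concurrent in-flight upstream requests.
	meter := otel.Meter("codygateway")
	inflight, err := meter.Int64UpDownCounter("upstream.concurrent_requests")
	if err != nil {
		log.Fatal(err)
	}

	// Around each upstream request:
	inflight.Add(ctx, 1)
	// ... do the request ...
	inflight.Add(ctx, -1)
}
```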
Closes https://github.com/sourcegraph/sourcegraph/issues/53775
## Test plan
I've only tested the Prometheus exporter. Hopefully the GCP one will
"just work" - the configuration is very similar to the one used in the
tracing equivalent, and that one "just worked".
```
sg start dotcom
sg run prometheus
```
See target picked up:
<img width="1145" alt="Screenshot 2023-07-19 at 7 09 31 PM"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/c9aa4c06-c817-400e-9086-c8ed6997844e">
Talk to Cody aggressively:
<img width="1705" alt="image"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/fbda23c7-565f-4a11-ae1b-1bdd8fbceca1">
This PR ships our freshly rewritten container images built with
rules_oci and Wolfi, which for now will only be used on S2.
*What is this about*
This work is the conjunction of [hardening container
images](https://github.com/orgs/sourcegraph/projects/302?pane=issue&itemId=25019223)
and fully building our container images with Bazel.
* All base images are now distroless and based on Wolfi, meaning we
fully control every package version and are no longer subject to, for
example, Alpine maintainers dropping a Postgres version.
* Container images are now built with `rules_oci`, meaning we no longer
have Dockerfiles; images are instead created through [Bazel
rules](https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@bzl/oci_wolfi/-/blob/enterprise/cmd/gitserver/BUILD.bazel).
Don't be scared: while this will look a bit strange at first, it's much
saner and simpler than our Dockerfiles and their muddy shell scripts
calling each other in cascade.
:spiral_note_pad: *Plan*:
*1/ (NOW) We merge our branch into `main` today; here is what changes
for you 👇:*
* On `main`:
* It will introduce a new job on `main`, _Bazel Push_, which will push
these new images to our registries with all tags prefixed by `bazel-`.
* These new images will be picked up by S2 and S2 only.
* The existing jobs that build and push Docker images will stay in
place until we have QA'ed the new images enough and are confident
rolling them out on Dotcom.
* Because we'll be building both sets of images, there will be more jobs
running on `main`, but this should not affect the wall-clock time.
* On all branches (so your PRs and `main`):
* The _Bazel Test_ job will now run: Backend Integration Tests, E2E
Tests and CodeIntel QA
* This will increase the duration of your test jobs in PRs, but as we
haven't yet removed the `sg lint` step, it should not affect the
wall-clock time of your PRs too much.
* But it will also increase your confidence in your changes, as the
coverage is vastly increased compared to before.
* If you have ongoing branches that affect the Docker images (e.g.
adding a new binary, like the recent `scip-tags`), reach out to us on
#job-fair-bazel so we can help you port your changes. It's much, much
simpler than before, but it will be unfamiliar at first.
* If something goes awfully wrong, we'll roll back and update this
thread.
*2/ (EOW / early next week) Once we're confident enough with what we've
seen on S2, we'll roll the new images out on Dotcom.*
* After the first successful deploy and a few sanity checks, we will
drop the old image-building jobs.
* At this point, we'll reach out to all TLs asking for their help to
exercise all features of our product to ensure we catch any potential
breakage.
## Test plan
* We tested our new images on `scale-testing` and they worked.
* The new container-building rules come with _container tests_, which
ensure that the produced images contain and are configured with what
should be in there:
[example](https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@bzl/oci_wolfi/-/blob/enterprise/cmd/gitserver/image_test.yaml).
---------
Co-authored-by: Dave Try <davetry@gmail.com>
Co-authored-by: Will Dollman <will.dollman@sourcegraph.com>
This might be useful for some customers, but it will definitely be
useful for us when writing an E2E pipeline for embeddings.
cc @sourcegraph/dev-experience for a quick glance at the
debugserver/prometheus part of this.
## Test plan
I will build the image locally and check that it works correctly with
the env var set.
* add initial dashboard for otel
* add failed sent dashboard
* extra panels
* use sum and rate for resource queries
* review comments
* add warning alerts
* Update monitoring/definitions/otel_collector.go
* review comments
* run go generate
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* review comments
* review feedback also drop two panels
* remove brackets in metrics
* update docs
* fix goimport
* gogenerate
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
This PR moves all the executor queue code into the frontend service. The service no longer needs to run as a singleton, and we save one proxy layer between the executor and the queue.
* monitoring: network dashboard narrow by client and by gitserver
* rename var
* fix test
* more dashboards
* prettier
* zoekt indexserver dashboard
* gitserver dashboard
* turn on zoekt-indexer target in dev
* job label for prom targets, new grafana image
Move the long-running and CPU-bound LSIF conversion step into a separate process that consumes a work queue kept in Redis.
This will allow us to scale server and worker replicas independently without worrying about resource over-commit (workers will need more RAM/CPU), and will _eventually_ allow us to scale workers without worrying about write contention in shared SQLite databases. That last step will require that only one worker attaches to a particular queue to handle such work. A rough sketch of such a queue consumer follows.
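As a hedged illustration of the queue mechanics (not the actual implementation), a worker can block on a Redis list and convert one upload at a time; the `lsif:conversion-queue` key and the `convertLSIF` helper are hypothetical, and go-redis is used here only as a stand-in client:
```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// convertLSIF is a hypothetical stand-in for the long-running, CPU-bound
// conversion step.
func convertLSIF(ctx context.Context, uploadID string) error {
	// ... read the raw upload, convert it, write the result ...
	return nil
}

func main() {
	ctx := context.Background()
	client := redis.NewClient(&redis.Options{Addr: "127.0.0.1:6379"})

	for {
		// Block until a job is pushed onto the (hypothetical) queue key.
		res, err := client.BLPop(ctx, 0, "lsif:conversion-queue").Result()
		if err != nil {
			log.Printf("dequeue failed: %v", err)
			time.Sleep(time.Second) // avoid a hot loop on persistent errors
			continue
		}
		uploadID := res[1] // res[0] is the key name, res[1] the payload
		if err := convertLSIF(ctx, uploadID); err != nil {
			log.Printf("conversion of %s failed: %v", uploadID, err)
		}
	}
}
```
Having exactly one such consumer attached to a given queue is what avoids write contention on the shared SQLite databases mentioned above.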
* metrics: custom prometheus/grafana docker images
* transfer work to maxx
* Dockerfile config refinements
* dev launch use new prom image
* cleanup prom after goreman ctrl-c
* code review stephen
* add new grafana image
* single container use new sg prom/graf images
* npm run prettier
* docker image READMEs
* grafana tweaks (datasources provisioning)
* forgot to commit this
* dockerfile lints and code review
* go.mod
* revert back to initial versioning approach
* code review stephen
* dev env: launch prometheus if desired
* remove network
* pair up prometheus with grafana
* declare new files in code owners
* prettier
* sh comment cleanup