This change adds:
- telemetry export background jobs: flagged behind `TELEMETRY_GATEWAY_EXPORter_EXPORT_ADDR`, where an empty value (the default) leaves the export disabled (see the sketch after this list)
- telemetry redaction: configured in package `internal/telemetry/sensitivemetadataallowlist`
- telemetry-gateway service receiving events and forwarding them to a pub/sub topic (or just logging them, as configured in local dev)
- utilities for easily creating an event recorder: `internal/telemetry/telemetryrecorder`
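For orientation, a minimal sketch of the env-var gate described in the first item above, using only standard-library helpers; the function name and log messages are illustrative, not the actual implementation:

```go
package main

import (
	"log"
	"os"
)

// maybeStartExporter (hypothetical) only starts the exporter background routines
// when an export address is configured; an empty value (the default) keeps them off.
func maybeStartExporter() {
	addr := os.Getenv("TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR")
	if addr == "" {
		log.Println("Telemetry Gateway export disabled")
		return
	}
	log.Println("Telemetry Gateway export enabled - initializing background routines, target:", addr)
	// ... dial addr and register the export and cleanup background jobs here ...
}
```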
Notes:
- all changes are feature-flagged to some degree and off by default, so the merge should be fairly low-risk.
- we decided that transmitting the full license key continues to be the way to go: we transmit it once per stream, and the telemetry-gateway attaches it to all events. There is no auth mechanism at the moment.
- GraphQL return type `EventLog.Source` is now a plain string instead of a string enum. This should not be a breaking change for our clients, but it is required so that our generated V2 events do not break requests for event logs.
Stacked on https://github.com/sourcegraph/sourcegraph/pull/56520
Closes https://github.com/sourcegraph/sourcegraph/issues/56289
Closes https://github.com/sourcegraph/sourcegraph/issues/56287
## Test plan
Add an override to make the export super frequent:
```yaml
env:
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "10s"
  TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
```
Start sourcegraph:
```
sg start
```
Enable the `telemetry-export` feature flag (from https://github.com/sourcegraph/sourcegraph/pull/56520)
Emit some events in GraphQL:
```gql
mutation {
  telemetry {
    recordEvents(events: [{
      feature: "foobar"
      action: "view"
      source: {
        client: "WEB"
      }
      parameters: {
        version: 0
      }
    }]) {
      alwaysNil
    }
  }
}
```
See a series of log events:
```
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/telemetrygatewayexporter.go:61 Telemetry Gateway export enabled - initializing background routines
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:99 exporting events {"maxBatchSize": 10000, "count": 1}
[telemetry-g...y] INFO telemetry-gateway.pubsub pubsub/topic.go:115 Publish {"TraceId": "7852903434f0d2f647d397ee83b4009b", "SpanId": "8d945234bccf319b", "message": "{\"event\":{\"id\":\"dc96ae84-4ac4-4760-968f-0a0307b8bb3d\",\"timestamp\":\"2023-09-19T01:57:13.590266Z\",\"feature\":\"foobar\", ....
```
Build:
```
export VERSION="insiders"
bazel run //cmd/telemetry-gateway:candidate_push --config darwin-docker --stamp --workspace_status_command=./dev/bazel_stamp_vars.sh -- --tag $VERSION --repository us.gcr.io/sourcegraph-dev/telemetry-gateway
```
Deploy: https://github.com/sourcegraph/managed-services/pull/7
Add an override:
```yaml
env:
  # Port required. TODO: What's the best way to provide gRPC addresses, such that a
  # localhost address is also possible?
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR: "https://telemetry-gateway.sgdev.org:443"
```
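On the TODO above, a hedged sketch of one way to dial such an address so that both a TLS endpoint and a plain `http://localhost:...` target work, picking credentials from the URL scheme; the helper and the scheme convention are assumptions, not the shipped behaviour:

```go
package main

import (
	"crypto/tls"
	"net/url"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"
)

// dialGateway (illustrative) accepts an https:// or http:// address and uses TLS
// for https, plaintext for http (e.g. a local dev instance of telemetry-gateway).
func dialGateway(addr string) (*grpc.ClientConn, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return nil, err
	}
	creds := grpc.WithTransportCredentials(insecure.NewCredentials())
	if strings.EqualFold(u.Scheme, "https") {
		creds = grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{}))
	}
	// gRPC wants a host:port target, so strip the scheme.
	return grpc.Dial(u.Host, creds)
}
```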
Repeat the above (`sg start` and emit some events):
```
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:94 exporting events {"maxBatchSize": 10000, "count": 6}
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:113 events exported {"maxBatchSize": 10000, "succeeded": 6}
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:94 exporting events {"maxBatchSize": 10000, "count": 1}
[ worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:113 events exported {"maxBatchSize": 10000, "succeeded": 1}
```
This service is being replaced by a `redsync.Mutex` that lives directly in the GitHub client.
With this change we:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service to talk to GitHub.com successfully
Other repos referencing this service will be updated once this has shipped to dotcom and proven to work over the course of a couple of days.
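For context, a minimal sketch of the pattern using the `go-redsync/redsync/v4` API; the lock key and wrapper function are made up for illustration and are not the actual GitHub client code:

```go
package main

import (
	"github.com/go-redsync/redsync/v4"
	"github.com/go-redsync/redsync/v4/redis/goredis/v9"
	"github.com/redis/go-redis/v9"
)

// withGitHubLock (illustrative) serializes a call to github.com behind a
// Redis-backed distributed mutex.
func withGitHubLock(do func() error) error {
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	rs := redsync.New(goredis.NewPool(client))

	mu := rs.NewMutex("github.com:api-lock") // key name is made up
	if err := mu.Lock(); err != nil {
		return err
	}
	defer mu.Unlock()

	return do() // the actual GitHub API request happens here
}
```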
Our first usage of [the recently stabilized OpenTelemetry
metrics](https://opentelemetry.io/docs/specs/otel/metrics/) 😁 Currently
this is Cody-Gateway-specific; nothing is added for Sourcegraph as a
whole.
We add the following:
- If a GCP project is configured, we set up a GCP exporter that pushes
metrics periodically and on shutdown. It's important this is push-based
as Cloud Run instances are ephemeral.
- Otherwise, we set up a Prometheus exporter that works the same as
using the Prometheus SDK: metrics are exposed at `/metrics` (set
up by debugserver) and Prometheus scrapes them periodically (see the sketch after this list).
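Roughly, the Prometheus path looks like the sketch below, written against the public OpenTelemetry Go SDK; the GCP branch would swap in the GCP exporter behind a periodic reader instead. None of the names here are the actual Cody Gateway code.

```go
package main

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

// setupMetrics (illustrative) registers a Prometheus reader so that anything
// recorded through the OTel metrics API is served on the /metrics endpoint.
func setupMetrics() error {
	exporter, err := prometheus.New() // acts as an sdkmetric.Reader
	if err != nil {
		return err
	}
	otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))
	return nil
}
```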
To start off I've added a simple gauge that records concurrent ongoing
requests to upstream Cody Gateway services - see test plan below.
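In OTel terms the gauge is an up/down counter; a hedged sketch of what recording it looks like (instrument, scope, and function names are made up for illustration):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

var meter = otel.Meter("cody-gateway/upstream") // scope name is illustrative

// trackConcurrentRequest (illustrative) bumps an up/down counter for the duration
// of an upstream request; call the returned func once the request finishes.
func trackConcurrentRequest(ctx context.Context) (done func(), err error) {
	concurrent, err := meter.Int64UpDownCounter(
		"cody_gateway_concurrent_upstream_requests", // made-up metric name
		metric.WithDescription("Concurrent in-flight requests to upstream providers"),
	)
	if err != nil {
		return nil, err
	}
	concurrent.Add(ctx, 1)
	return func() { concurrent.Add(ctx, -1) }, nil
}
```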
Closes https://github.com/sourcegraph/sourcegraph/issues/53775
## Test plan
I've only tested the Prometheus exporter. Hopefully the GCP one will
"just work" - the configuration is very similar to the one used in the
tracing equivalent, and that one "just worked".
```
sg start dotcom
sg run prometheus
```
See target picked up:
<img width="1145" alt="Screenshot 2023-07-19 at 7 09 31 PM"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/c9aa4c06-c817-400e-9086-c8ed6997844e">
Talk to Cody aggressively:
<img width="1705" alt="image"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/fbda23c7-565f-4a11-ae1b-1bdd8fbceca1">
This might be useful for some customers, but it will definitely be useful
for us when writing an E2E pipeline for embeddings.
cc @sourcegraph/dev-experience for a quick glance at the
debugserver/prometheus part of this.
## Test plan
Will build the image locally and see if it works alright with the env
var set.
* add initial dashboard for otel
* add failed sent dashboard
* extra panels
* use sum and rate for resource queries
* review comments
* add warning alerts
* Update monitoring/definitions/otel_collector.go
* review comments
* run go generate
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* Update monitoring/definitions/otel_collector.go
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
* review comments
* review feedback; also drop two panels
* remove brackets in metrics
* update docs
* fix goimport
* gogenerate
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
This PR moves all the executor queue code into the frontend service. The service no longer needs to run as a singleton, and we save one proxy layer between the executor and the queue.
* monitoring: network dashboard narrow by client and by gitserver
* rename var
* fix test
* more dashboards
* prettier
* zoekt indexserver dashboard
* gitserver dashboard
* turn on zoekt-indexer target in dev
* job label for prom targets, new grafana image
Move the long-running and CPU-bound LSIF conversion step into a separate process that consumes a work queue kept in Redis.
This will allow us to scale server and worker replicas independently without worrying about resource over-commit (workers will need more RAM/CPU) and will _eventually_ allow us to scale workers without worrying about write contention in shared SQLite databases. This last step will require that only one worker attaches to a particular queue to handle such work.
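To illustrate the queue shape (not the actual implementation), the pattern is a Redis list the server pushes to and each worker block-pops from; a sketch with `go-redis`, with key and function names made up:

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const conversionQueue = "lsif:conversion-queue" // illustrative key name

// enqueue (server side, illustrative): push an upload ID onto the work queue.
func enqueue(ctx context.Context, rdb *redis.Client, uploadID string) error {
	return rdb.LPush(ctx, conversionQueue, uploadID).Err()
}

// workLoop (worker side, illustrative): block until work is available, then convert it.
func workLoop(ctx context.Context, rdb *redis.Client, convert func(string) error) error {
	for {
		// BRPOP returns [key, value]; a zero timeout blocks indefinitely.
		res, err := rdb.BRPop(ctx, 0*time.Second, conversionQueue).Result()
		if err != nil {
			return err
		}
		if err := convert(res[1]); err != nil {
			return err
		}
	}
}
```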
* metrics: custom prometheus/grafana docker images
* transfer work to maxx
* Dockerfile config refinements
* dev launch use new prom image
* cleanup prom after goreman ctrl-c
* code review stephen
* add new grafana image
* single container use new sg prom/graf images
* npm run prettier
* docker image READMEs
* grafana tweaks (datasources provisioning)
* forgot to commit this
* dockerfile lints and code review
* go.mod
* revert back to initial versioning approach
* code review stephen