Commit Graph

42 Commits

Author SHA1 Message Date
Robert Lin
5fa93155fc
telemetry-gateway: migrate to MSP runtime (#58814)
This change migrates Telemetry Gateway to use the MSP runtime library for service initialization. The runtime now handles Sentry, OpenTelemetry, and so on, and offers a simpler interface for defining services.

Because we now expose only one port (i.e. there is no separate debugserver port), I've made the local dev default `6080`, since my browser was complaining about `10080`.
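
For context, here is a minimal sketch of what a runtime-managed service can look like. The `Service` interface, the method names, and the `run` entry point are assumptions for illustration only, not the actual MSP runtime API:

```go
package main

import (
	"context"
	"log"
	"net/http"
)

// Service mimics the shape of a runtime-managed service: the runtime owns
// Sentry, OpenTelemetry, diagnostics, etc., and the service only declares
// what it serves. This interface is an assumption, not the MSP API.
type Service interface {
	Name() string
	Initialize(ctx context.Context) (http.Handler, error)
}

type telemetryGateway struct{}

func (telemetryGateway) Name() string { return "telemetry-gateway" }

func (telemetryGateway) Initialize(ctx context.Context) (http.Handler, error) {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	return mux, nil
}

// run stands in for the runtime entry point, binding everything to the one
// exposed port (6080 in local dev).
func run(s Service) {
	h, err := s.Initialize(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("starting %s", s.Name())
	log.Fatal(http.ListenAndServe(":6080", h))
}

func main() { run(telemetryGateway{}) }
```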

## Test plan

```sh
sg run telemetry-gateway
curl http://localhost:6080/-/version
# 0.0.0+dev
curl http://localhost:6080/-/healthz
# unauthorized
curl -H 'Authorization: bearer sekret' http://localhost:6080/-/healthz
# healthz: ok
```

Also visit http://localhost:6080/debug/grpcui/ and http://localhost:6080/metrics, which are expected to be enabled in local dev.

Then try with full Sourcegraph stack:

```
sg start
```

<img width="660" alt="image" src="https://github.com/sourcegraph/sourcegraph/assets/23356519/9e799c58-4d02-4752-9f9f-da3108ba762f">
2023-12-08 12:14:34 -08:00
Robert Lin
8700fae431
msp/runtime: add diagnostics handlers (#58762)
Adds support to the MSP runtime for health and version checking. Also splits up the `msp-example` service for better readability and registers a Prometheus metrics exporter in local dev.
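
A minimal sketch of the diagnostics endpoints this adds, as exercised in the test plan below. The `registerDiagnostics` helper and its wiring are illustrative, not the actual MSP runtime code; only the paths and responses come from the test plan:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"
)

// registerDiagnostics is an illustrative sketch, not the MSP runtime code:
// /-/version is public, /-/healthz requires a shared bearer token.
func registerDiagnostics(mux *http.ServeMux, version, secret string) {
	mux.HandleFunc("/-/version", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, version)
	})
	mux.HandleFunc("/-/healthz", func(w http.ResponseWriter, r *http.Request) {
		// Accept "Bearer" case-insensitively, matching 'bearer sekret' below.
		auth := r.Header.Get("Authorization")
		const prefix = "bearer "
		if len(auth) <= len(prefix) || !strings.EqualFold(auth[:len(prefix)], prefix) ||
			auth[len(prefix):] != secret {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		fmt.Fprint(w, "healthz: ok")
	})
}

func main() {
	mux := http.NewServeMux()
	registerDiagnostics(mux, "dev", os.Getenv("DIAGNOSTICS_SECRET"))
	log.Fatal(http.ListenAndServe(":9080", mux))
}
```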

Closes https://github.com/sourcegraph/sourcegraph/issues/58784

## Test plan

```
➜ curl localhost:9080/
Variable: 13
➜ curl localhost:9080/-/healthz
unauthorized
➜ curl -H 'Authorization: bearer sekret' localhost:9080/-/healthz
healthz: ok
➜ curl localhost:9080/-/version
dev
```

![image](https://github.com/sourcegraph/sourcegraph/assets/23356519/afdf6773-d110-4fba-9366-9bfda25e595b)
2023-12-06 10:54:09 -08:00
Robert Lin
e835a66c76
telemetrygateway: add exporter and service (#56699)
This change adds:

- telemetry export background jobs: flagged behind `TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR`, which defaults to empty, i.e. disabled (a sketch of the gating follows below)
- telemetry redaction: configured in package `internal/telemetry/sensitivemetadataallowlist`
- telemetry-gateway service: receives events and forwards them to a pub/sub topic (or just logs them, as configured in local dev)
- utilities for easily creating an event recorder: `internal/telemetry/telemetryrecorder`

Notes:

- All changes are feature-flagged to some degree and off by default, so the merge should be fairly low-risk.
- We decided that transmitting the full license key continues to be the way to go: we transmit it once per stream and attach it to all events in the telemetry-gateway. There is no auth mechanism at the moment.
- The GraphQL return type `EventLog.Source` is now a plain string instead of a string enum. This should not be a breaking change in our clients, but is necessary so that our generated V2 events do not break requests for event logs.
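
A rough sketch of the env-var gating described in the first bullet above. Only the variable names come from this PR; the loop body and the `getDuration` helper are hypothetical:

```go
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	// The exporter is gated on the export address: an empty value (the
	// default) disables the background routines entirely.
	addr := os.Getenv("TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR")
	if addr == "" {
		log.Println("telemetry export disabled")
		return
	}

	// The interval is configurable; the test plan below overrides it to 10s.
	interval := getDuration("TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL", 5*time.Minute)
	for range time.Tick(interval) {
		log.Printf("exporting events to %s", addr)
		// ... batch up unexported events and send them over gRPC ...
	}
}

// getDuration is a hypothetical helper for this sketch.
func getDuration(name string, def time.Duration) time.Duration {
	if v := os.Getenv(name); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return def
}
```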

Stacked on https://github.com/sourcegraph/sourcegraph/pull/56520

Closes https://github.com/sourcegraph/sourcegraph/issues/56289
Closes https://github.com/sourcegraph/sourcegraph/issues/56287

## Test plan

Add an override to make the export super frequent:

```yaml
env:
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_INTERVAL: "10s"
  TELEMETRY_GATEWAY_EXPORTER_EXPORTED_EVENTS_RETENTION: "5m"
```

Start sourcegraph:

```
sg start
```

Enable the `telemetry-export` feature flag (from https://github.com/sourcegraph/sourcegraph/pull/56520)

Emit some events in GraphQL:

```gql
mutation {
  telemetry {
    recordEvents(events:[{
      feature:"foobar"
      action:"view"
      source:{
        client:"WEB"
      }
      parameters:{
        version:0
      }
    }]) {
      alwaysNil
    }
  }
}
```

See the resulting series of log events:

```
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/telemetrygatewayexporter.go:61 Telemetry Gateway export enabled - initializing background routines
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:99 exporting events {"maxBatchSize": 10000, "count": 1}
[telemetry-g...y] INFO telemetry-gateway.pubsub pubsub/topic.go:115 Publish {"TraceId": "7852903434f0d2f647d397ee83b4009b", "SpanId": "8d945234bccf319b", "message": "{\"event\":{\"id\":\"dc96ae84-4ac4-4760-968f-0a0307b8bb3d\",\"timestamp\":\"2023-09-19T01:57:13.590266Z\",\"feature\":\"foobar\", ....
```
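
The `Publish` log line above is the forwarding side at work. A sketch of that step with the Google Cloud Pub/Sub client; the project and topic IDs here are illustrative:

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
)

// The telemetry-gateway publishes each received event as JSON to a
// Pub/Sub topic (in local dev it just logs instead). Project and topic
// IDs below are hypothetical.
func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-project") // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	topic := client.Topic("telemetry-gateway-events") // hypothetical topic
	res := topic.Publish(ctx, &pubsub.Message{
		Data: []byte(`{"feature":"foobar","action":"view"}`),
	})
	if _, err := res.Get(ctx); err != nil {
		log.Fatal(err)
	}
}
```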

Build:

```
export VERSION="insiders"
bazel run //cmd/telemetry-gateway:candidate_push --config darwin-docker --stamp --workspace_status_command=./dev/bazel_stamp_vars.sh -- --tag $VERSION --repository us.gcr.io/sourcegraph-dev/telemetry-gateway
```

Deploy: https://github.com/sourcegraph/managed-services/pull/7

Add override:

```yaml
env:
  # Port required. TODO: What's the best way to provide gRPC addresses, such that a
  # localhost address is also possible?
  TELEMETRY_GATEWAY_EXPORTER_EXPORT_ADDR: "https://telemetry-gateway.sgdev.org:443"
```
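
One possible answer to the TODO in the override above, sketched with grpc-go: derive the transport credentials from the URL scheme, so that both a TLS endpoint and a plaintext localhost address work. This is illustrative, not the code in this PR:

```go
package main

import (
	"crypto/tls"
	"net/url"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/credentials/insecure"
)

// dialExportAddr accepts a URL like "https://telemetry-gateway.sgdev.org:443"
// or "http://127.0.0.1:6080" and picks TLS or plaintext credentials from the
// scheme. Sketch only.
func dialExportAddr(addr string) (*grpc.ClientConn, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return nil, err
	}
	creds := insecure.NewCredentials() // plaintext, e.g. local dev
	if u.Scheme == "https" {
		creds = credentials.NewTLS(&tls.Config{})
	}
	return grpc.Dial(u.Host, grpc.WithTransportCredentials(creds))
}
```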

Repeat the above (`sg start` and emit some events):

```
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:94 exporting events {"maxBatchSize": 10000, "count": 6}
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:113 events exported {"maxBatchSize": 10000, "succeeded": 6}
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:94 exporting events {"maxBatchSize": 10000, "count": 1}
[         worker] INFO worker.telemetrygateway-exporter telemetrygatewayexporter/exporter.go:113 events exported {"maxBatchSize": 10000, "succeeded": 1}
```
2023-09-20 05:20:15 +00:00
Erik Seliger
711ee1a495
Remove GitHub proxy service (#56485)
This service is being replaced by a redsync.Mutex that lives directly in the GitHub client (a sketch of the pattern follows below).
With this change we:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service for talking to GitHub.com
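
A sketch of that replacement pattern, assuming the usual redsync v4 setup: a Redis-backed distributed mutex serializing access to GitHub.com across replicas. The mutex name and Redis address are illustrative:

```go
package main

import (
	"log"

	"github.com/go-redsync/redsync/v4"
	goredis "github.com/go-redsync/redsync/v4/redis/goredis/v9"
	redislib "github.com/redis/go-redis/v9"
)

func main() {
	client := redislib.NewClient(&redislib.Options{Addr: "127.0.0.1:6379"})
	rs := redsync.New(goredis.NewPool(client))

	// Illustrative mutex name; held while talking to GitHub.com so that
	// replicas coordinate without a dedicated proxy service.
	mu := rs.NewMutex("github.com:api")
	if err := mu.Lock(); err != nil {
		log.Fatal(err)
	}
	defer mu.Unlock()

	// ... perform the GitHub.com request while holding the lock ...
}
```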

Other repositories referencing this service will be updated once this has shipped to dotcom and proven to work over the course of a couple of days.
2023-09-14 19:43:40 +02:00
Robert Lin
294fe3df22
cody-gateway: push GCP metrics or publish Prometheus metrics via OpenTelemetry (#55134)
Our first usage of [the recently stabilized OpenTelemetry
metrics](https://opentelemetry.io/docs/specs/otel/metrics/) 😁 Currently
this is Cody-Gateway-specific; nothing is added for Sourcegraph as a
whole.

We add the following:

- If a GCP project is configured, we set up a GCP exporter that pushes
metrics periodically and on shutdown. It's important that this is
push-based, as Cloud Run instances are ephemeral.
- Otherwise, we set up a Prometheus exporter that works the same way as
the Prometheus SDK: metrics are exported at `/metrics` (set up by
debugserver) and Prometheus scrapes periodically.

To start off, I've added a simple gauge that records concurrent ongoing
requests to upstream Cody Gateway services - see the test plan below and
the sketch that follows.
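
A sketch of the pull-based path with the stabilized OpenTelemetry Go SDK; the meter and instrument names are illustrative, not necessarily what Cody Gateway uses:

```go
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Pull-based path: an OpenTelemetry Prometheus exporter scraped via
	// /metrics. The GCP path would swap in a push-based exporter instead,
	// since Cloud Run instances are ephemeral.
	exporter, err := prometheus.New()
	if err != nil {
		log.Fatal(err)
	}
	meter := sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)).
		Meter("cody-gateway") // meter name is illustrative

	// An up/down counter serves as the "concurrent ongoing requests" gauge;
	// the instrument name is illustrative.
	concurrent, err := meter.Int64UpDownCounter("upstream.concurrent_requests")
	if err != nil {
		log.Fatal(err)
	}

	ctx := context.Background()
	concurrent.Add(ctx, 1)  // a request to an upstream service begins
	concurrent.Add(ctx, -1) // ... and completes

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9092", nil))
}
```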

Closes https://github.com/sourcegraph/sourcegraph/issues/53775

## Test plan

I've only tested the Prometheus exporter. Hopefully the GCP one will
"just work" - the configuration is very similar to the one used in the
tracing equivalent, and that one "just worked".

```
sg start dotcom
sg run prometheus
```

See target picked up:

<img width="1145" alt="Screenshot 2023-07-19 at 7 09 31 PM"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/c9aa4c06-c817-400e-9086-c8ed6997844e">

Talk to Cody aggressively:

<img width="1705" alt="image"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/fbda23c7-565f-4a11-ae1b-1bdd8fbceca1">
2023-07-20 20:35:16 +00:00
Jean-Hadrien Chabran
58da6780d7
Switch to OCI/Wolfi based image (#52693)
This PR ships our freshly rewritten container images built with
rules_oci and Wolfi, which for now will only be used on S2.

*What is this about*

This work is the conjunction of [hardening container
images](https://github.com/orgs/sourcegraph/projects/302?pane=issue&itemId=25019223)
and fully building our container images with Bazel.

* All base images are now distroless and based on Wolfi, meaning we fully
control every little package version and will no longer be subject to,
say, Alpine maintainers dropping a Postgres version.

* Container images are now built with `rules_oci`, meaning we no longer
have Dockerfiles; images are instead created through [Bazel
rules](https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@bzl/oci_wolfi/-/blob/enterprise/cmd/gitserver/BUILD.bazel).
Don't be scared: while this will look a bit strange to you at first,
it's much saner and simpler than our Dockerfiles and their muddy
shell scripts calling each other in cascade.


:spiral_note_pad:  *Plan*:

*1/ (NOW) We merge our branch onto `main` today; here is what that
changes for you 👇*

* On `main`:
  * It will introduce a new job, _Bazel Push_, which will push those new
    images to our registries with all tags prefixed by `bazel-`.
    * These new images will be picked up by S2, and S2 only.
  * The existing jobs building and pushing Docker images will stay in
    place until we have QA'ed the new ones enough and are confident
    rolling them out on Dotcom.
  * Because we'll be building both sets of images, there will be more
    jobs running on `main`, but this should not affect the wall clock
    time.
* On all branches (so your PRs and `main`):
  * The _Bazel Test_ job will now run: Backend Integration Tests, E2E
    Tests, and CodeIntel QA.
  * This will increase the duration of your test jobs in PRs, but as we
    haven't yet removed the `sg lint` step, it should not affect the
    wall clock time of your PRs too much.
  * It will also increase your confidence in your changes, as coverage
    will be vastly increased compared to before.
* If you have ongoing branches that affect the Docker images (like
  adding a new binary, as with the recent `scip-tags`), reach out to us
  on #job-fair-bazel so we can help you port your changes. It's much,
  much simpler than before, but it will be unfamiliar to you.

* If something goes awfully wrong, we'll roll back and update this
  thread.

*2/ (EOW / early next week) Once we're confident enough with what we see
on S2, we'll roll the new images out on Dotcom.*

* After the first successful deploy and a few sanity checks, we will
  drop the old image-building jobs.
* At this point, we'll reach out to all TLs asking for their help
  exercising all features of our product to ensure we catch any
  potential breakage.



## Test plan

* We tested the new images on `scale-testing` and they worked.
* The new container-building rules come with _container tests_, which
  ensure that the produced images contain, and are configured with, what
  should be in them:
  [example](https://sourcegraph.sourcegraph.com/github.com/sourcegraph/sourcegraph@bzl/oci_wolfi/-/blob/enterprise/cmd/gitserver/image_test.yaml).

---------

Co-authored-by: Dave Try <davetry@gmail.com>
Co-authored-by: Will Dollman <will.dollman@sourcegraph.com>
2023-06-02 12:12:52 +02:00
Erik Seliger
fcefe5b372
Add embeddings to server behind env var (#50288)
This might be useful for some customers, but will definitely be useful
for us when writing an E2E pipeline for embeddings.

cc @sourcegraph/dev-experience for a quick glance at the
debugserver/prometheus part of this.

## Test plan

Will build the image locally and see if it works alright with the env
var set.
2023-04-04 16:45:50 +02:00
William Bezuidenhout
6c7389f37c
otel: add collector dashboard (#45009)
* add initial dashboard for otel

* add failed sent dashboard

* extra panels

* use sum and rate for resource queries

* review comments

* add warning alerts

* Update monitoring/definitions/otel_collector.go

* review comments

* run go generate

* Update monitoring/definitions/otel_collector.go (applied review suggestions, co-authored by Robert Lin <robert@bobheadxi.dev>)

* review comments

* review feedback; also drop two panels

* remove brackets in metrics

* update docs

* fix goimport

* gogenerate

Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
2022-12-19 13:18:51 +01:00
Erik Seliger
82443158d9
sg: Fix prometheus scraping for local dev (#43703)
This was missing github-proxy and the new dual deployment of gitserver.
2022-10-31 17:22:27 +00:00
Erik Seliger
dcbd01f545
Push executor metrics (#36969) 2022-08-03 12:08:04 +02:00
Camden Cheek
de8ae5ee28
Remove query runner (#28333)
This removes the query runner service. Follow-up work will remove all code around saved search notifications and update the GraphQL API.
2021-11-30 10:13:20 -07:00
Erik Seliger
713819a4f2
Incorporate executor-queue into frontend server (#23239)
This PR moves all the executor-queue code into the frontend service. The service no longer needs to run as a singleton, and we save one proxy layer between the executor and the queue.
2021-07-27 19:01:49 +02:00
Erik Seliger
d39d12d066
Fix running codeintel and batches executor in parallel (#22612) 2021-07-06 17:14:39 +02:00
Eric Fritz
91940a0a8d
worker: Add skeleton service (#21768) 2021-06-04 14:48:13 -05:00
Ryan Hitchman
a7c562f374 dev: monitor zoekt-indexserver-1 and zoekt-webserver-* with prometheus 2021-06-03 10:35:40 -06:00
uwedeportivo
7143f35239
uniform prometheus job label for dev env (#18152) 2021-02-10 13:46:33 -08:00
Eric Fritz
4ba6132ed2
codeintel: Increase background task observability (#16739) 2020-12-15 12:01:38 -06:00
Eric Fritz
1f46817d6d
codeintel: Remove bundle manager (#15490) 2020-11-09 10:31:26 -06:00
Eric Fritz
893e5a6af4
codeintel: Remove indexer service (#15135) 2020-11-02 16:14:31 -06:00
Eric Fritz
1ff9c72624
codeintel: Remove precise-code-intel-indexer-vm (#15123) 2020-10-29 08:09:04 -05:00
Eric Fritz
092065b79a
executor: Extract executor from precise-code-intel-executor-vm (#14883) 2020-10-26 10:13:18 -05:00
Eric Fritz
357a5430dc
executor: Extract executor-queue from precise-code-intel-indexer (#14882) 2020-10-26 09:56:58 -05:00
Eric Fritz
a08e1cce77
codeintel: VM-based indexer service (#12723) 2020-08-10 14:30:57 -05:00
Rijnard van Tonder
dd1d5dd5f3
remove replacer service (#12812) 2020-08-07 11:28:37 -07:00
Rijnard van Tonder
3a380cdd42
Revert "remove replacer (#12480)" (#12541) 2020-07-29 11:53:46 -07:00
Rijnard van Tonder
7d6cafd040
remove replacer (#12480) 2020-07-28 14:27:37 -07:00
Eric Fritz
045265de8a
codeintel: Create auto-indexer service skeleton (#10884) 2020-05-25 19:10:56 -05:00
Eric Fritz
999207eb1d
Remove precise-code-intel-api-server service (#10906) 2020-05-21 16:08:05 -05:00
Eric Fritz
22c88de606
Switch from TypeScript to Go precise-code-intel services (#10529) 2020-05-11 11:01:36 -05:00
Eric Fritz
60d7f713e4
Rename and move lsif directory (#9366) 2020-03-27 11:36:13 -05:00
Eric Fritz
2e01c2909e
LSIF: Rename lsif-server (the process) to lsif-api-server (#9259) 2020-03-26 20:08:28 -05:00
Eric Fritz
fc5d12773b
LSIF: Rename lsif-dump-manager to lsif-bundle-manager (#9258) 2020-03-26 17:06:23 -05:00
Eric Fritz
5e01a0d22b
LSIF: Rename lsif-dump-processor to lsif-worker (#9257) 2020-03-26 16:42:19 -05:00
Eric Fritz
9d46b9a8b6
LSIF: Add skeleton dump-manager (#9081) 2020-03-19 12:12:24 -05:00
Rijnard van Tonder
924488766a Start replacer service in server image (#6712)
* start replacer

* Define global debugserver port for replacer service
2019-11-20 19:16:40 +01:00
uwedeportivo
16745fa25b
monitoring: add postgres_exporter process to dev and single server (#6616)
* monitoring: add postgres_exporter process to dev and single server and postgres grafana dashboard

* prettier

* pin postgres_exporter docker image
2019-11-14 16:52:16 -08:00
uwedeportivo
36e6cf1b15
monitoring: network dashboard narrow by client and by gitserver (#5972)
* monitoring: network dashboard narrow by client and by gitserver

* rename var

* fix test

* more dashboards

* prettier

* zoekt indexserver dashboard

* gitserver dashboard

* turn on zoekt-indexer target in dev

* job label for prom targets, new grafana image
2019-10-14 10:04:04 -07:00
Eric Fritz
e4ab447c1e
LSIF: Split server and worker (#5525)
Move the long-running and CPU-bound LSIF conversion step into a separate process that consumes a work queue kept in Redis (a sketch of the queue pattern follows below).

This will allow us to scale server and worker replicas independently without worrying about resource over-commit (workers will need more RAM/CPU) and will _eventually_ allow us to scale workers without worrying about write contention in shared SQLite databases. This last step will require that only one worker attaches to a particular queue to handle such work.
2019-09-23 09:04:20 -05:00
Eric Fritz
85e89e8bb2
LSIF server metrics (#5387)
Add basic metrics to the LSIF service.
2019-09-13 16:07:36 -05:00
uwedeportivo
61776129d4
metrics: custom prometheus/grafana docker images (#5343)
* metrics: custom prometheus/grafana docker images

* transfer work to maxx

* Dockerfile config refinements

* dev launch use new prom image

* cleanup prom after goreman ctrl-c

* code review stephen

* add new grafana image

* single container use new sg prom/graf images

* npm run prettier

* docker image READMEs

* grafana tweaks (datasources provisioning)

* forgot to commit this

* dockerfile lints and code review

* go.mod

* revert back to initial versioning approach

* code review stephen
2019-09-06 22:57:51 -07:00
uwedeportivo
d8978a17be
single container: add prometheus process (#5131)
* single container: add prometheus process

* change localhost to 127.0.0.1

* use --from correctly
2019-08-08 10:12:38 -07:00
uwedeportivo
182acb6307
dev env: launch prometheus if desired (#4963)
* dev env: launch prometheus if desired

* remove network

* pair up prometheus with grafana

* declare new files in code owners

* prettier

* sh comment cleanup
2019-08-05 13:50:27 -07:00