This PR adds an initializer function that will be called from the svcmain package through the call to tracer.Init.
The reasoning behind this is that it's very easy to accidentally use a package that depends on conf internally from a service like Cody Gateway, Appliance, Migrator, or other MSP services, and the mere lack of a config source for the trace ID should not stall the process entirely.
To make it absolutely clear that a dependency is safe to use from a conf perspective, this change drops the dependency on conf entirely and makes it another package's concern to pass this type down.
Test plan:
All tests for Sourcegraph are still passing.
---------
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
Previously, some of these validators would not fire correctly, because some packages aren't imported from cmd/frontend.
The package that contributes validators MUST be imported into frontend, which is not guaranteed. By colocating the GetWarnings func and the validators in one package, this is effectively enforced.
There are a few left that were a little harder to migrate, but this feels like a useful first step (and the end of my time window for this rn 😬)
Test plan:
Moved code around, my instance can still start fine and doesn't report new errors.
In dotcom, the OTEL SDK reports:
```
traces export: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8
```
This isn't super helpful for narrowing down where these invalid attributes are coming from.
Luckily, we have middleware in `grpc/internalerrs` that provides slightly more detailed diagnostics, including detailed logging for this exact problem - this change hooks that middleware into the OTEL SDK to help us pin down the source of the issue.
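For illustration, here's a minimal sketch of the hook point, assuming the standard OTLP gRPC exporter options; the real change wires in the `grpc/internalerrs` middleware rather than the stand-in interceptor shown here:
```
// Sketch: attach a diagnostic client interceptor to the OTLP trace exporter's
// gRPC connection, so export failures surface more than the bare status string.
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/status"
)

func diagnosticInterceptor(
	ctx context.Context, method string, req, reply any,
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption,
) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if err != nil {
		// Surface details that "rpc error: ... invalid UTF-8" alone hides.
		if s, ok := status.FromError(err); ok {
			log.Printf("otlp export %s failed: code=%s msg=%q", method, s.Code(), s.Message())
		}
	}
	return err
}

func newTraceExporter(ctx context.Context) (*otlptrace.Exporter, error) {
	return otlptracegrpc.New(ctx,
		otlptracegrpc.WithDialOption(grpc.WithChainUnaryInterceptor(diagnosticInterceptor)),
	)
}
```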
## Test plan
```
sg start
```
Tried a `trace=1` search, nothing bad happened. This middleware is already used extensively elsewhere, and we have an escape hatch via `SRC_GRPC_INTERNAL_ERROR_LOGGING_ENABLED`. Confirmed there's no logspam from failed export due to lack of an active collector in local dev.
Adds OTEL Go SDK logging enablement as part of the `tracing.debug` option. In particular, this is the only way to surface the spans dropped by the batch span processor in the SDK: 1d1ecbc5f9/sdk/trace/batch_span_processor.go (L287C1-L288C1)
To make sure this is usable, I've also removed the use of the sync span exporter on debug mode. This is rarely used and shouldn't be used in production, and knobs are available for making tracing "faster" (if needed) in local dev via `OTEL_BSP_*` env vars instead.
Part of https://sourcegraph.slack.com/archives/C04MYFW01NV/p1706107280411819
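As a rough sketch of the mechanism (assuming the OTel global logger API; the actual wiring goes through our `tracer` package and sourcegraph/log), enabling the SDK's internal diagnostics looks roughly like this:
```
// Sketch: the SDK's internal diagnostics (including batch-span-processor drop
// messages) are emitted through the logger registered via otel.SetLogger.
// The stdr logger here is just a stand-in for the real sink.
package main

import (
	"log"
	"os"

	"github.com/go-logr/stdr"
	"go.opentelemetry.io/otel"
)

func enableOTELDebugLogging() {
	// Verbosity must be high enough to include debug-level SDK messages;
	// 8 is what recent SDK versions use for debug (an assumption to verify).
	stdr.SetVerbosity(8)
	otel.SetLogger(stdr.New(log.New(os.Stderr, "otel: ", log.LstdFlags)))
}
```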
## Test plan
```
sg run otel-collector jaeger
sg start
```
Logs are available under `tracer.otel` scope:
```
[ embeddings] INFO tracer.otel global/internal_logging.go:52 "level"=4 "msg"="Tracer created" "name"="github.com/XSAM/otelsql" "version"="0.27.0" "schemaURL"=""
```
When debug mode is disabled, these logs don't show up
This is prep for an upgrade of our otel-collector itself - we are adding a dependency on a beta receiver in https://github.com/sourcegraph/sourcegraph/pull/59561 , and want to make sure we are up to date.
This change also gets us to the stable metrics API, which is nice.
We see the OpenTelemetry debug logs emit a lot of output in Cloud - it's not all that important, and we already register an error handler, so this sets the default OpenTelemetry output to discard.
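A minimal sketch of what this amounts to, assuming the otel global APIs:
```
// Sketch: since we already register our own error handler, the default
// internal logger can be pointed at a discard sink so the SDK's verbose
// diagnostics stay quiet unless debug mode re-enables them.
package main

import (
	"github.com/go-logr/logr"
	"go.opentelemetry.io/otel"
)

func quietOTELByDefault(handleErr func(error)) {
	otel.SetLogger(logr.Discard())                          // drop internal debug/info output
	otel.SetErrorHandler(otel.ErrorHandlerFunc(handleErr)) // keep errors going through our handler
}
```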
This change updates our internal tracer to enforce policy only via a `Sampler` implementation. This has the following benefits:
1. Even when a trace should not be sampled, contexts are still populated with valid spans, rather than no-op ones. This is important to make use of trace IDs for non-tracing purposes, e.g. https://github.com/sourcegraph/sourcegraph/pull/57774 and https://github.com/sourcegraph/sourcegraph/pull/58060
2. We enforce trace policies the way they were meant to be enforced in OpenTelemetry: by simply indicating that the span should not be exported.
This was not possible before because OpenTracing did not use context propagation, hence we did not have a way to use trace policy flags set in context - but thanks to @camdencheek's work removing OpenTracing entirely, we can now do this in a more idiomatic fashion.
Thanks to this, I've removed a few places that prevented trace context from being populated based on trace policy (HTTP and GraphQL middleware, and `internal/trace`). This delegates sampling decisions to the sampler, and ensures we accept valid trace context everywhere.
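For reference, a minimal sketch of the sampler-based approach, not the exact implementation - `shouldTrace` is a placeholder for our trace-policy check:
```
// Sketch: spans are always created with valid contexts; the sampler only
// decides whether they get recorded and exported.
package main

import (
	"context"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	"go.opentelemetry.io/otel/trace"
)

type policySampler struct {
	shouldTrace func(context.Context) bool // placeholder policy check
}

func (s policySampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	psc := trace.SpanContextFromContext(p.ParentContext)
	decision := sdktrace.Drop // span context stays valid, but nothing is recorded or exported
	if s.shouldTrace(p.ParentContext) {
		decision = sdktrace.RecordAndSample
	}
	return sdktrace.SamplingResult{Decision: decision, Tracestate: psc.TraceState()}
}

func (policySampler) Description() string { return "policySampler" }

func newTracerProvider(shouldTrace func(context.Context) bool) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(sdktrace.WithSampler(policySampler{shouldTrace: shouldTrace}))
}
```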
## Test plan
Unit tests on a TracerProvider configured with the new sampler.
Manual testing:
```
sg run jaeger otel-collector
sg start
```
Setting `observability.tracing.debug` to `true`, we can see logs indicating the desired traits for non-`ShouldTrace` traces:
```
[ worker] INFO tracer tracer/logged_otel.go:63 Start {"spanName": "workerutil.dbworker.store.insights_query_runner_jobs_store.dequeue", "isRecording": false, "isSampled": false, "isValid": true}
```
With `observability.tracing.sampling` set to `none`, running a search with `&trace=1` only gives us spans from zoekt, which seems to have always been outside our rules here.
With `observability.tracing.sampling` set to `selective`, running a search with `&trace=1` gives us a full trace.
With `observability.tracing.sampling` set to `all`, Jaeger instantly gets loads of traces, and in logs we see:
```
[ worker] INFO tracer tracer/logged_otel.go:63 Start {"spanName": "workerutil.dbworker.store.exhaustive_search_worker_store.dequeue", "isRecording": true, "isSampled": true, "isValid": true}
```
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
* log: remove use of description parameter in Scoped
* temporarily point to sglog branch
* bazel configure + gazelle
* remove additional use of description param
* use latest versions of zoekt,log,mountinfo
* go.mod
Our first usage of [the recently stabilized OpenTelemetry
metrics](https://opentelemetry.io/docs/specs/otel/metrics/) 😁 Currently
this is Cody-Gateway-specific, nothing is added for Sourcegraph as a
whole.
We add the following:
- If a GCP project is configured, we set up a GCP exporter that pushes
metrics periodically and on shutdown. It's important this is push-based
as Cloud Run instances are ephemeral.
- Otherwise, we set up a Prometheus exporter that works the same as
using the Prometheus SDK, where metrics are exported in `/metrics` (set
up by debugserver) and Prometheus scrapes periodically.
To start off I've added a simple gauge that records concurrent ongoing
requests to upstream Cody Gateway services - see test plan below.
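As a rough sketch of the Prometheus half plus the gauge (assuming an UpDownCounter backs the concurrent-requests metric; names here are illustrative, not the actual ones):
```
// Sketch: pull-based metrics via the OTel Prometheus exporter, plus an
// up/down counter approximating a "concurrent requests" gauge.
package main

import (
	"context"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	otelprom "go.opentelemetry.io/otel/exporters/prometheus"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func setupMetrics() error {
	// Prometheus exporter: registers against the default registry and is
	// scraped via the /metrics endpoint that debugserver already exposes.
	exporter, err := otelprom.New()
	if err != nil {
		return err
	}
	otel.SetMeterProvider(sdkmetric.NewMeterProvider(sdkmetric.WithReader(exporter)))

	// Serve /metrics (debugserver does this for us in the real setup).
	http.Handle("/metrics", promhttp.Handler())
	return nil
}

// trackConcurrent increments an up/down counter for the duration of an
// upstream request; call the returned func when the request finishes.
func trackConcurrent(ctx context.Context) (done func()) {
	meter := otel.Meter("codygateway") // illustrative instrumentation scope
	counter, _ := meter.Int64UpDownCounter("codygateway.upstream.concurrent_requests")
	counter.Add(ctx, 1)
	return func() { counter.Add(ctx, -1) }
}
```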
Closes https://github.com/sourcegraph/sourcegraph/issues/53775
## Test plan
I've only tested the Prometheus exporter. Hopefully the GCP one will
"just work" - the configuration is very similar to the one used in the
tracing equivalent, and that one "just worked".
```
sg start dotcom
sg run prometheus
```
See target picked up:
<img width="1145" alt="Screenshot 2023-07-19 at 7 09 31 PM"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/c9aa4c06-c817-400e-9086-c8ed6997844e">
Talk to Cody aggressively:
<img width="1705" alt="image"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/fbda23c7-565f-4a11-ae1b-1bdd8fbceca1">
This removes the remainder of opentracing-related code from
sourcegraph/sourcegraph. We still use an opentracing propagator so that
downstream services that use opentracing (cough cough Zoekt) will still
work (but Zoekt is weird in a lot of ways and is kind of broken already
in some cases).
The previous approach to enable race detection was too radical and
accidentally led to building our binaries with the race flag enabled,
which caused issues when building images down the line.
This happened because putting a `test --something` in bazelrc also sets
it on `build` which is absolutely not what we wanted. Usually folks get
this one working by having a `--stamp` config setting that fixes this
when releasing binaries, which we don't at this stage, as we're still
learning Bazel.
Luckily, this was caught swiftly. The current approach is more granular:
it makes the `go_test` rule use our own variant, which injects the
`race = "on"` attribute, but only on `go_test`.
## Test plan
CI: being a main-dry-run, it covers the container-building jobs,
which were the ones failing.
---------
Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
The primary goal of this change is to allow services more flexibility if
they opt out of `conf` integration. It removes the public-facing
`useConfPackage` argument in favour of:
1. `svcmain.SingleServiceMain`, which assumes site configuration access
and initialization.
2. `svcmain.SingleServiceMainWithoutConf`, which does _not_ assume site
configuration access, and instead allows custom hooks for
site-config-dependent configuration via a new options struct,
`OutOfBandConfiguration`, so callers can set up custom
configuration despite the lack of `conf` package access
This allows services like LLM-proxy (#50726) and maybe even executors
(the only other service to have `useConfPackage=false` at the moment) to
set up custom configuration for e.g. Sentry sinks on their own terms.
Some other changes:
- Remove the `validateConfig` argument - this was only set to false in
one place, and it seems to belong better in the existing
`svcmain.Config` struct, where it now lives as `SkipValidate`
- Tracing package now accepts an interface with a smaller scope, rather
than all site configuration
- Add `LLM_PROXY_SENTRY_DSN` using the new configuration options
## Test plan
CI, `sg start`, `sg start app`
1. Adds local tracing option for LLM-proxy that sends things to
OpenTelemetry
- Moves exporter constructors to non-internal package for reuse
2. Adds propagation compatibility with Sourcegraph
- Moves propagator configuration to a package,
`internal/tracer/oteldefaults`, for import by LLM-proxy
3. Adds some comments to various parts of tracing internals
## Test plan
`&trace=1` and do a Cody interaction:

The new architecture of `internal/tracer`:
1. Creates a `TracerProvider` immediately. This provider is
preconfigured with service metadata, but does not have a
`SpanProcessor`.
2. Creates an OpenTracing bridge using `TracerProvider`, such that we
have an OpenTracing Tracer and an OpenTelemetry TracerProvider that
bridges the two APIs. We register both as global defaults.
3. On site config initialization, we register a new `SpanProcessor` onto
the global `TracerProvider`.
4. From this point on, site config updates:
a. Create a new `SpanProcessor` based on site config
b. Register the new `SpanProcessor`, which starts processing
c. Remove the previous `SpanProcessor`, which flushes and shuts it down
(see the sketch after this list)
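A sketch of that register/unregister swap, assuming a helper elsewhere builds the `SpanProcessor` from site config (the actual implementation lives in `internal/tracer`):
```
// Sketch: swap SpanProcessors on the global TracerProvider instead of
// swapping the provider itself.
package main

import (
	"sync"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

type processorSwitcher struct {
	mu       sync.Mutex
	provider *sdktrace.TracerProvider
	current  sdktrace.SpanProcessor
}

// update swaps in a processor built from the latest site config. Registering
// the new processor first means there is no window without processing, and
// unregistering the old one flushes and shuts it down.
func (s *processorSwitcher) update(next sdktrace.SpanProcessor) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.provider.RegisterSpanProcessor(next)
	if s.current != nil {
		s.provider.UnregisterSpanProcessor(s.current)
	}
	s.current = next
}
```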
Previously, we mimicked the old opentracing setup where we swapped the
global tracer, using a similar approach in OpenTelemetry to set up the
global TracerProvider. In OpenTelemetry, registering and unregistering
`SpanProcessor`s seems to be the correct way to do this - the swapping
approach causes configuration propagation issues such as
https://github.com/XSAM/otelsql/pull/115 (allowing us to revert back to
upstream: https://github.com/sourcegraph/sourcegraph/pull/47429) and
possibly https://github.com/sourcegraph/sourcegraph/issues/42515. This
correct approach was pointed out in
https://github.com/XSAM/otelsql/pull/115#issuecomment-1416777390, and
makes the setup easier to reason about:
- the previously swappable tracers no longer need a locked underlying
implementation - they just add logging
- we don't need to update 2 implementations: we make the update at a
lower level so that both ends of the OT/OTel bridge are updated
- debug mode is managed through a mutable atomic bool, instead of
through recreating entire tracers
## Test plan
Unit tests assert that span processor updates are propagated. Exporter
setup is functionally unchanged
---------
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
These log messages are not useful at the information level IMO, and they add a lot of log noise when running the single-Docker-container (or the future single-program) deployment methods.
* quiet some logs from info to debug
I think these are all debug-level logs and do not contain useful information outside of debugging a deep problem.
* remove mention of non-existent package
* remove erroneous fmt.Println
This appears to be a debugging aid that was accidentally left in.
Currently we rely on implementors to explicitly check ShouldTrace. This happens implicitly through internal/trace/ot, which checks ShouldTrace and returns no-op OpenTracing tracers when tracing is disabled - an approach that worked until we started moving code over to use OpenTelemetry libraries directly, where a similar guard was not implemented.
This wraps all tracers provided from the global switchable oteltrace.TracerProvider to only start spans when ShouldTrace evaluates to true, without the caller having to do anything explicitly. It also updates ShouldTrace to respect global trace-all without requiring the flag to be set in context.
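As a rough sketch of the guard (the real implementation differs; `shouldTrace` stands in for the policy check):
```
// Sketch: tracers handed out by the global provider only start real spans
// when the trace policy says to; otherwise callers get a no-op span.
package main

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

type policyGatedTracer struct {
	trace.Tracer // the real tracer from the underlying provider
	shouldTrace  func(context.Context) bool
}

func (t policyGatedTracer) Start(ctx context.Context, name string, opts ...trace.SpanStartOption) (context.Context, trace.Span) {
	if !t.shouldTrace(ctx) {
		// Hand back a no-op span so callers don't need to check anything.
		return trace.NewNoopTracerProvider().Tracer("").Start(ctx, name, opts...)
	}
	return t.Tracer.Start(ctx, name, opts...)
}
```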
Now that all our core deployment methods bundle OpenTelemetry Collector by default (#40456, #40457, #40455), we can now switch over to enabling OpenTelemetry by default. This change also:
- removes the default `traceURL` template, since the Jaeger instance it points to by default no longer exists on most deployments.
- fixes the behaviour of `observability.tracing` config, and adds testing
- refactors some things in `internal/trace` to be OpenTelemetry-first
Closes #39399
Since observability components are not run by default in sg start, we also amend the error logging of the exporter to suppress issues unless observability.tracing.debug is enabled.
Now that we export to Jaeger via OpenTelemetry instead of OpenTracing, we can remove a number of OpenTracing-specific things internally and instead operate purely on OpenTelemetry within internal/trace. OpenTracing calls elsewhere are all translated to OpenTelemetry via the OpenTracing bridge.
This allows us to start leveraging OpenTelemetry APIs more extensively - the codebase can now directly use go.opentelemetry.io/otel/trace, and internal/trace has been updated with the addition of:
- (*internal/trace.Trace).AddEvent - this is similar to LogFields, but now accepts an event name, as in the OpenTelemetry spec
- (*internal/trace.Trace).SetAttributes - this is similar to TagFields, but uses the OpenTelemetry attributes API
And the deprecation of:
- (*internal/trace.Trace).LogFields (use AddEvent instead)
- (*internal/trace.Trace).TagFields (use SetAttributes instead)
As well as the removal of some old adapters that were specific to OpenTracing. Deprecated APIs will still work, passing through to the updated APIs.
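For illustration, the new methods mirror the underlying OpenTelemetry APIs shown below; exact internal signatures may differ:
```
// Sketch using go.opentelemetry.io/otel directly; the internal/trace
// wrappers pass through to these APIs.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func example(ctx context.Context) {
	ctx, span := otel.Tracer("example").Start(ctx, "search.run")
	defer span.End()

	// SetAttributes replaces TagFields: key/value attributes on the span.
	span.SetAttributes(attribute.String("query", "repo:sourcegraph trace"))

	// AddEvent replaces LogFields: a named event with optional attributes.
	span.AddEvent("results.received", trace.WithAttributes(attribute.Int("count", 42)))
}
```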
Since we can provide a Jaeger exporter directly to the OpenTelemetry library, this allows us to provide capabilities similar to before while removing our dependency on OpenTracing/Jaeger, streamlining internal/trace considerably and allowing the codebase to freely use OpenTracing or OpenTelemetry without worrying about compatibility (e.g. without this change, OpenTelemetry instrumentation like #39300 would not export via "observability.tracing": { "type": "jaeger" })
Adds "observability.tracing": { "urlTemplate": "..." } for configuring where generated trace links point to. Preserves back-compat with existing behaviour if configured.
Adds an OpenTelemetry protocol (OTLP) adapter tunnel through the frontend (backend), /-/debug/otlp, that accepts http/json OTLP requests and sends them to either a grpc or http/json OpenTelemetry collector (depending on the configured protocol).
We build an adapter because the OpenTelemetry collector is gRPC by default in most of its guidance, while browser environments can only support the http/json protocol. This allows admins to use their preferred OpenTelemetry backend without worrying about compatibility with the http/json protocol. This is done by (ab)using OpenTelemetry collector (not the Go API) packages to partially implement an OpenTelemetry receiver on /-/debug/otlp - see the otlpadapter package for more details.
Turns out there's a Jaeger propagator we're supposed to use with the OpenTelemetry library. After enabling it and using the composite propagator on both ends of the tracer, we once again get traces linked to Zoekt's OpenTelemetry Jaeger implementation in Jaeger with OpenTelemetry enabled.
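A sketch of that propagator setup, assuming the contrib Jaeger propagator package:
```
// Sketch: a composite propagator that speaks both W3C Trace Context and the
// Jaeger header format, so spans from Jaeger-instrumented services link up.
package main

import (
	jaegerpropagator "go.opentelemetry.io/contrib/propagators/jaeger"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func setPropagators() {
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
		propagation.TraceContext{}, // W3C trace context
		propagation.Baggage{},      // W3C baggage
		jaegerpropagator.Jaeger{},  // uber-trace-id headers used by Jaeger clients
	))
}
```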
This upgrades the opentelemetry-go dependency to the new 1.8.0 release,
which includes a patch to add support for TextMap baggage in the bridge tracer.
This means we no longer need our wrapper around the bridge, so we remove it.
We extend internal/tracer to support a new tracer type, "opentelemetry", that sends everything in OpenTelemetry format using the OpenTracing bridge. See https://github.com/sourcegraph/sourcegraph/pull/37984 for more details.
This change extracts generic "should we trace" logic into a new package, `internal/trace/policy`, to disentangle it from the OpenTracing-specific `internal/trace/ot` package.
Now that we require go1.18, we can use a builtin type alias for
"interface{}": "any". I've updated all code except lib, which can be
imported by external modules. That is currently pinned at go1.16.
Additionally I had to rerun generate since rewriting generated go code
will fail CI.
```
find -name '*.go' | xargs gofmt -s -r 'interface{} -> any' -w
git checkout lib
go generate ./...
```
Test Plan: CI will exercise that the code still compiles. Otherwise this
is a noop.