This PR aims to craft the /internal/tenant package for use by all Sourcegraph cluster-internal services to properly scope data visibility to the correct requesting tenant.
For now, we only expose methods we know we will DEFINITELY need.
This PR also adds the required middlewares so we can start to tinker with it in implementations.
## Test plan
CI passes. We don't enforce anything for now except not passing unparseable tenant IDs, which should be fine.
This PR adds an initializer function that will be called from the svcmain package through the call to tracer.Init.
The reasoning behind this is that it's very easy to accidentally use a package that uses conf internally from a service like Cody Gateway, Appliance, Migrator, or other MSP services and just because we don't have any config source for the trace ID should not let the process stall entirely.
To make it absolutely clear that a dependency is safe to use from a conf perspective, this change indicates that well by dropping the dependency on it entirely and making it another packages concern to pass this type down.
Test plan:
All tests for Sourcegraph are still passing.
---------
Co-authored-by: Robert Lin <robert@bobheadxi.dev>
This commit has a few changes to how we do telemetry for zoekt.
- SearchOption attributes are part of span creation.
- SearchOption and Stats attributes are filtered to remove default
values. This mirrors how we log in net/trace which I find useful when
reading options.
- Add missing fields from SearchOption
- EndWithErrIfNotContext helper introduced and used for Search and List.
- List uses same pattern for telemetry as Search
- List logs actor
- List correctly logs repos count and always logs stats.crashes.
Test Plan: go test and "HONEYCOMB_LOCAL=true sg start" followed by a
search. I then visually eyeballed the log output for events.
This showed up in some profiles as being an area of ~high contention, 99% of which is coming from parsing the template string every time. This PR adds some caching to it.
## Test plan
n/a - code review
This is prep for an upgrade of our otel-collector itself - we are adding a dependency on a beta receiver in https://github.com/sourcegraph/sourcegraph/pull/59561 , and want to make sure we are up to date.
This change also gets us to the stable metrics API, which is nice.
This mirrors the change from https://github.com/sourcegraph/sourcegraph/pull/57774 but applies the addition of trace ID to:
1. translation of v2 events into `event_logs`
2. bulk-inserting events into `event_logs` via the legacy mutations and direct usages
While we will be deprecating widespread usage of the legacy events, we still need to support it somewhat for now, and going forward we will still retain the `event_logs` table as a semistructured reference for events that get recorded.
This change updates our internal tracer to enforce policy only via a `Sampler` implementation. This has the following benefits:
1. Even when a trace should not be sampled, contexts are still populated with valid spans, rather than no-op ones. This is important to make use of trace IDs for non-tracing purposes, e.g. https://github.com/sourcegraph/sourcegraph/pull/57774 and https://github.com/sourcegraph/sourcegraph/pull/58060
2. We enforce trace policies the way they were meant to be enforced in OpenTelemetry: by simply indicating that the span should not be exported.
This was not possible before because OpenTracing did not use context propagation, hence we did not have a way to use trace policy flags set in context - but thanks to @camdencheek's work removing OpenTracing entirely, we can now do this in a more idiomatic fashion.
Thanks to this, I've removed a few places that prevented trace context from being populated based on trace policy (HTTP and GraphQL middleware, and `internal/trace`). This delegates sampling decisions to the sampler, and ensures we accept valid trace context everywhere.
## Test plan
Unit tests on a TracerProvider configured with the new sampler.
Manual testing:
```
sg run jaeger otel-collector
sg start
```
Setting `observability.tracing.debug` to `true` we can see logs indicating the desired traits for non-`ShouldTrace` traces:
```
[ worker] INFO tracer tracer/logged_otel.go:63 Start {"spanName": "workerutil.dbworker.store.insights_query_runner_jobs_store.dequeue", "isRecording": false, "isSampled": false, "isValid": true}
```
With `observability.tracing.sampling` set to `none`, running a search with `&trace=1` only gives us spans from zoekt, which seems to have always been outside our rules here.
With `observability.tracing.sampling` set to `selective`, running a search with `&trace=1` gives us a full trace.
With `observability.tracing.sampling` set to `all`, Jaeger instantly gets loads of traces, and in logs we see:
```
[ worker] INFO tracer tracer/logged_otel.go:63 Start {"spanName": "workerutil.dbworker.store.exhaustive_search_worker_store.dequeue", "isRecording": true, "isSampled": true, "isValid": true}
```
---------
Co-authored-by: William Bezuidenhout <william.bezuidenhout@sourcegraph.com>
* log: remove use of description paramter in Scoped
* temporarily point to sglog branch
* bazel configure + gazelle
* remove additional use of description param
* use latest versions of zoekt,log,mountinfo
* go.mod
We're getting a huge number of mysterious JSON issues in our GraphQL routes: https://sourcegraph.slack.com/archives/C05EMJM2SLR/p1694018006616969
All we get is an `encoding/json/*json.SyntaxError (*::)` right now which is not helpful. This adds a bit of error wrapping so we can at least tell if the error came from serving the GraphQL request.
This change also:
- removes duplicate error logging in `ErrorHandler`, we have much more robust logging in a separate layer
- removes a very old `reportError` method that directly reports to Sentry or something. This is almost certainly unused now that we use `sourcegraph/log` to report errors.
## Test plan
CI
This will be my last PR related to the backend tracing work I've been
doing. This is a set of small cleanups to the `trace` package that I've
collecting as I have worked on tracing and used tracing. Following this
PR, the `trace` package is just a very lightweight wrapper around the
standard OpenTelemetry APIs. I think it's best to keep the package
around rather than using opentelemetry directly because it easy to add
convenience methods (which I would be sad to lose).
Each commit is self-contained and has a descriptive message.
If anyone wants to pick up where I'm leaving off, here are a few things
left undone:
- Convert Zoekt to use OpenTelemetry rather than OpenTracing
- Add OpenTelemetry support to other services like syntect-server
- Merge the `internal/trace` and `internal/tracer` packages
- Consider adding a type that conforms to the OpenTelemetry `Span`
interface but also writes to `x/net/trace` that can be enabled when
tracing is not available.
- Remove unrelated code from the `trace` and `tracer` package (see
[here](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@a6759b95dbd8e5e3a604f7fd452b0b85f37091d9/-/blob/internal/tracer/tracer.go?L75-83)
and
[here](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@769fbbf5008e8decc63967dbff53f26333620265/-/blob/internal/trace/buckets.go?L3-7))
- Noodle on a `Traceable` interface (one that impls `Attr()` or
`Attrs()`) so types can be easily added with `SetAttributes()`
- Experiment with sampling
- Experiment with replacing `policy.ShouldTrace` with native
opentelemetry tools
## Test plan
Tested manually that tracing still looks good locally. Will test on
other instances when it rolls out.
<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->
This removes the "title" argument to `trace.New()`. This was previously
only displayed as an attribute on the span, which is confusing since
people would often use a span name for this argument since it is called
"title." OpenTelemetry does not have the concept of a family and title
(these are from `x/net/trace`, which is no longer used), so this is
another step towards reconciling our tracing package with OpenTelemetry.
Basically, I went through all uses of `trace.New()` and either named the
span `family.title`, removed the title in the case it was an empty
string, or converted the title to a set of more structured attributes.
Now, `trace.New()` only takes a context, a span name, and an optional
set of attributes.
This removes the `LazyPrintf` method on `trace.Trace`. This method
existed because of `x/net/trace`, but now we can just use `AddEvent`
instead. It's not quite as ergonomic, but it provides more readable span
events (they're not named `LazyPrintf`) and helps discourage adding
massive payloads to a span.
This is another step towards standardization on OpenTelemetry.
Currently, if we have a raw OTel trace in the context but no active
`trace.Trace`, we `TraceFromContext` will not use that, so we can lose
some tracing information. This provides a temporary fix to make
`TraceFromContext` create a new, ephemeral `*Trace` that contains the
OTel span if it exists.
I think the real solution here is to just go all in on the OTel tooling
and make our custom trace stuff conform to the OTel interfaces. I'll
work on that in the background, but I want to get this out before 5.1.
This allows us to return `X-Trace` and `X-Trace-Span` from Cody Gateway
with the shared HTTP instrumentation helper, which will help linking up
traces in case GCP tracing doesn't work out, and also help with
debugging customer instances that talk to Cody Gateway. Outside of
Sourcegraph core we don't have access to conf so setting this to a URL
doesn't make as much sense, and there seems to be a somewhat
pre-existing standard in Jaeger to set `x-trace` to a trace ID (?), and
`X-Trace-Span` just tacks onto that with a bit more detail (for Cody
Gateway locally, `x-trace` will point to the entire trace, would be nice
to have the span ID as well).
In Sourcegraph core, this means that we now can't set the URL to
`X-Trace` as we've done before. Instead, we now set it to `X-Trace-URL`,
and send both in responses. There appears to be no references to
`X-Trace` that expects an URL, so I think we should be good here -
streaming search uses a field on its responses.
Related: #53025
## Test plan
With `?trace=1` in both Cody chat and search, I now get something like
the following:
<img width="708" alt="image"
src="https://github.com/sourcegraph/sourcegraph/assets/23356519/b56f63e5-ad3e-4aad-8108-3aaa6af6f67e">
The "View trace" still works as expected, and the link and the headers
point to the same trace.
This removes the remainder of opentracing-related code from
sourcegraph/sourcegraph. We still use an opentracing propagator so that
downstream services that use opentracing (cough cough Zoekt) will still
work (but Zoekt is weird in a lot of ways and is kind of broken already
in some cases).
I suspect that one of the reasons we're failing to get a lot of our
traces on sourcegraph.com is that we're generating very large error
strings because we're joining tons of errors together into massive
multierrors.
This updates our tracing code to truncate the error string if it's
greater than 512 runes.
The previous approach to enable race detection was too radical and
accidently led to build our binaries with the race flage enabled, which
caused issues when building images down the line.
This happened because putting a `test --something` in bazelrc also sets
it on `build` which is absolutely not what we wanted. Usually folks get
this one working by having a `--stamp` config setting that fixes this
when releasing binaries, which we don't at this stage, as we're still
learning Bazel.
Luckily, this was caught swiftly. The current approach insteads takes a
more granular approach, which makes the `go_test` rule uses our own
variant, which injects the `race = "on"` attribute, but only on
`go_test`.
## Test plan
<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->
CI, being a main-dry-run, this will cover the container building jobs,
which were the ones failing.
---------
Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
This just adds a little helper method for a very common operation when
using tracing: finish the span and set its error.
Basically, this turns a 4-liner into a 1-liner. This is nice because
tracing can add quite a bit of noise to a function. Also, it's directly
deferrable so it's less error-prone.
`trace.Tag` is only used in the `trace.New()` function, after which it
is immediately translated to opentelemetry attributes. Let's reduce the
number of types we have to deal with and just use attributes directly.
This does the minimum amount of work needed to remove the deprecated
methods on `trace.Trace`. We still rely on the `otlog.Field` type
heavily in the observation package, so to minimize the surface area of
the change, I exported the `OTLogFieldsToOTelAttrs` field so we can
start to push the opentelemetry attribute types upwards in the stack
more incrementally.
## Test plan
It compiles + existing tests.
<!-- All pull requests REQUIRE a test plan:
https://docs.sourcegraph.com/dev/background-information/testing_principles
-->
Policy transport shouldn't have tracing tucked away in it - moves the
old OpenTracing thing here up into the transport option.
I originally thought we should remove it since otelhttp (added in
https://github.com/sourcegraph/sourcegraph/pull/51847) should propagate
the same thing via the propagators we initialize, but we might want to
keep it in case anyone relies on context things it sets.
## Test plan
n/a - functionally the same
This removes nearly all the dependencies on opentracing from the search
codebase.
The search codebase used the opentracing log types heavily for logging,
tracing, and pretty printing our search jobs. And because our search
jobs all conform to a common interface, it would be difficult to
approach this incrementally. So I just ripped the band-aid off and
converted everything to use the opentelemetry types.
Now the only dependent of the opentracing log types is the observation
package.
Previously, every slow request would be logged with the first trace ID
that the middleware encounters. This change ensures we make a copy of
the base logger first before creating new loggers with additional
fields.
Detected by @michaellzc who noticed all slow requests were being logged
with the same executor request, even when they weren't :)
## Test plan
n/a
Previously we would warn after 5s. Personally on my dev machine (in
South Africa) this would fire very often since GitHub isn't as fast for
me. Additionally for customers with large monorepos I expect this log is
constantly spamming. I think a better approach is to instead put an
extremely high value which if exceeded truly indicates a warning.
Test Plan: go test
A repo label is set on httptrace metrics, but it only tracks a small set of hardcoded repositories that have not been changed for at least 4 years. There do not appear to be any references to this label in any dashboard either.
origin is another label untouched for 4+ years, and is always set to 'unknown' in s2. Tracking references also reveals two metrics that have no references in dashboards.
policy.ShouldTrace does not guarantee that span has been started within a context, just that one should be created if anyone asks. opentracing.SpanFromContext documents that it can return nil if no span is started, which may cause a panic when used. To mitigate this, this we update the docstring of policy.ShouldTrace and make sure usages of opentracing.SpanFromContext always check for the nil-ness of the span. We do this even if a trace is started right before, just in case.
The OpenTelemetry equivalent, trace.SpanFromContext, returns a zero-value no-op span instead (though there are no usages of this function yet).