This PR aims to craft the /internal/tenant package for use by all Sourcegraph cluster-internal services to properly scope data visibility to the correct requesting tenant.
For now, we only expose methods we know we will DEFINITELY need.
This PR also adds the required middlewares so we can start to tinker with it in implementations.
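For illustration, here is a rough sketch of the shape such a middleware could take. The header name, the `ID` type, and the function names are made up for this sketch and are not the actual package API:

```go
package tenantsketch

import (
	"context"
	"net/http"
	"strconv"
)

type contextKey int

const tenantKey contextKey = iota

// ID is a hypothetical tenant identifier type.
type ID int

// FromContext returns the tenant ID stored in ctx, if any.
func FromContext(ctx context.Context) (ID, bool) {
	id, ok := ctx.Value(tenantKey).(ID)
	return id, ok
}

// InternalHTTPMiddleware lifts a (hypothetical) tenant header into the request
// context and rejects unparseable values, mirroring the behaviour described
// above.
func InternalHTTPMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		raw := r.Header.Get("X-Sourcegraph-Tenant-ID") // assumed header name
		if raw == "" {
			next.ServeHTTP(w, r)
			return
		}
		id, err := strconv.Atoi(raw)
		if err != nil {
			http.Error(w, "invalid tenant ID", http.StatusBadRequest)
			return
		}
		ctx := context.WithValue(r.Context(), tenantKey, ID(id))
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```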
## Test plan
CI passes. We don't enforce anything yet except rejecting unparseable tenant IDs, which should be fine.
Recently, this was refactored to also allow using redispool.Store.
However, that makes it hard to tell where something is actually being
written, so instead we pass the pool instance down at instantiation.
This also gives a slightly better overview of where redispool is
actually required.
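A minimal sketch of the shape of this change; `RedisKV`, `Logger`, and `NewLogger` are placeholder names, not the real types:

```go
package redisexample

// RedisKV stands in for the redispool value type; treat it as a placeholder.
type RedisKV interface {
	Set(key string, value any) error
}

// Before (implicit): methods reached for a package-level pool such as
// redispool.Store, so readers couldn't tell where writes went.
// After (explicit): the pool is injected at instantiation.
type Logger struct {
	kv RedisKV
}

func NewLogger(kv RedisKV) *Logger {
	return &Logger{kv: kv}
}

func (l *Logger) Record(key, value string) error {
	return l.kv.Set(key, value)
}
```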
Test plan: CI passes.
This is not used anymore outside of this package, and the change is in line with what we do for the external doer factory.
Test plan:
CI passes, go compiler doesn't complain.
There was an awkward dependency between internal/conf and internal/httpcli that
caused a cyclic import. This is resolved by depending on conftypes instead, so we can properly
split those concerns.
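A sketch of the pattern only; the exact interface shape shown here is an assumption, not the concrete conftypes API:

```go
package httpclisketch

// Before: internal/httpcli imported internal/conf, which itself (indirectly)
// pulled in internal/httpcli, creating a cycle.
//
// After: internal/httpcli depends only on a small, types-only surface, and the
// caller wires in whatever satisfies it.

// SiteConfigQuerier is modelled on the kind of interface conftypes exposes;
// the method set here is illustrative.
type SiteConfigQuerier interface {
	ExternalURL() string
}

// NewExternalClientFactory only needs the narrow querier, not all of conf.
func NewExternalClientFactory(c SiteConfigQuerier) {
	_ = c.ExternalURL() // use configuration without importing internal/conf
}
```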
Test plan:
Existing tests don't break.
We had multiple CVEs reported for these dependencies. I don't think this
affected us in practice, but this is a step towards a clean scan from
trivy. I updated to the minimum versions that include the fixes.
```
go get github.com/moby/buildkit@v0.12.5
go get github.com/docker/docker@v24.0.7
go get -u github.com/crewjam/saml
```
In the case of code.gitea.io/gitea@v1.18.0/modules/hostmatcher, we
couldn't update it due to lots of issues popping up in random
transitive dependencies. However, we don't depend on the whole gitea
project, just a tiny self-contained package in it. So we vendor
it in.
Test Plan: CI and "trivy fs go.mod" reporting no issues.
* add initial aspect workflow yaml
- try reading docker config env var
- bump both timers
- bump grpc test timeout to long
- skip additional perforce test and run all tests
- bump timeouts
- more timeout bumps and skip p4 test
- bump doc:test timeout
- bump e2e_test timeout
- bump database/connections/live timeout
- tag integration tests as exclusive
* add recommended bazelrc in workflows to speed up cold builds
* disable experimental_fetch_all_coverage_outputs
* port changes from https://github.com/sourcegraph/sourcegraph/compare/aspect-trial/wb-add-initial-config...aspect-trial/wb-add-initial-config-greg
* bazel configure
* add //:postcss_config_js as data target to client/web
* remove postcss added in debug
* use node-fetch and only test codeintellify
* use testing fetch.js setup
* fix syntax in testSetup
* various fixes
- revert timeout bump on repository test
- re-enable git p4 test
- add testing from shared deps
- bazel configure
* update comments on skipped tests
* restore `is_percy_enabled` for mocha_test
* slightly increase repo cloning wait
* use process.cwd instead of __dirname
* set sizing to moderate as well for embeddings
* remove setting CI in workflows yaml
* fix sizing
* workflow yaml tweaks and bazelrc tweaks
* make bazelrc consistent with what was in workflow yaml
---------
Co-authored-by: Jean-Hadrien Chabran <jh@chabran.fr>
Co-authored-by: Greg Magolan <greg@aspect.dev>
This change allows clients to provide an interaction ID, essentially a trace but maybe simpler for some scenarios, and have that be propagated throughout Sourcegraph systems implicitly. We can automatically add this to events (telemetry V2 and Cody Gateway). The implementation closely follows the patterns we have for `internal/requestclient` and `internal/actor`
If we land this mechanism, I'll follow up with a change that automatically injects these IDs into telemetry events within a context as well. If clients send a bunch of completion requests to the backend with this header, the interaction ID will automatically be added to the completion events recorded by the backend, like [these ones](https://sourcegraph.com/search?q=context:global+repo:sourcegraph/sourcegraph+.Record(...,+%22cody.completions%22,+...)&patternType=structural&sm=1&groupBy=repo), and any future events. Similarly, if clients set this header on outbound requests when recording events in the new telemetry (go/telemetry-v2), they'll get added as well - we can also consider adding an interaction ID as an explicit argument in the `recordEvent` GraphQL mutation to take precedence over the context interaction ID.
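For reference, a sketch of the requestclient/actor-style propagation pattern described above; the header name and function names here are assumptions, not the actual API:

```go
package interactionsketch

import (
	"context"
	"net/http"
)

type contextKey int

const interactionIDKey contextKey = iota

// WithID stores an interaction ID on the context.
func WithID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, interactionIDKey, id)
}

// IDFromContext retrieves the interaction ID, if one was propagated.
func IDFromContext(ctx context.Context) string {
	id, _ := ctx.Value(interactionIDKey).(string)
	return id
}

// HTTPMiddleware lifts the client-provided header into the context so it can
// be attached to telemetry events further down the stack.
func HTTPMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := r.Header.Get("X-Sourcegraph-Interaction-ID"); id != "" { // assumed header name
			r = r.WithContext(WithID(r.Context(), id))
		}
		next.ServeHTTP(w, r)
	})
}
```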
* log: remove use of description parameter in Scoped
* temporarily point to sglog branch
* bazel configure + gazelle
* remove additional use of description param
* use latest versions of zoekt, log, mountinfo
* go.mod
This service is being replaced by a redsync.Mutex that lives directly in the GitHub client.
With this change we will:
- Simplify deployments by removing one service
- Centralize GitHub access control in the client instead of splitting it across services
- Remove the dependency on a non-HA service for talking to GitHub.com
Other repos referencing this service will be updated once this has shipped to dotcom and proven to work over the course of a couple of days.
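For context, a minimal sketch of guarding GitHub.com access with a `redsync.Mutex`, assuming a go-redis backed pool; the address and lock name here are made up:

```go
package main

import (
	"fmt"

	"github.com/go-redsync/redsync/v4"
	"github.com/go-redsync/redsync/v4/redis/goredis/v9"
	goredislib "github.com/redis/go-redis/v9"
)

func main() {
	client := goredislib.NewClient(&goredislib.Options{Addr: "localhost:6379"})
	rs := redsync.New(goredis.NewPool(client))

	// One distributed mutex shared by every replica that talks to GitHub.com.
	mu := rs.NewMutex("github-client-lock") // illustrative lock name

	if err := mu.Lock(); err != nil {
		fmt.Println("could not acquire lock:", err)
		return
	}

	// ... perform the rate-limited GitHub.com call here ...

	if ok, err := mu.Unlock(); !ok || err != nil {
		fmt.Println("could not release lock:", err)
	}
}
```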
This tries to mitigate a few issues we've been seeing in Cody Gateway,
namely that we have 2 layers of retries:
1. Cody Gateway retries upstream
2. Sourcegraph retries Cody Gateway
When we get a series of errors from upstream, e.g. if we get momentarily
rate-limited due to concurrent requests, this instantly causes request
count to explode as Cody Gateway retries, then Sourcegraph retries. Most
of these retries never go anywhere; we just end up blocking more and
more requests as they aggressively retry (our jitter algorithm picks a
delay between 0 and 200ms * 2^attempt, capped at 3s by default for external services).
This PR implements three changes to try and mitigate this problem on the
Sourcegraph end:
1. `httpcli`: When deciding whether or not to retry, we now always
attempt to respect the `retry-after` header on _all_ error responses
([`retry-after` on 503s is part of the standard anyway](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Retry-After))
when deciding whether a retry is likely to help the situation, i.e. the
advertised wait must be resolvable within
`SRC_HTTP_CLI_EXTERNAL_RETRY_AFTER_MAX_DURATION=3s` (see the sketch
after this list). If no `retry-after` is available, the old behaviour
is preserved (always retry).
2. `httpcli`: When deciding the _delay_ between retries, instead of
relying only on our jitter algorithm, we now also try to interrogate
`retry-after` to extract a duration to wait, while respecting the
configured base (`SRC_HTTP_CLI_EXTERNAL_RETRY_DELAY_BASE`) and maximum
(`SRC_HTTP_CLI_EXTERNAL_RETRY_DELAY_MAX`). This allows us to tell
Sourcegraph clients to not even bother with shorter-delay attempts and
to just wait for as long as we are allowed to; if the wait is too long,
the retry policy should indicate that. If `retry-after` is not
available, the old behaviour is preserved (exponential jitter delay).
3. Cody Gateway: we now always forward `retry-after` from upstream (right
now neither Anthropic nor OpenAI provides this, but we do it just in
case they start to), and otherwise set a default `retry-after`: for Anthropic this
is 2s, since their limit counts concurrent requests, which could be
resolved by retrying after a longer interval, and for OpenAI this is 30s for completions
and longer for embeddings, since they use a minute-based rate-limiting
window (see docstrings), so retrying quickly is unlikely to yield a
positive result.
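The sketch below shows the kind of retry-after handling described in points 1 and 2; `maxRetryAfter` stands in for `SRC_HTTP_CLI_EXTERNAL_RETRY_AFTER_MAX_DURATION`, and the helper names are not the real httpcli API:

```go
package retrysketch

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfterDuration parses a retry-after header, which may be either a
// number of seconds or an HTTP date.
func retryAfterDuration(resp *http.Response, now time.Time) (time.Duration, bool) {
	v := resp.Header.Get("Retry-After")
	if v == "" {
		return 0, false
	}
	if secs, err := strconv.Atoi(v); err == nil {
		return time.Duration(secs) * time.Second, true
	}
	if t, err := http.ParseTime(v); err == nil {
		return t.Sub(now), true
	}
	return 0, false
}

// shouldRetry only retries when the advertised wait fits inside maxRetryAfter;
// without the header it falls back to the previous always-retry behaviour.
func shouldRetry(resp *http.Response, maxRetryAfter time.Duration) bool {
	wait, ok := retryAfterDuration(resp, time.Now())
	if !ok {
		return true
	}
	return wait <= maxRetryAfter
}
```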
This should reduce the rate at which retries pile up, without
changing the defaults or default behaviours in most cases. Additionally,
we're lowering the number of times Cody Gateway attempts to reach its
upstream clients:
https://github.com/sourcegraph/infrastructure/pull/5293
## Test plan
Unit tests on httpcli helpers
This will be my last PR related to the backend tracing work I've been
doing. This is a set of small cleanups to the `trace` package that I've
been collecting as I worked on and used tracing. Following this
PR, the `trace` package is just a very lightweight wrapper around the
standard OpenTelemetry APIs. I think it's best to keep the package
around rather than using OpenTelemetry directly, because it is easy to add
convenience methods (which I would be sad to lose).
Each commit is self-contained and has a descriptive message.
If anyone wants to pick up where I'm leaving off, here are a few things
left undone:
- Convert Zoekt to use OpenTelemetry rather than OpenTracing
- Add OpenTelemetry support to other services like syntect-server
- Merge the `internal/trace` and `internal/tracer` packages
- Consider adding a type that conforms to the OpenTelemetry `Span`
interface but also writes to `x/net/trace`, which could be enabled when
tracing is not available.
- Remove unrelated code from the `trace` and `tracer` package (see
[here](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@a6759b95dbd8e5e3a604f7fd452b0b85f37091d9/-/blob/internal/tracer/tracer.go?L75-83)
and
[here](https://sourcegraph.com/github.com/sourcegraph/sourcegraph@769fbbf5008e8decc63967dbff53f26333620265/-/blob/internal/trace/buckets.go?L3-7))
- Noodle on a `Traceable` interface (one that impls `Attr()` or
`Attrs()`) so types can be easily added with `SetAttributes()`
- Experiment with sampling
- Experiment with replacing `policy.ShouldTrace` with native
opentelemetry tools
## Test plan
Tested manually that tracing still looks good locally. Will test on
other instances when it rolls out.
This removes the "title" argument to `trace.New()`. This was previously
only displayed as an attribute on the span, which is confusing because
people would often use a span name for this argument, since it is called
"title." OpenTelemetry does not have the concept of a family and title
(these are from `x/net/trace`, which is no longer used), so this is
another step towards reconciling our tracing package with OpenTelemetry.
Basically, I went through all uses of `trace.New()` and either named the
span `family.title`, removed the title in the case it was an empty
string, or converted the title to a set of more structured attributes.
Now, `trace.New()` only takes a context, a span name, and an optional
set of attributes.
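A hedged before/after sketch of the call-site change. The parameters match the description above, but the return values and the `End` method shown here are approximations, not confirmed API:

```go
package tracesketch

import (
	"context"

	"go.opentelemetry.io/otel/attribute"

	"github.com/sourcegraph/sourcegraph/internal/trace"
)

// resolveRepo shows the post-change call shape: one span name plus optional
// structured attributes instead of a separate family and title.
func resolveRepo(ctx context.Context, repoName string) {
	// Before (roughly): trace.New(ctx, "search", repoName) with a "title".
	// After: a single span name plus attributes. Return order and End() are
	// assumptions based on the PR description.
	tr, ctx := trace.New(ctx, "search.resolveRepo",
		attribute.String("repo", repoName))
	defer tr.End()

	_ = ctx // ... do the work with the traced context ...
}
```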
This removes the remainder of opentracing-related code from
sourcegraph/sourcegraph. We still use an opentracing propagator so that
downstream services that use opentracing (cough cough Zoekt) will still
work (but Zoekt is weird in a lot of ways and is kind of broken already
in some cases).
The previous approach to enabling race detection was too radical and
accidentally led to building our binaries with the race flag enabled, which
caused issues when building images down the line.
This happened because putting a `test --something` in bazelrc also sets
it on `build`, which is absolutely not what we wanted. Usually folks get
this working by having a `--stamp` config setting that fixes this
when releasing binaries, which we don't have at this stage, as we're still
learning Bazel.
Luckily, this was caught swiftly. The current change instead takes a
more granular approach: it makes the `go_test` rule use our own
variant, which injects the `race = "on"` attribute, but only on
`go_test`.
## Test plan
CI: since this is a main-dry-run, it will cover the container-building jobs,
which were the ones failing.
---------
Co-authored-by: Alex Ostrikov <alex.ostrikov@sourcegraph.com>
Closes https://github.com/sourcegraph/sourcegraph/issues/50866
Gerrit is a bit weird, so the creation and updating portions of Changes
are done through commits/pushes and therefore don't need API methods.
This probably isn't a comprehensive list of everything needed, but I
will just add whatever methods are missing in follow-up tickets.
Don't be scared, this is a lot of generated code.
## Test plan
Added unit tests to cover the new methods.
The policy transport shouldn't have tracing tucked away in it, so this moves the
old OpenTracing handling up into the transport option.
I originally thought we should remove it entirely, since otelhttp (added in
https://github.com/sourcegraph/sourcegraph/pull/51847) should propagate
the same thing via the propagators we initialize, but we might want to
keep it in case anyone relies on the context values it sets.
## Test plan
n/a - functionally the same
This removes nearly all the dependencies on opentracing from the search
codebase.
The search codebase used the opentracing log types heavily for logging,
tracing, and pretty printing our search jobs. And because our search
jobs all conform to a common interface, it would be difficult to
approach this incrementally. So I just ripped the band-aid off and
converted everything to use the opentelemetry types.
Now the only dependent of the opentracing log types is the observation
package.
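For context, the flavour of conversion involved, shown with illustrative field names and values rather than ones taken from the actual search jobs:

```go
package searchsketch

import (
	otlog "github.com/opentracing/opentracing-go/log"
	"go.opentelemetry.io/otel/attribute"
)

// fields is the old OpenTracing-style logging payload.
func fields(query string, results int) []otlog.Field {
	return []otlog.Field{
		otlog.String("query", query),
		otlog.Int("results", results),
	}
}

// attrs is the OpenTelemetry equivalent the search jobs were converted to.
func attrs(query string, results int) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.String("query", query),
		attribute.Int("results", results),
	}
}
```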
Previously, we were not propagating the parent trace ID across HTTP
requests. This didn't use to be the case -- I'm honestly not sure
when/how this changed, but it's been broken for at least a couple of
months.
This wraps internal requests with the Opentelemetry transport, which
will pass along the tracing information so that we can effectively
connect traces across multiple services together.
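A minimal sketch of wrapping an internal HTTP client with the OpenTelemetry transport so trace context is propagated; this shows the general otelhttp pattern, not the exact httpcli wiring:

```go
package internalsketch

import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func newInternalClient() *http.Client {
	return &http.Client{
		// otelhttp.NewTransport injects the configured propagators' headers
		// (e.g. W3C traceparent) into every outgoing request.
		Transport: otelhttp.NewTransport(http.DefaultTransport),
	}
}
```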
This PR adds some basic handling of the Retry-After header for 429 errors.
Right now, we retry every 429 error, regardless of whether the Retry-After
header tells us to wait another hour, in which case 20 retry attempts would not
help at all. We address that by parsing the header and checking whether we
would regain access within the next 3s. If so, we keep retrying; otherwise, we
stop.
## Test plan
Verified that with a valid retry-after header this correctly stops
retrying. Also added a test suite for all the possible cases around 429
errors.
For some reason, we had panics in production because the caching transport
wasn't unwrappable. Being unwrappable seems benign, so we implement it
for that transport.
```
panic: httpcli.ExternalTransportOpt: http.Client.Transport cannot be cast as a *http.Transport: *httpcache.Transport
goroutine 7639 [running]:
github.com/sourcegraph/sourcegraph/internal/extsvc/crates.NewClient({0xc0024c4a68, 0x15}, 0xc0002120c0)
/home/noah/Sourcegraph/sourcegraph/internal/extsvc/crates/client.go:23 +0x125
github.com/sourcegraph/sourcegraph/cmd/gitserver/shared.getVCSSyncer({0x2556538, 0xc0029bc690}, {0x256ba50, 0xc0024c1ad0}, {0x25698a0, 0xc002402438}, 0x11?, {0xc002ccc378, 0x11}, {0xc00006045e, ...}, ...)
/home/noah/Sourcegraph/sourcegraph/cmd/gitserver/shared/shared.go:518 +0x830
github.com/sourcegraph/sourcegraph/cmd/gitserver/shared.Main.func2({0x2556538?, 0xc0029bc690?}, {0xc002ccc378?, 0xc002ccc378?})
/home/noah/Sourcegraph/sourcegraph/cmd/gitserver/shared/shared.go:147 +0x89
github.com/sourcegraph/sourcegraph/cmd/gitserver/server.(*Server).doBackgroundRepoUpdate(0xc00244db00, {0xc002ccc348, 0x11}, {0x0, 0x0})
/home/noah/Sourcegraph/sourcegraph/cmd/gitserver/server/server.go:2703 +0x3e6
github.com/sourcegraph/sourcegraph/cmd/gitserver/server.(*Server).doRepoUpdate.func1.1()
/home/noah/Sourcegraph/sourcegraph/cmd/gitserver/server/server.go:2639 +0x1d2
sync.(*Once).doSlow(0x0?, 0x4a34ea?)
/nix/store/dnvng8cdwc2i6g48qxijqmch3w8l3s85-go-1.20/share/go/src/sync/once.go:74 +0xc2
sync.(*Once).Do(...)
/nix/store/dnvng8cdwc2i6g48qxijqmch3w8l3s85-go-1.20/share/go/src/sync/once.go:65
github.com/sourcegraph/sourcegraph/cmd/gitserver/server.(*Server).doRepoUpdate.func1()
/home/noah/Sourcegraph/sourcegraph/cmd/gitserver/server/server.go:2631 +0x125
created by github.com/sourcegraph/sourcegraph/cmd/gitserver/server.(*Server).doRepoUpdate
/home/noah/Sourcegraph/sourcegraph/cmd/gitserver/server/server.go:2629 +0x4b8
```
## Test plan
Ran locally, no more panic
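A sketch of the fix's shape: giving the caching transport an `Unwrap` method so helpers that dig for the underlying transport can reach it. The interface and type names here are assumptions, not the actual httpcli code:

```go
package cachesketch

import (
	"net/http"

	"github.com/gregjones/httpcache"
)

// unwrappableTransport mirrors the kind of interface ExternalTransportOpt can
// check for before giving up and panicking.
type unwrappableTransport interface {
	Unwrap() *http.RoundTripper
}

// cachingTransport wraps httpcache.Transport and exposes its inner transport.
type cachingTransport struct {
	*httpcache.Transport
}

func (t *cachingTransport) Unwrap() *http.RoundTripper {
	return &t.Transport.Transport
}
```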
We've found that all our external requests are cached in redis,
which means (1) that we need to read the entire body into memory so we
can serialize it for redis, and (2) that we store potentially gigantic
binary files in redis, which we never need to fetch again. This
introduces an uncached version of the external doer, adds comments on
when to use which, and modifies each of the package clients to use that
doer for the Download operation.
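An illustrative sketch of the cached-vs-uncached split using a plain in-memory httpcache transport; the real code uses the httpcli factory and a redis-backed cache, and these names are not the actual API:

```go
package doersketch

import (
	"net/http"

	"github.com/gregjones/httpcache"
)

// cachedDoer is fine for small, frequently re-fetched API responses, but it
// buffers the whole body so it can be stored in the cache.
var cachedDoer = &http.Client{Transport: httpcache.NewMemoryCacheTransport()}

// uncachedDoer streams responses straight through and should be used for
// one-off downloads of potentially huge artifacts.
var uncachedDoer = &http.Client{Transport: http.DefaultTransport}

func download(url string) (*http.Response, error) {
	// Download operations use the uncached doer so large binaries never land
	// in the cache.
	return uncachedDoer.Get(url)
}
```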
## Test plan
The unit tests
---------
Co-authored-by: Noah Santschi-Cooney <noah@santschi-cooney.ch>
~Quick~ update to catch up with the latest changes and fix edge cases.
## Test plan
Bazel jobs should be green.
---------
Co-authored-by: Jason Bedard <jason@aspect.dev>
Co-authored-by: Valery Bugakov <skymk1@gmail.com>
On sourcegraph.com the label with the highest memory usage is category
(562346 bytes vs 179764 for the second-highest). We tracked it down to the
metric src_internal_request_duration_seconds_bucket. For example, we
register 3 metrics for every git repo we clone internally because we use
the request path as the category.
There is no reasonable alternative for us to use, nor is anyone relying
on this category of the metric. So we are removing it for now by just
recording "unknown".
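A hedged sketch of the shape of the fix: the histogram keeps its "category" label, but internal requests now record a constant value instead of something path-derived. The registration below is illustrative, not the real internal/metrics code:

```go
package metricssketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "src_internal_request_duration_seconds",
	Help: "Duration of internal requests.",
}, []string{"category", "code"})

func observe(d time.Duration, code string) {
	// Previously a path-derived value went here, producing new series for
	// every cloned repo; "unknown" keeps cardinality constant.
	requestDuration.WithLabelValues("unknown", code).Observe(d.Seconds())
}
```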
Test Plan: go test
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
This is something that I thought could be improved. In every case where we
set up a watch for a FIFOList, all we are doing is fetching a value.
Instead of updating the FIFOList, we can pass in a function which the
FIFOList calls when it wants to know what size to trim to. This
simplifies nearly all use cases and removes the need to have a
SetMaxSize API.
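A hedged sketch of the API shape this moves to; the constructor name and signatures are approximations of rcache's FIFOList, not the exact code:

```go
package rcachesketch

// Before: callers watched a config value and pushed updates in, roughly:
//
//	l := NewFIFOList("slow-requests", 500)
//	conf.Watch(func() { l.SetMaxSize(currentLimit()) })
//
// After: the list pulls the size when it needs it, so SetMaxSize (and the
// watch) can go away.
type FIFOList struct {
	key     string
	maxSize func() int
}

func NewFIFOListDynamic(key string, maxSize func() int) *FIFOList {
	return &FIFOList{key: key, maxSize: maxSize}
}

func (l *FIFOList) Insert(b []byte) error {
	limit := l.maxSize() // ask for the current limit at trim time
	_ = limit
	// ... LPUSH b onto l.key in redis, then LTRIM to the latest `limit` items ...
	return nil
}
```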
The biggest change here was around syncjobs which had a few mocks setup
which meant the change wasn't as clean. But all other call sites should
look nicer.
Test Plan: go test
The use of AllKeys had a hidden cost of scanning _every_ key in redis,
not just those associated with outbound requests. So this commit ports
the outbound-requests logger to use a FIFOList, the same data structure
used by our slow-request logger.
This has the additional benefit of reducing our API surface in rcache,
since this was the only user of AllKeys and the Multi* functions.
We could likely do this better, since we do things like look up all logged
requests each time we want to read. But given the value is small (500
items) and it is only done on read (not at log time), this seems like a
fine tradeoff versus doing a bigger change.
Test Plan: go test