The symbols service already stores each file's detected language. Now, when the
'search-content-based-lang-detection' feature is enabled, the symbols service
filters directly on the language column.
This PR doesn't update Rockskip, since it doesn't store the detected language.
In #59331, we refactored searcher to support AND/OR patterns. While doing so,
we accidentally regressed on an optimization that avoids loading file content
when the pattern is empty.
We now have a benchmark for empty patterns, but this benchmark was added
recently and wasn't run during that refactor.
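For intuition, here is a minimal sketch of the optimization that was lost. The function name and the `loadContent` callback are illustrative, not the actual searcher code:

```go
package sketch

import "bytes"

// matchFile illustrates the idea: an empty pattern matches every file,
// so we can answer without loading file contents at all.
func matchFile(pattern string, loadContent func() ([]byte, error)) (bool, error) {
	if pattern == "" {
		// Empty pattern: skip the expensive content load entirely.
		return true, nil
	}
	content, err := loadContent()
	if err != nil {
		return false, err
	}
	return bytes.Contains(content, []byte(pattern)), nil
}
```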
Before this commit we only logged it when searching actual instances. However,
we only send the aggregator's events to Honeycomb. In practice this will lead
us to logging the time for the very first Zoekt instance to respond.
Test Plan: go test
If using nix develop we can defer to our nix environment to correctly install
bazel. In particular, on Linux nix directly installs the correct version of
bazel and we do not use bazelisk.
Test Plan: "sg start enterprise-bazel" on nixos
For tests we do not want isatty detection on os.Stdout to change what output
we capture and assert against. In particular, when running the sg tests under
emacs's compilation-mode I get failures due to extra escape codes.
This is a different take on ensuring tests do not have Isatty enabled. The
previous attempt used the passed-in writer to determine this. However, on
further inspection of the call sites, we somewhat regularly capture the output
and then only show it if something went wrong. In those cases we should
preserve the Isatty setting of stdout so the errors look prettier.
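A minimal sketch of that call-site pattern, assuming golang.org/x/term for the terminal check (the actual sg code has its own output package):

```go
package sketch

import (
	"bytes"
	"os"

	"golang.org/x/term"
)

// runCaptured captures a command's output in a buffer and only replays
// it if the command failed. Whether to emit escape codes is decided
// from stdout's isatty state, not from the buffer we pass in, so the
// replayed errors keep their pretty formatting.
func runCaptured(run func(out *bytes.Buffer, isTTY bool) error) error {
	var buf bytes.Buffer
	isTTY := term.IsTerminal(int(os.Stdout.Fd())) // stdout decides, not the writer
	if err := run(&buf, isTTY); err != nil {
		os.Stdout.Write(buf.Bytes()) // shown only when something went wrong
		return err
	}
	return nil
}
```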
Test Plan: go test -short ./...
For searches that hit Zoekt or the symbols backend, we return the file's
detected language through `result.File.PreciseLanguage`. Now, searcher will
also return the detected language whenever it's available. This value is used
in the search results filters panel, so it's important to be as accurate as
possible.
The precise language is only available when these criteria are met:
* The `search-content-based-lang-detection` feature is enabled -- we hope to
enable this by default soon
* The search contains a language filter, so we've already run language
detection
The PR also adds an optimization to run the expensive language detection
only after checking the search pattern.
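An illustrative sketch of that ordering; the names here are hypothetical, not the real searcher API:

```go
package sketch

import "strings"

type file struct {
	name    string
	content string
}

type match struct {
	path     string
	language string
}

// searchFile runs the cheap pattern check first and only invokes the
// expensive content-based language detection for files that match.
func searchFile(f file, pattern string, detect func(name, content string) string) (match, bool) {
	if pattern != "" && !strings.Contains(f.content, pattern) {
		return match{}, false // cheap rejection: no detection needed
	}
	return match{path: f.name, language: detect(f.name, f.content)}, true
}
```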
* migrate just the upgrade test tooling from the original branch
* add backfiller override to store init
* address reviewer feedback
* remove testing from PR
* fix backfill unit test to account for overrides
These query paths weren't tested, so I didn't find them initially, even though I tried to go over all search results for external_services and external_service_repos.
## Test plan
Went over the results again, these were the only two remnants I found.
These fields were used in the cloud v1 implementation, but are no longer needed. All the records have been removed from Sourcegraph.com. As a precaution, I've added ANOTHER round of deletes here.
Closes #43115
As I dug into why we're allocating a ton of memory in io.Copy when forwarding the stdout from exec calls to the writer, I found that we allocate 128KB every time, while many calls actually only write 40 bytes (a commit SHA).
I then looked at gitaly and noticed that we use a very old version of their implementation.
So here we go, updating it to the latest.
Note that we currently don't need NewSyncWriter, but I kept it anyway; it may remind us of its importance if we ever need it.
List of commits from gitaly: https://gitlab.com/gitlab-org/gitaly/-/commits/master/streamio/stream.go?ref_type=heads
---
streamio: Implement synchronized writer
Writing to gRPC streams concurrently isn't allowed, which is why we need
to synchronize those writes. Doing so with the `NewWriter()` is
repetitive, so this commit implements a new function `NewSyncWriter()` which
receives a mutex in addition to the callback function. The mutex is locked
before and unlocked after invoking the callback.
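The idea, as a self-contained Go sketch (the real implementation lives in gitaly's streamio package and may differ in detail):

```go
package sketch

import (
	"io"
	"sync"
)

// NewSyncWriter returns a writer whose Write locks the shared mutex
// before invoking the send callback, so several goroutines can safely
// write to the same gRPC stream.
func NewSyncWriter(m *sync.Mutex, send func(p []byte) error) io.Writer {
	return &syncWriter{m: m, send: send}
}

type syncWriter struct {
	m    *sync.Mutex
	send func(p []byte) error
}

func (w *syncWriter) Write(p []byte) (int, error) {
	w.m.Lock()
	defer w.m.Unlock()
	if err := w.send(p); err != nil {
		return 0, err
	}
	return len(p), nil
}
```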
---
streamio: Simplify implementation of ReadFrom
The ReadFrom implementation is quite hard to read as the loop depends on
state from the previous iteration. This commit refactors the code to not
do so anymore by always sending data if we've read something, even if
there was a read error.
---
streamio: Clarify why read errors are deferred
The implementation of `Read()` only returns read errors in case no data
is left in the buffer. This is required such that we correctly drain any
buffered data and only return the error when no data is left. This may not be
immediately clear, so this commit adds a comment explaining it and a test case
which exercises this code path.
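A sketch of that behaviour, mirroring the logic rather than copying the gitaly code: Read refills from the receiver, drains buffered data first, and only surfaces the stored error once the buffer is empty.

```go
package sketch

// reader buffers data from a receive callback. A receive call may
// return data and an error at the same time; the error is remembered
// and only returned once all buffered bytes have been read.
type reader struct {
	receive func() ([]byte, error)
	buf     []byte
	err     error
}

func (r *reader) Read(p []byte) (int, error) {
	if len(r.buf) == 0 && r.err == nil {
		r.buf, r.err = r.receive()
	}
	if len(r.buf) > 0 {
		n := copy(p, r.buf)
		r.buf = r.buf[n:]
		return n, nil // defer any error until the buffer is drained
	}
	return 0, r.err
}
```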
---
streamio: remove custom ReadFrom and WriteTo
This replaces sendWriter.ReadFrom and receiveReader.WriteTo with
trivial implementations based on io.Copy, to be removed in 14.0.
The purpose of the io.ReaderFrom and io.WriterTo interfaces is to
optimize away a buffer allocation in io.Copy, and/or being able to do
something smart like calling sendfile(2). Without them, io.Copy
allocates a 32KB buffer and reads into / writes from that buffer in a
loop.
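For reference, a simplified sketch of io.Copy's structure (condensed from the standard library; the real code has extra error handling):

```go
package sketch

import "io"

// copySimplified prefers the interfaces and only falls back to
// allocating a 32KB buffer when neither side implements them.
func copySimplified(dst io.Writer, src io.Reader) (int64, error) {
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst) // source drives the copy
	}
	if rf, ok := dst.(io.ReaderFrom); ok {
		return rf.ReadFrom(src) // destination drives the copy
	}
	buf := make([]byte, 32*1024) // the fallback buffer discussed above
	var written int64
	for {
		nr, rerr := src.Read(buf)
		if nr > 0 {
			nw, werr := dst.Write(buf[:nr])
			written += int64(nw)
			if werr != nil {
				return written, werr
			}
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}
```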
The thing is, sendWriter.ReadFrom and receiveReader.WriteTo neither
make a difference for allocations, nor do they do something "smart".
First of all, sendWriter.ReadFrom was not helping the buffer
allocation efficiency because it allocated its own 128KB buffer. In
fact, sendWriter.ReadFrom is equivalent to what io.Copy does on the
inside, so we might as well let io.Copy do that.
That leaves receiveReader.WriteTo as a potentially useful
optimization. It does prevent io.Copy from allocating that 32KB
buffer. However, every time it calls receiver(), gRPC allocat...
---
Remove streamio ReaderFrom and WriterTo
These were set to be removed in v14.
Changelog: removed
---
streamio.Reader: remember errors
If there has been an error, subsequent calls to Read should keep
returning that error.
Changelog: fixed
## Test plan
Verified locally that sourcegraph still works for basic clickops. CI will verify that archive fetching etc still behaves as previously.
I will take another look at the cloud profiler once this has been rolled out.
This seems to have been accidentally misnamed in the recent refactor to the p4 command pattern.
## Test plan
Code review, we currently cannot write perforce "unit" tests.
This PR adds support for the `search-content-based-lang-detection` feature flag
to unindexed search. It introduces a new type of matcher called `langMatcher`
that uses go-enry to determine a file's language.
The PR is big, but it's broken into commits. Main pieces:
* Refactor `search_regex` to introduce a `fileLoader` object
* Add language matching using go-enry
* Introduce new `IncludeLangs` and `ExcludeLangs` fields so we can pipe
language filters directly to searcher
Most file changes are just due to renaming `IncludePatterns` -> `IncludePaths`.
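A minimal sketch of the matching idea using go-enry; `langMatches` is a hypothetical reduction of the matcher, and the real langMatcher handles more cases:

```go
package main

import (
	"fmt"

	enry "github.com/go-enry/go-enry/v2"
)

// langMatches detects the language from the file name and contents,
// then checks it against the include/exclude lists piped down from the
// query.
func langMatches(name string, content []byte, include, exclude []string) bool {
	lang := enry.GetLanguage(name, content)
	for _, l := range exclude {
		if l == lang {
			return false
		}
	}
	if len(include) == 0 {
		return true
	}
	for _, l := range include {
		if l == lang {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(langMatches("main.go", []byte("package main\n"), []string{"Go"}, nil)) // true
}
```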
This adds a new flag that allows us to run `sg ops update-images --only='gitserver'` to update only the image of a specific service.
## Test plan
Code review.
* Revert "Fix redirect (#60614)"
This reverts commit 44c4f2ccf4.
* Revert "Set returnTo when redirecting to a third-party auth provider on dotcom (#60508)"
This reverts commit ab9050fae2.
- App is no longer a thing, and the janitor really has to run; we shouldn't comment it out
- Tree test double-initialized the gitserver
- Removed duplicate reponame variable
## Test plan
Small code style changes, CI will find issues.
We no longer need this endpoint. I also found a few other things related to internalapi along the way that could use a cleanup.
## Test plan
Integration tests should verify that conf works just as before.
All our containers should be using tini as PID 1 to clean up zombies. In the
case of syntax-highlighter we did not. Additionally, we used a janky inline
shell script so we could inject the $WORKERS envvar as an argument.
This commit moves the janky script into an entrypoint script and calls it via
tini. Additionally, it uses exec so we don't have a parent shell messing with
signals.
I picked the name syntect_server_entrypoint since I had some concerns this
file would end up in the server docker image, so I wanted to give it a unique
name. This may also make it possible for us to use this script in server if we
choose to.
Test Plan: built the image and ran it, then checked that the process was
running with its child workers correctly.
bazel build //docker-images/syntax-highlighter:image_tarball
docker load --input $(bazel cquery //docker-images/syntax-highlighter:image_tarball --output=files)
docker run --rm=true syntect-server:candidate
Then on the running container:
$ docker exec trusting_leavitt ps aux
PID USER TIME COMMAND
1 root 0:00 /sbin/tini -- /syntect_server_entrypoint.sh
7 root 0:00 /usr/local/bin/http-server-stabilizer -listen=:9238 -prometheus-app-name=syntax_highlighter -workers=4 -- env ROCKET_PORT={{.Port}} /syntect_server
14 root 0:02 /syntect_server
15 root 0:02 /syntect_server
17 root 0:02 /syntect_server
18 root 0:02 /syntect_server
157 root 0:00 ps aux
Adds a new:
- gazelle generator
- rule + rule targets + catchall target
for generating go-mockgen mocks and testing that they are up to date.
Each go_mockgen macro invocation adds targets for generating mocks, copying to the source tree, as well as testing whether the current source tree mocks are up-to-date.
How to use this: `bazel run //dev:go_mockgen` for the catch-all, or `bazel run //some/target:generate_mocks` for an individual package, and `bazel test //some/target:generate_mocks_tests` to test for up-to-date-ness. There is no catch-all for testing.
This currently uses a fork of go-mockgen, with an open PR for upstream here: https://github.com/derision-test/go-mockgen/pull/50.
Closes https://github.com/sourcegraph/sourcegraph/issues/60099
## Test plan
Extensive testing during development, including the following cases:
- Deleting a generated file and its entry in a go_library/go_test `srcs` attribute list and then re-running `sg bazel configure`
- Adding a non-existent output directory to mockgen.test.yaml and running the bash one-liner emitted to prepare the workspace for rerunning `sg bazel configure`
The existing config already exercises a lot of paths anyway (creating mocks for a 3rd-party library's interface, entries for a given output file in >1 config file, etc.).