Closes SRCH-802
After a long and unsuccessful hunt for the eventual consistency errors, I
noticed that we don't actually need the GraphQL fragments that were causing
problems.
I introduced a simplified request, which succeeds reliably. What
customers should see now is one error (see second picture below), which
is followed by a retry, and then success.
Previously, we saw this error with varying frequency, but customers hit it
often enough that it caused their batch changes to fail.
<img width="1042" alt="Screenshot 2024-08-06 at 14 02 00"
src="https://github.com/user-attachments/assets/c5a099da-d474-43db-ac12-ac7c4f22d4d3">
Customers may still see an initial error, which in my testing reliably
disappears on the first retry. It should not be a problem:
<img width="1039" alt="Screenshot 2024-08-06 at 14 02 48"
src="https://github.com/user-attachments/assets/99f278d4-4020-48ea-b4ac-32cbdfce455d">
## Test plan
Manual testing
## Changelog
- fix(batches): improve GitHub Apps integration reliability by
simplifying the data requested from GitHub
This is a register call that is easy to forget. When forgotten, all queries against the repo store will block forever.
In addition, this adds a hard dependency on conf to every service's startup, plus a busy loop. With multi-tenancy, this will not work well because authz providers would be a global, and we instead want most things to be ephemeral and scoped per provider. This is a step toward that, but it doesn't yet remove the providers global variable.
Good news: it turns out that we don't actually need to register the providers in every service! The only reason registration was required was to check whether zero providers are configured, or whether authz bypass mode is enabled.
Authz bypass mode is usually ON; it only turns off when there are problems with the authz providers, i.e., when some providers might not be able to sync permissions. Permissions are only ever actually bypassed if there are ALSO zero providers configured.
So this is basically an optimization for the case where an instance has zero authz providers configured, making the SQL queries a bit simpler. It also helps in tests, because with bypass mode on and no providers configured, authz enforcement is effectively off in the repo store.
This means tests need to do slightly more work, but the setup is also more realistic compared to runtime. Also, it's highly recommended to use DB mocks wherever possible in higher-level components to keep tests fast.
To rule out a scenario where we accidentally enable bypass mode erroneously, this PR drops it entirely. Authz is now always enforced, but when a code host connection is unrestricted (i.e., it will not spawn a provider), the repos are still visible, so this should be no change from before.
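For illustration, here's a minimal sketch of the kind of registration gate described above; the names (`SetProviders`, `waitForProviders`) and the channel-based signal are hypothetical, not the actual code:

```go
// Hypothetical sketch: a package-level "providers registered" signal that
// repo-store queries wait on. If a service never calls SetProviders, every
// query blocks forever, which is the failure mode described above.
package authz

import (
	"context"
	"sync"
)

// Provider is a stand-in for an authz provider implementation.
type Provider interface{ ID() string }

var (
	providersOnce sync.Once
	providersSet  = make(chan struct{})
	providers     []Provider
)

// SetProviders is the register call that every service had to remember to run.
func SetProviders(ps []Provider) {
	providersOnce.Do(func() {
		providers = ps
		close(providersSet)
	})
}

// waitForProviders blocks until SetProviders has been called (or the context
// expires), which is why a forgotten registration stalls all repo queries.
func waitForProviders(ctx context.Context) error {
	select {
	case <-providersSet:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```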
## Test plan
The stack starts and works, and all CI tests are still passing. Code review should help as well.
Closes SRCH-741
Closes SRCH-716
This PR moves the GitHub App installation handling from the redirect flow
to a webhook-based approach. We now expect GitHub to call the webhook when
the installation is ready, and therefore we shouldn't see the errors
explained in the issues above.
To handle the potential delay until the webhook is called and the
credential is set up, I added a scrappy info notice telling the user to
refresh their page:
<img width="928" alt="Screenshot 2024-07-24 at 13 48 24"
src="https://github.com/user-attachments/assets/4d298f2a-d7b8-423b-9e2f-2ae53fbce1ac">
Below is what you see after refreshing (or if the webhook was called
before the user was redirected back to the settings):
<img width="929" alt="Screenshot 2024-07-24 at 13 50 14"
src="https://github.com/user-attachments/assets/b6826158-8561-476d-b20e-e36f8cfb86fd">
I'm able to create PRs for sourcegraph-testing with an app that was
created this way.
<img width="1171" alt="Screenshot 2024-07-24 at 16 16 06"
src="https://github.com/user-attachments/assets/86e20acb-136f-4a46-a33b-bdfdd0d51d71">
I'm seeing an error when getting an access token with a personal GitHub
App to run a batch change, but that will be handled in another PR.
<img width="1053" alt="Screenshot 2024-07-24 at 16 38 38"
src="https://github.com/user-attachments/assets/5655ba91-1ae4-453a-8d5c-1bcdbe34bc17">
## Test plan
Manual testing locally, with more testing to be done on S2, where we have
a more production-like environment
## Changelog
- When installing a GitHub app for batch changes, the instance now waits
for a callback from GitHub to complete the installation to avoid issues
from eventual consistency.
---------
Co-authored-by: Peter Guy <peter.guy@sourcegraph.com>
This method can be used to find a common ancestor for many commit SHAs,
to be used by code intel for finding visible uploads.
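As a rough sketch of what such a method could do, assuming it shells out to git; the function name and the use of `git merge-base --octopus` are illustrative, not necessarily the actual gitserver implementation:

```go
package gitcli

import (
	"context"
	"fmt"
	"os/exec"
	"strings"
)

// CommonAncestor returns a best common ancestor of the given commit SHAs in
// the repository at dir, using git's octopus merge-base.
func CommonAncestor(ctx context.Context, dir string, shas ...string) (string, error) {
	if len(shas) == 0 {
		return "", fmt.Errorf("no commits given")
	}
	args := append([]string{"-C", dir, "merge-base", "--octopus"}, shas...)
	out, err := exec.CommandContext(ctx, "git", args...).Output()
	if err != nil {
		return "", fmt.Errorf("merge-base: %w", err)
	}
	return strings.TrimSpace(string(out)), nil
}
```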
Closes SRC-485
Test plan: Added unit tests for the several layers (client, grpc,
gitcli).
This PR adds better gating for disabling the code monitors, notebooks,
search jobs, and Own features.
## Test plan
Ran locally and verified the features are gone when the env var is set.
Ran again without those and they worked.
---------
Co-authored-by: Stefan Hengl <stefan@sourcegraph.com>
Closes SRCH-685
When a GitHub app credential is used to push a commit, the commits
aren't signed.
With this PR, we re-use the `DuplicateCommit` method to create a signed
commit so users who don't have a standalone Commit Signing app installed
still get a signed commit when the commit is created with a GitHub app.

To avoid cyclic imports, I had to move the `AuthenticationStrategy`
const to the `internal/types` package and rename it to
`SourceAuthenticationStrategy` so its purpose is clear.
## Test plan
Add a GitHub app as a credential, then create a changeset targeting the
repo that the GitHub app is installed on.
The final commit should be signed.
This is part of the Keyword GA Project.
Batch Changes uses Sourcegraph queries to define the list of repositories on which the batch change will run.
With this change we default to pattern type "keyword" instead of "standard".
To make this a backward compatible change, we also introduce a version identifier to batch specs. Authors can specify `version: 2` in the spec, in which case we default to pattern type "keyword". Existing specs (without a specified version) and specs with `version: 1` will keep using pattern type "standard".
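For illustration, here's a minimal sketch of the version-dependent default; the type and function names are made up, not the actual batch spec code:

```go
package batchspec

import "fmt"

type PatternType string

const (
	PatternTypeStandard PatternType = "standard"
	PatternTypeKeyword  PatternType = "keyword"
)

// DefaultPatternType returns the search pattern type to use for "on:" queries
// when the query itself doesn't specify one. Version 0 means the spec did not
// set a version, which is treated like version 1 for backward compatibility.
func DefaultPatternType(specVersion int) (PatternType, error) {
	switch specVersion {
	case 0, 1:
		return PatternTypeStandard, nil
	case 2:
		return PatternTypeKeyword, nil
	default:
		return "", fmt.Errorf("unsupported batch spec version: %d", specVersion)
	}
}
```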
Notes:
- Corresponding doc update [PR](https://github.com/sourcegraph/docs/pull/477)
- We don't have a query input field; instead, the query is defined in the batch spec YAML. It didn't feel right to edit the YAML and append "patternType: " on save, which is what we do for Code Monitors and Insights.
- I misuse the pattern type query parameter to effectively override the version. Once we introduce "V4" we should come back here and clean up. I left a TODO in the code.
Test plan:
- New and updated unit tests
- Manual testing:
  - New batch changes use `version: 2` by default
  - Using an unsupported version returns an error
  - Ran various "on:" queries to verify that version 2 uses keyword search and version 1 uses standard search
Closes SRCH-663
This is a follow-up to previous PRs, where we added database fields to
support the new GitHub Apps integration.
See the initiative "Batch Changes using GitHub App auth" on Linear.
## Test plan
- Manual testing
---------
Co-authored-by: Bolaji Olajide <25608335+BolajiOlajide@users.noreply.github.com>
Historically, sourcegraph.com has been the only instance. It was
connected to GitHub.com and GitLab.com only.
Configuration should be as simple as possible, and we wanted everyone to
try it on any repo. So public repos were added on-demand when browsed
from these code hosts.
Since then, dotcom is no longer the only instance, and this is a special case
that only exists for sourcegraph.com.
This causes a bunch of additional complexity and various extra code
paths that we don't test well enough today.
We want to make dotcom simpler to understand, so we've made the decision
to disable that feature, and instead we will maintain a list of
repositories that we have on the instance.
We already disallowed many repos half a year ago by heavily restricting the
allowed size of repos with few stars.
This is basically just a continuation of that.
In the diff, you'll mostly find deletions. This PR does not do much
other than remove the code paths that were only enabled in dotcom mode
in the repo syncer, and then remove code that became unused as a result.
## Test plan
Ran a dotcom-mode instance locally; it did not behave differently from a
regular instance with respect to repo cloning.
We will need to verify during the rollout that we're not suddenly
hitting code paths that don't scale to the dotcom size.
## Changelog
Dotcom no longer clones repos on demand.
This method is still quite messy, but here's some incremental
improvement:
- The unused unique_ref parameter has been removed to simplify the API
- The git_apply_args parameter has been replaced with patch_filenames_no_prefix, so clients can no longer pass arbitrary arguments, which improves security
- The repo existence check now lives in the gRPC layer like for all
other resolvers
- Added a docstring to describe how it works
Test plan:
Integration tests are still passing, created a batch change locally to
verify it can still correctly create a PR.
With this patch, the `errors.HasType` API behaves similarly to `Is` and `As`:
it checks the full error tree instead of just a linearized version of it.
This matters because cockroachdb/errors's `HasType` implementation does not
respect multi-errors.
As a consequence, a bunch of relationships between HasType and Is/As that
you'd intuitively expect to hold are now actually true; see the changes to `invariants_test.go`.
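For illustration, a small example of the new behavior, assuming the `lib/errors` import path, the `HasType(err, reference)` signature, and multi-error construction via `Append` (all assumptions, not copied from the test file):

```go
package main

import (
	"fmt"

	"github.com/sourcegraph/sourcegraph/lib/errors"
)

type notFoundError struct{ name string }

func (e *notFoundError) Error() string { return e.name + " not found" }

func main() {
	// Build a multi-error that contains a typed error somewhere in its tree.
	err := errors.Append(
		errors.New("unrelated failure"),
		errors.Wrap(&notFoundError{name: "repo"}, "lookup"),
	)

	// As and Is walk the whole tree; with this patch HasType agrees with them
	// instead of only checking a linearized chain.
	fmt.Println(errors.As(err, new(*notFoundError)))   // expected: true
	fmt.Println(errors.HasType(err, &notFoundError{})) // expected: true (previously could be false)
}
```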
This PR is a result of, and follow-up to, the improvements we made in the [SAMS repo](https://github.com/sourcegraph/sourcegraph-accounts/pull/199), which allow call sites to pass down a context (primarily to indicate a deadline, and of course cancellation if desired) and collect the error returned from a `background.Routine`'s `Stop` method.
Note that I did not adopt returning an error from the `Stop` method, because I realized that in the monorepo the more common (and arguably the desired) pattern is to block on the `Start` call until `Stop` is called, so it is meaningless to collect errors from `Start` methods as return values anyway, and doing so would also complicate the design and semantics more than necessary.
All usages of `background.Routine` and `background.CombinedRoutines` are updated; I DID NOT try to interpret the code logic or make anything better beyond fixing compile and test errors.
The only file that contains the core change is the [`lib/background/background.go`](https://github.com/sourcegraph/sourcegraph/pull/62136/files#diff-65c3228388620e91f8c22d91c18faac3f985fc67d64b08612df18fa7c04fafcd).
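A minimal sketch of the resulting shape, as I understand it from the description above; the exact interface may differ from `lib/background/background.go`:

```go
package background

import (
	"context"
	"time"
)

// Routine is a long-running background process. Stop now takes a context so
// callers can bound how long they are willing to wait for shutdown.
type Routine interface {
	Start()
	Stop(ctx context.Context)
}

// Example caller: give shutdown a deadline instead of waiting indefinitely.
func shutdownWithDeadline(ctx context.Context, r Routine) {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	r.Stop(ctx)
}
```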
This PR is the result of quite a journey. You'll see a bunch of meaningless one-liners, and yes - this was a bit frustrating :D
tl;dr I noticed that S2 was fetching repos a ton, but the numbers didn't check out. Why would we need 9 fetches a second to keep 1100 repos fresh, most of which are very inactive?
Turns out, we trigger fetches for repos that are marked as modified during code host syncing.
That in itself seems mostly unnecessary (so I reduced the conditions under which it triggers an immediate update),
and the condition for detecting whether a repo was modified was also computed incorrectly in a bunch of cases.
Those are fixed by one-liners here as well. Most likely this is not a complete list, but it's what I found locally.
Here's some more writeup:
https://www.notion.so/sourcegraph/May-22-2024-S2-fetches-repos-a-lot-e19c282bb3564140b09086ebce172450?pvs=4
Test plan:
See the Notion write-up as well; I verified that repos are no longer erroneously marked as modified in my local instance.
We want to migrate the LogReverseEach call to gRPC methods as well.
However, it isn't great practice to keep this process running for potentially hours, as any server restart forces the entire process to start over.
With one call per commit, we're not occupying a process slot on gitserver, and gitserver rollouts should affect rockskip much less, as individual commits can be retried before being returned to the rockskip code.
Closes #62100
Test plan:
Existing tests still pass. I let it index a repo locally; that was fast, and I get results for symbols instantly now. I'm not a Rockskip expert, though, so I'm requesting review from the owners.
There are potentially a lot of FDs in a repo, so we return an iterator instead, to encourage callers to consume entries one by one rather than loading everything into memory.
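A hypothetical sketch of the iterator shape; the interface and field names are made up for illustration, not the actual gitserver API:

```go
package gitserver

import "io"

// FD is a stand-in for whatever entry type the iterator yields; the fields
// are illustrative only.
type FD struct {
	Path string
	Size int64
}

// FDIterator yields entries one at a time so callers never have to hold the
// whole result set in memory.
type FDIterator interface {
	Next() (FD, error) // returns io.EOF when exhausted
	Close() error
}

// consume shows the intended usage: process entries one by one.
func consume(it FDIterator) error {
	defer it.Close()
	for {
		fd, err := it.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		_ = fd // handle a single entry here
	}
}
```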
Test plan:
Existing tests still pass.
This is only used in one place, so we're moving the implementation to the caller; the gitserver client should only expose gRPC methods without additional logic on top.
Test plan:
Kept the existing tests but rewrote them to test Commits instead so we don't lose coverage for this.
This PR reimplements what CommitLog does using Commits. This reduces the surface area, but also removes an API that wasn't properly respecting sub-repo permissions.
Another side-effect of this is that we have to make more gRPC requests for indexing, but in return we accumulate less data in memory as we don't need to store all paths in all diffs here.
Test plan:
Existing tests still pass, adjusted test suite as well.
This has only one caller and is now implemented using existing gitserver stuff, so I'm moving this custom response mapping to where it's used. One less API we need to expose via gRPC.
Test plan:
Adjusted tests, and existing E2E tests still pass.
This PR replaces the gitserver client method RevList by the Commits method.
The Commits method can do the same, and it saves us from implementing and testing a second API. Also, the name RevList sounded very generic, but the method was very opinionated in the flags it used; it was really "ListCommitsForRockskip", which is too application-specific, so Commits is probably just as good.
We'll learn a bunch about our API endpoints as we finish migrating them all, so we can do proper optimizations afterwards.
Closes https://github.com/sourcegraph/sourcegraph/issues/62098
Test plan:
Integration and unit test for rockskip are still passing.
This PR consolidates the client methods CommitGraph and Commits into one, to reduce the surface of APIs we have to migrate to gRPC and then maintain.
To the best of my understanding, this is equivalent; it just uses a different API.
Test plan:
Existing test suites and integration tests are still passing.
We want to migrate this call to gRPC. However, it isn't good practice to keep this process running for potentially hours, as any server restart forces the process to start over.
With pagination, we're not occupying a process slot on gitserver, and gitserver rollouts should affect rockskip much less, as individual pages can be retried before being returned to the rockskip code.
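A rough sketch of the pagination pattern, with made-up method and field names rather than the real gitserver client API:

```go
package rockskip

import "context"

type page struct {
	Commits []string
	Cursor  string // empty when there are no more pages
}

type client interface {
	// listCommitsPage fetches one bounded page; each call is short-lived, so a
	// gitserver restart only forces a retry of the current page, not the whole run.
	listCommitsPage(ctx context.Context, repo, cursor string, limit int) (page, error)
}

// allCommits walks the pages until the cursor is exhausted.
func allCommits(ctx context.Context, c client, repo string) ([]string, error) {
	var out []string
	cursor := ""
	for {
		p, err := c.listCommitsPage(ctx, repo, cursor, 1000)
		if err != nil {
			return nil, err
		}
		out = append(out, p.Commits...)
		if p.Cursor == "" {
			return out, nil
		}
		cursor = p.Cursor
	}
}
```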
Test plan:
Existing tests still pass, but I'm not a Rockskip expert, so I'm requesting review from the owners.
This PR moves the Fetch and Delete operations onto a separate service.
With that, we also disallow consumers of the internal/gitserver.Client
interface from using these methods. Repo-updater's scheduler has authority
over this, and it should not be interfered with. If a caller wants to make
a scheduling _suggestion_, it will be able to talk to repo-updater.
This prepares for a better scheduler that has full control over the operations
running.
Test plan:
E2E and integration tests will verify that repos are still cloned properly.
This PR migrates the Diff method to use a proper gRPC method. I decided to call it RawDiff as it streams back the raw diff format.
Later, we can decide to also add a structured endpoint, but for now we'll let the caller decide what to do with the diff string.
There's probably a use-case for both versions.
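A hedged sketch of how a caller might consume the streamed raw diff, assuming it is exposed as an `io.ReadCloser`; the exact signature may differ:

```go
package example

import (
	"context"
	"io"
)

// rawDiffer is a stand-in for the gitserver client's RawDiff capability.
type rawDiffer interface {
	RawDiff(ctx context.Context, repo, base, head string) (io.ReadCloser, error)
}

// renderDiff streams the raw diff to a writer without buffering it all in
// memory, leaving interpretation of the diff format to the caller.
func renderDiff(ctx context.Context, c rawDiffer, repo, base, head string, w io.Writer) error {
	r, err := c.RawDiff(ctx, repo, base, head)
	if err != nil {
		return err
	}
	defer r.Close()
	_, err = io.Copy(w, r)
	return err
}
```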
Closes https://github.com/sourcegraph/sourcegraph/issues/61868
Test plan:
Added tests across all the layers, verified locally that the diff view works.
This PR migrates the ref-specific calls to a gRPC endpoint.
While looking at this, I realized that we have many methods that want the same thing in the end: refs.
So instead of adding various endpoints (that currently use porcelain commands in the client implementation), we now use the plumbing command `for-each-ref` for all of them.
By that, we can combine the following APIs:
- `ListRefs` -> `ListRefs` (stays the same)
- `ListBranches` -> `ListRefs(headsOnly: true)`
- `ListTags` -> `ListRefs(tagsOnly: true)`
- For endpoints that were interested in both branches and tags, we allow the combination of the two `only` flags
- `BranchesContaining` -> `ListRefs(contains: sha)`
- `RefDescriptions` -> `ListRefs(headsOnly: true, tagsOnly: true)`
Through one simple call to `for-each-ref`, we also get consistent ordering throughout. This also fixes an issue and improves performance for the branch picker, where ordering would become random once the ref count hit 1000.
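A rough sketch of the consolidated call, with illustrative option names rather than the exact new client API:

```go
package example

import "context"

// ListRefsOpts expresses the former APIs as flags on a single call.
type ListRefsOpts struct {
	HeadsOnly bool     // branches (refs/heads); combine with TagsOnly for both
	TagsOnly  bool     // tags (refs/tags)
	Contains  []string // only refs containing these commit SHAs
}

type Ref struct {
	Name     string
	CommitID string
}

type refLister interface {
	ListRefs(ctx context.Context, repo string, opts ListRefsOpts) ([]Ref, error)
}

// Former APIs expressed through the single call:
//   ListBranches       -> ListRefs(HeadsOnly: true)
//   ListTags           -> ListRefs(TagsOnly: true)
//   BranchesContaining -> ListRefs(Contains: []string{sha})
//   RefDescriptions    -> ListRefs(HeadsOnly: true, TagsOnly: true)
```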
Closes https://github.com/sourcegraph/sourcegraph/issues/55074
Closes https://github.com/sourcegraph/sourcegraph/issues/60415
## Test plan
Added tests at various layers, for gitcli, for the gRPC server, for the new client.
Also, existing tests still pass after this migration, including integration and E2E tests.
Manually ran my instance and clicked around as well, to verify.
Diff already has an option for scoping to a specific path, so this merges the two together to simplify the API surface we need to create for the new gRPC APIs around diffs.
## Test plan
Adjusted a few tests, hoping that integration and E2E tests cover this. Also ran some git commands locally to check if they produce roughly equivalent output.
Code reviews will help as well.
This was just used for one marshal operation in testing, so this simplifies the dependency graph by not importing from cmd/frontend.
Test plan:
Tests still pass, code compiles.
I _ALWAYS_ stumble over this double negation: "if not noensurerevision"... huh?
Also, we should not make it easy for people to accidentally trigger a resolve-revision call on the server, which is generally costly, so having the cheap option as the default seems more desirable.
For now, I haven't changed any of the values, but we should also do another review of the cases where we explicitly set it to true.
Test plan:
Just converted a bool, CI and E2E tests should catch any issues.
This was only used for a single purpose that can be achieved in a different way, and it provided a path around our scheduler, which makes scheduling decisions harder to understand.
These clones would also not have respected the global concurrency limit.
Test plan:
Adjusted tests, tried a reclone locally and it worked.
observation.TestContextTB is better to use, since your logs will be
scoped to your test and it uses a more pedantic Prometheus registry.
To be honest, TestContext should be removed entirely, but this is the first step.
This is a mechanical change. I replaced "&observation.TestContext" with
"observation.TestContextTB(t)", then undid the change each time it
caused a compilation error (which was only a handful of times).
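In code, the mechanical replacement looks roughly like this (assuming the `internal/observation` import path; the surrounding test is made up):

```go
package example_test

import (
	"testing"

	"github.com/sourcegraph/sourcegraph/internal/observation"
)

func TestSomething(t *testing.T) {
	// Before: a shared, package-level context.
	// obsCtx := &observation.TestContext

	// After: logs are scoped to this test and a more pedantic Prometheus
	// registry is used.
	obsCtx := observation.TestContextTB(t)
	_ = obsCtx
}
```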
Test Plan: go test
To bring Gerrit support more in line with other code hosts and because a customer recently ran into this limitation, this PR adds support for the SSH flag.
The diff is mostly straightforward, with two things to watch out for:
- The clone URL construction in makeRepo was wrong previously; it was missing the `/a/` prefix. This URL is only used for display, but it was still annoying, so I moved the whole construction here from the gitserver cloneurl package.
- The SSH hostname and port are configurable in Gerrit, so to construct the right URL we need to fetch some instance metadata. That would be costly to do in the gitserver method, so we persist all the info needed to construct clone URLs "offline" during the cloning process, storing the data for both HTTP and SSH clones in the repo metadata column. This is mostly in line with other code hosts; see GitLab right above in the gitserver/cloneurl package.
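A hedged sketch of the clone URL construction, with made-up field names for the persisted metadata (not the actual repo metadata schema):

```go
package example

import "fmt"

// gerritMetadata stands in for the info persisted on the repo metadata column
// so clone URLs can be built without asking the Gerrit instance again.
type gerritMetadata struct {
	HTTPBase string // e.g. "https://gerrit.example.com"
	SSHHost  string // from instance metadata
	SSHPort  int    // from instance metadata
	Project  string // e.g. "platform/build"
}

// httpCloneURL includes the authenticated "/a/" prefix that was previously missing.
func httpCloneURL(m gerritMetadata) string {
	return fmt.Sprintf("%s/a/%s", m.HTTPBase, m.Project)
}

// sshCloneURL uses the configurable SSH host and port from the instance metadata.
func sshCloneURL(m gerritMetadata) string {
	return fmt.Sprintf("ssh://%s:%d/%s", m.SSHHost, m.SSHPort, m.Project)
}
```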
Closes https://github.com/sourcegraph/sourcegraph/issues/60597
## Test plan
Added tests for the various stages, recreated recorded responses, and tried both HTTP and SSH cloning locally; both worked with our Gerrit instance.
The only value we ever passed here was 1 second, which is effectively "don't debounce".
So we can simplify here.
Really, the client should make sure to go at a reasonable pace, and gitserver on the
other end of the wire can have ultimate authority over whether it's time for an update
or not.
All existing tests still pass, including integration and E2E tests.
Now that we've updated to Go 1.22, we don't need to copy loop variables before
using them in goroutines.
I found these using the regex searches `go func\(\w+` and `\.Go\(func\(\w+`. I
also simplified some non-loop vars when it made sense.
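For illustration, here's the kind of copy the refactor removes, shown on a made-up loop:

```go
// Before Go 1.22 the explicit copy was needed because the loop variable was
// shared across iterations; with Go 1.22 each iteration gets its own variable.
package example

import "sync"

func startWorkers(repos []string) {
	var wg sync.WaitGroup
	for _, repo := range repos {
		// repo := repo // no longer needed as of Go 1.22
		wg.Add(1)
		go func() {
			defer wg.Done()
			process(repo)
		}()
	}
	wg.Wait()
}

func process(repo string) { _ = repo }
```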
## Test plan
Straight refactor, covered by existing tests