Historically, sourcegraph.com was the only instance, and it was connected only to GitHub.com and GitLab.com. Configuration was meant to be as simple as possible, and we wanted everyone to be able to try it on any repo, so public repos were added on demand when browsed from these code hosts.

Since then, dotcom is no longer the only instance, and on-demand cloning is a special case that exists only for sourcegraph.com. It adds a lot of complexity and extra code paths that we don't test well enough today. We want to make dotcom simpler to understand, so we've decided to disable that feature and instead maintain a list of repositories that we have on the instance. We already disallowed several repos half a year ago by heavily restricting the size of repos with few stars. This is basically a continuation of that.

In the diff, you'll mostly find deletions. This PR does little other than remove the code paths in the repo syncer that were only enabled in dotcom mode, and then remove code that became unused as a result.

## Test plan

Ran a dotcom mode instance locally; it did not behave differently from a regular instance with respect to repo cloning. We will need to verify during the rollout that we're not suddenly hitting code paths that don't scale to dotcom's size.

## Changelog

Dotcom no longer clones repos on demand.
gitserver
Mirrors repositories from their code host. All other Sourcegraph services talk to gitserver when they need data from git. Requests for fetch operations, however, go through repo-updater.
gitserver exposes an "exec" API over HTTP for running git commands against clones of repositories. gitserver also exposes APIs for the management of clones.
The management of clones comprises most of the complexity in gitserver since:
- We want to avoid concurrent clones and fetches of the same repository.
- We want to limit the number of concurrent clones and fetches.
- When adding/removing/modifying a clone, concurrent attempts to run commands need to be dealt with gracefully.
- We need to be robust against the many ways git clones can degrade. (gc, interrupted clones)
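The first two constraints in the list above can be sketched as a per-repository "in flight" set combined with a counting semaphore. This is only a minimal illustration under stated assumptions: `cloneLimiter`, `tryAcquire`, and `release` are hypothetical names and are not gitserver's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// cloneLimiter is a hypothetical helper (not gitserver's real type). It allows
// at most one in-flight clone/fetch per repository and caps the total number
// of clones/fetches running at once.
type cloneLimiter struct {
	mu       sync.Mutex
	inFlight map[string]struct{} // repos with a clone or fetch in progress
	slots    chan struct{}       // buffered channel used as a counting semaphore
}

func newCloneLimiter(maxConcurrent int) *cloneLimiter {
	return &cloneLimiter{
		inFlight: make(map[string]struct{}),
		slots:    make(chan struct{}, maxConcurrent),
	}
}

// tryAcquire reports whether the caller may start work on repo. It returns
// false immediately if the repo is already being cloned or fetched. When it
// returns true, the caller must call release once the work is done.
func (l *cloneLimiter) tryAcquire(repo string) bool {
	l.mu.Lock()
	if _, busy := l.inFlight[repo]; busy {
		l.mu.Unlock()
		return false
	}
	l.inFlight[repo] = struct{}{}
	l.mu.Unlock()

	l.slots <- struct{}{} // blocks while maxConcurrent operations are running
	return true
}

// release frees the global slot and marks repo as idle again.
func (l *cloneLimiter) release(repo string) {
	<-l.slots
	l.mu.Lock()
	delete(l.inFlight, repo)
	l.mu.Unlock()
}

func main() {
	l := newCloneLimiter(2)
	fmt.Println(l.tryAcquire("github.com/a/b")) // true: nothing in flight yet
	fmt.Println(l.tryAcquire("github.com/a/b")) // false: same repo is busy
	l.release("github.com/a/b")
	fmt.Println(l.tryAcquire("github.com/a/b")) // true: repo is idle again
}
```

A real implementation would also need cancellation and a way for waiting callers to observe the result of the in-flight operation; the sketch leaves those out.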
Additionally, we have invested heavily in the observability of gitserver. Nearly every operation Sourcegraph does runs one or more git commands, so we have detailed observability in Prometheus, net/event, Jaeger, Honeycomb, and stderr logs.
We normalize repository names when storing them on disk. Always use
protocol.NormalizeRepo. The $GIT_DIR of a repository is at
reposRoot/normalized_name/.git.
When doing an operation on a file or directory which may be concurrently
read/written please use atomic filesystem patterns. This usually involves
heavy use of os.Rename. Search for existing uses of os.Rename to see
examples.
Scaling
gitserver's memory usage consists mostly of short-lived git subprocesses.
This is an IO and compute heavy service since most Sourcegraph requests will trigger 1 or more git commands. As such we shard requests for a repo to a specific replica. This allows us to horizontally scale out the service.
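To illustrate the idea of pinning a repository to one replica, here is a minimal sketch that hashes the repository name and picks a replica by modulo. The hash function, the `replicaFor` name, and the replica addresses are all assumptions made for this example; the actual sharding scheme used by Sourcegraph may differ.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// replicaFor is a hypothetical helper that deterministically maps a
// repository name to one of the given replica addresses. It returns an empty
// string when no replicas are configured.
func replicaFor(repo string, replicas []string) string {
	if len(replicas) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(repo)) // Write on a hash.Hash never returns an error
	return replicas[h.Sum32()%uint32(len(replicas))]
}

func main() {
	// Example addresses only; not real service names.
	replicas := []string{"gitserver-0:3178", "gitserver-1:3178", "gitserver-2:3178"}

	// Every caller that uses the same replica list gets the same answer for
	// the same repository, so all requests for that repo reach one clone.
	fmt.Println(replicaFor("github.com/gorilla/mux", replicas))
	fmt.Println(replicaFor("github.com/gorilla/mux", replicas))
}
```

Plain modulo hashing reassigns many repositories whenever the number of replicas changes, so a production scheme would likely prefer something like consistent or rendezvous hashing to limit re-cloning.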
The service is stateful (maintaining git clones). However, it only contains data mirrored from upstream code hosts.
Perforce depots
Syncing of Perforce depots is accomplished by either p4-fusion or git p4 (deprecated), both of which clone Perforce depots into Git repositories in gitserver.
p4-fusion in development
To use p4-fusion while developing Sourcegraph, there are a couple of options.
Docker
Run gitserver in a Docker container. This is the option that gives an experience closest to a deployed Sourcegraph instance, and will work for any platform/OS on which you're developing (running sg start).
Bazel
Native binaries are provided through Bazel, built via Nix in our fork of p4-fusion. p4-fusion can be invoked either through ./dev/p4-fusion-dev or directly with bazel run //dev/tools:p4-fusion.