Add "life of a search query" documentation (#5574)

Nick Snyder 2019-09-17 12:39:50 -07:00 committed by GitHub
parent fb9ab7bd92
commit 5986edf311
13 changed files with 124 additions and 143 deletions


@ -61,7 +61,7 @@ For detailed instructions and troubleshooting, see the [local development docume
The `docs` folder has additional documentation for developing and understanding Sourcegraph:
- [Project FAQ](./doc/admin/faq.md)
- [Architecture](./doc/dev/architecture.md): high-level architecture
- [Architecture](./doc/dev/architecture/index.md): high-level architecture
- [Database setup](./doc/dev/postgresql.md): database setup and best practices
- [General style guide](./doc/dev/style_guide.md)
- [Code style guide](./doc/dev/code_style_guide.md)

cmd/frontend/README.md Normal file

@ -0,0 +1,9 @@
# frontend
The frontend serves our web application and hosts our [GraphQL API](../api/graphql/index.md).
Typically there are multiple replicas running in production to scale with load.
Application data is stored in our PostgreSQL database.
Session data is stored in the Redis store, and non-persistent data is stored in the Redis cache.


@ -0,0 +1,5 @@
# github-proxy
Proxies all requests to github.com to keep track of rate limits and prevent triggering abuse mechanisms.
There is only one replica running in production.


@ -1,5 +1,7 @@
# gitserver
Mirrors repositories from their code host. All other Sourcegraph services talk to gitserver when they need data from git. Requests for fetch operations, however, go through repo-updater.
gitserver exposes an "exec" API over HTTP for running git commands against
clones of repositories. gitserver also exposes APIs for the management of
clones.
@ -26,3 +28,11 @@ When doing an operation on a file or directory which may be concurrently
read/written please use atomic filesystem patterns. This usually involves
heavy use of `os.Rename`. Search for existing uses of `os.Rename` to see
examples.
#### Scaling
gitserver's memory usage consists of short-lived git subprocesses.
This is an IO- and compute-heavy service, since most Sourcegraph requests trigger one or more git commands. As such, we shard requests for a repo to a specific replica, which allows us to scale the service out horizontally.
The service is stateful (it maintains git clones). However, it only contains data mirrored from upstream code hosts.
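
The sharding idea can be illustrated with a small sketch. This README does not say how a repo is mapped to a replica, so the hash-and-modulo scheme, the function name `shardFor`, and the replica names below are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor is a hypothetical helper that maps a repository name to exactly
// one replica, so every request for that repo lands on the same clone.
// It returns "" when no replicas are configured.
func shardFor(repo string, replicas []string) string {
	if len(replicas) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write([]byte(repo)) // Write on a hash.Hash never returns an error.
	return replicas[h.Sum32()%uint32(len(replicas))]
}

func main() {
	// Replica names are made up for this example.
	replicas := []string{"gitserver-0", "gitserver-1", "gitserver-2"}
	for _, repo := range []string{
		"github.com/sourcegraph/sourcegraph",
		"github.com/sourcegraph/zoekt",
	} {
		fmt.Printf("%s -> %s\n", repo, shardFor(repo, replicas))
	}
}
```

One trade-off of plain modulo sharding is that changing the number of replicas remaps most repos, which for a stateful service means re-cloning them on their new replica.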


@ -0,0 +1,3 @@
# query-runner
Periodically runs saved searches, determines the difference in results, and sends notification emails. It is a singleton service by design, so there must be only one replica.


@ -0,0 +1,3 @@
# repo-updater
Repo-updater tracks the state of repos, and is responsible for automatically scheduling updates ("git fetch" runs) using gitserver. Other services that need updates or fetches should ask repo-updater rather than calling gitserver directly, so that repo-updater can take their changes into account. It is a singleton service by design, so there must be only one replica.

cmd/searcher/README.md Normal file

@ -0,0 +1,7 @@
# searcher
Provides on-demand unindexed search for repositories. It scans through a git archive fetched from gitserver to find results, similar in nature to `git grep`.
Scale this service up as the number of concurrent on-demand searches grows. For a search, the frontend scatters one request per repo@commit across the replicas and then gathers the results. Like gitserver, this is an IO- and compute-bound service. However, its only state is a disk cache, which can be lost at any time without being detrimental.
[Life of a search query](../../doc/dev/architecture/life-of-a-search-query.md)
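
The scatter and gather step can be sketched as follows. This is an illustration only, not the frontend's code: the types `repoRev` and `match`, the `searchFunc` callback that stands in for one request to a searcher replica, and the sort order of the gathered results are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// repoRev and match are hypothetical types for this sketch.
type repoRev struct{ Repo, Commit string }
type match struct{ Repo, Line string }

// searchFunc stands in for one request to a searcher replica.
type searchFunc func(rr repoRev, pattern string) []match

// scatterGather sends one concurrent search per repo@commit (scatter) and
// merges everything into one slice (gather). Results are sorted so the
// output does not depend on goroutine scheduling.
func scatterGather(targets []repoRev, pattern string, search searchFunc) []match {
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		all []match
	)
	for _, rr := range targets {
		wg.Add(1)
		go func(rr repoRev) {
			defer wg.Done()
			ms := search(rr, pattern)
			mu.Lock()
			all = append(all, ms...)
			mu.Unlock()
		}(rr)
	}
	wg.Wait()
	sort.Slice(all, func(i, j int) bool {
		if all[i].Repo != all[j].Repo {
			return all[i].Repo < all[j].Repo
		}
		return all[i].Line < all[j].Line
	})
	return all
}

func main() {
	// Fake repository contents, keyed by repo name.
	contents := map[string][]string{
		"repo/a": {"func search() {}", "func index() {}"},
		"repo/b": {"// search helpers"},
	}
	fake := func(rr repoRev, pattern string) []match {
		var out []match
		for _, line := range contents[rr.Repo] {
			if strings.Contains(line, pattern) {
				out = append(out, match{Repo: rr.Repo, Line: line})
			}
		}
		return out
	}
	targets := []repoRev{{"repo/a", "abc123"}, {"repo/b", "def456"}}
	for _, m := range scatterGather(targets, "search", fake) {
		fmt.Printf("%s: %s\n", m.Repo, m.Line)
	}
}
```

Because every result is held in memory until the gather finishes, the gathering side can use far more memory than the final result set suggests.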

cmd/symbols/README.md Normal file

@ -0,0 +1,9 @@
# symbols
Indexes symbols in repositories using [Ctags](https://github.com/universal-ctags/ctags). Similar in architecture to searcher, except over ctags output.
The ctags output is stored in SQLite files on disk (one per repository@commit). Ctags processing is lazy, so it will occur only when you first query the symbols service. Subsequent queries will use the cached on-disk SQLite DB.
It is used by [basic-code-intel](https://github.com/sourcegraph/sourcegraph-basic-code-intel) to provide the jump-to-definition feature.
It supports regex queries, with queries of the form `^foo$` optimized to perform an index lookup (basic-code-intel takes advantage of this).
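
To illustrate the `^foo$` optimization, here is a minimal sketch of how a query could be recognized as an exact-match lookup. The symbols service's actual detection logic is not described in this README; the function name `exactName` and the rule it applies (anchored at both ends, no regex metacharacters in between) are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// exactName is a hypothetical helper. It reports whether pattern has the
// form ^literal$ and, if so, returns the literal. Such a query can be
// answered with an equality lookup on an indexed column instead of running
// the regex against every symbol name.
func exactName(pattern string) (string, bool) {
	if len(pattern) < 3 || !strings.HasPrefix(pattern, "^") || !strings.HasSuffix(pattern, "$") {
		return "", false
	}
	inner := pattern[1 : len(pattern)-1]
	// QuoteMeta escapes every regex metacharacter, so the text is a plain
	// literal exactly when quoting leaves it unchanged.
	if regexp.QuoteMeta(inner) != inner {
		return "", false
	}
	return inner, true
}

func main() {
	for _, q := range []string{"^foo$", "^fo.o$", "foo"} {
		name, ok := exactName(q)
		fmt.Printf("%q exact=%v name=%q\n", q, ok, name)
	}
}
```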


@ -1,32 +0,0 @@
# Sourcegraph Architecture diagram
```mermaid
graph LR
Frontend-- HTTP -->gitserver
searcher-- HTTP -->gitserver
query-runner-- HTTP -->Frontend
query-runner-- Graphql -->Frontend
repo-updater-- HTTP -->github-proxy
github-proxy-- HTTP -->github[github.com]
repo-updater-- HTTP -->codehosts[Code hosts: GitHub Enterprise, Bitbucket, etc.]
repo-updater-->redis-cache
Frontend-- HTTP -->query-runner
Frontend-->redis-cache["Redis (cache)"]
Frontend-- SQL -->db[Postgresql Database]
Frontend-->redis["Redis (session data)"]
Frontend-- HTTP -->searcher
Frontend-- HTTP ---repo-updater
Frontend-- net/rpc -->indexed-search
indexed-search[indexed-search/zoekt]-- HTTP -->Frontend
repo-updater-- HTTP -->gitserver
react[React App]-- Graphql -->Frontend
react[React App]-- Sourcegraph extensions -->Frontend
browser_extensions[Browser Extensions]-- Graphql -->Frontend
browser_extensions[Browser Extensions]-- Sourcegraph extensions -->Frontend
```


@ -1,109 +0,0 @@
# Sourcegraph Architecture Overview
This is a high level overview of our architecture at Sourcegraph so you can understand how our services fit together.
![Sourcegraph architecture](img/architecture.svg)
**Note**: Non-sighted users can view a [text-representation of this diagram](architecture-mermaid.md).
<!--
Updating the architecture image
TODO: Automate this or replace mermaidjs diagrams
TLDR: Get @ryan-blunden to render a new svg after making changes to architecture.mermaid.
After changing architecture.mermaid, render the new diagram at https://mermaidjs.github.io/mermaid-live-editor/, set "theme" to be "neutral" in the config textarea, then download and replace img/architecture.svg. But there's one more step.
if you try rendering the downloaded SVG as is, the text is cut off in most boxes. This is because the downloaded SVG is missing font styles that were present in the live editor page.
To fix, open the new architecture.svg, then add the following to the first class (`#mermaid-numbers .label`).
font-size: 14px;
font-variant: tabular-nums;
line-height: 1.5;
Save architecture.svg, view architecture.md and the labels should now render correctly.
-->
## Services
Here are the services that compose Sourcegraph.
### frontend ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/frontend))
The frontend serves our [web app](web_app.md) and hosts our [GraphQL API](../api/graphql/index.md).
Application data is stored in our Postgresql database.
Session data is stored in Redis.
#### Scaling
Typically there are multiple replicas running in production to scale with load.
frontend tends to use a large amount of memory. For example our search architecture does a scatter and gather amongst the search backends in the frontend. The gathering of results can result in a lot of memory usage, even though the final result set returned to the user is much smaller. There are a few more examples of these since our frontend has a monolithic architecture. Additionally we haven't optimized for memory usage since it hasn't caused us issues in production since we can just scale it out.
### github-proxy ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/github-proxy))
Proxies all requests to github.com to keep track of rate limits and prevent triggering abuse mechanisms.
There is only one replica running in production. However, we can have multiple replicas to increase our rate limits (rate limit is per IP).
### gitserver ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/gitserver))
Mirrors repositories from their code host. All other Sourcegraph services talk to gitserver when they need data from git. Requests for fetch operations, however, should go through repo-updater.
#### Scaling
gitserver's memory usage consists of short lived git subprocesses.
This is an IO and compute heavy service since most Sourcegraph requests will trigger 1 or more git commands. As such we shard requests for a repo to a specific replica. This allows us to horizontally scale out the service.
The service is stateful (maintaining git clones). However, it only contains data mirrored from upstream code hosts.
### Sourcegraph extensions
[Sourcegraph extensions](../extensions/index.md) add features to Sourcegraph, including language support. Many extensions rely, in turn, on language servers (implementing the [Language Server Protocol](https://microsoft.github.io/language-server-protocol/)) to provide code intelligence (hover tooltips, jump to definition, find references).
### query-runner ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/query-runner))
Periodically runs saved searches and sends notification emails. Only one replica should be running.
### repo-updater ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/repo-updater))
Repo-updater (which may get renamed since it does more than that) tracks the state of repos, and is responsible for automatically scheduling updates ("git fetch" runs) using gitserver. Other apps which desire updates or fetches should be telling repo-updater, rather than using gitserver directly, so repo-updater can take their changes into account. Only one replica should be running.
### searcher ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/searcher))
Provides on-demand search for repositories. It scans through a git archive fetched from gitserver to find results.
This service should be scaled up the more on-demand searches that need to be done at once. For a search the frontend will scatter the search for each repo@commit across the replicas. The frontend will then gather the results. Like gitserver this is an IO and compute bound service. However, its state is a cache which can be lost at anytime.
### indexed-search/zoekt ([code](https://github.com/sourcegraph/zoekt))
Provides search results for repositories that have been indexed.
This service can only have one replica. Typically large customers provision a large node for it since it is memory and CPU heavy. Note: We could shard across multiple replicas to scale out. However, we haven't had a customer were this is necessary yet so haven't written the code for it yet.
We forked [zoekt](https://github.com/google/zoekt) to add some Sourcegraph specific integrations. See our [fork's README](https://github.com/sourcegraph/zoekt/blob/master/README.md) for details.
### symbols ([code](https://github.com/sourcegraph/sourcegraph/tree/master/cmd/symbols))
Indexes symbols in repositories using Ctags. Similar in architecture to searcher, except over ctags output.
### syntect ([code](https://github.com/sourcegraph/syntect_server))
Syntect is a Rust service that is responsible for syntax highlighting.
Horizontally scalable, but typically only one replica is necessary.
### Browser extensions ([code](https://github.com/sourcegraph/sourcegraph/tree/master/browser) | [docs](https://docs.sourcegraph.com/integration/browser_extension))
We publish browser extensions for Chrome, Firefox, and Safari, that provide code intelligence (hover tooltips, jump to definition, find references) when browsing code on code hosts. By default it works for open-source code, but it also works for private code if your company has a Sourcegraph deployment.
It uses GraphQL APIs exposed by the frontend to fetch data.
### Editor extensions ([docs](https://docs.sourcegraph.com/integration/editor))
Our editor extensions provide lightweight hooks into Sourcegraph, currently.


@ -0,0 +1,44 @@
# Sourcegraph Architecture Overview
This is a high-level overview of Sourcegraph's architecture so you can understand how our systems fit together.
## Clients
We maintain multiple Sourcegraph clients:
- [Web application](https://github.com/sourcegraph/sourcegraph/tree/master/web)
- [Browser extensions](https://github.com/sourcegraph/sourcegraph/tree/master/browser)
- [src-cli](https://github.com/sourcegraph/src-cli)
- [Editor integrations](https://docs.sourcegraph.com/integration/editor)
  - [Visual Studio Code](https://github.com/sourcegraph/sourcegraph-vscode)
  - [Atom](https://github.com/sourcegraph/sourcegraph-atom)
  - [IntelliJ](https://github.com/sourcegraph/sourcegraph-jetbrains)
  - [Sublime](https://github.com/sourcegraph/sourcegraph-sublime)
These clients generally communicate with a Sourcegraph instance (either https://sourcegraph.com or a private customer instance) through our [GraphQL API](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/frontend/graphqlbackend/schema.graphql). There are also a small number of REST endpoints for specific use-cases.
## Services
Our backend is composed of multiple services:
- Most are Go services found in the [cmd](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/cmd) folder.
- [Syntect server](https://sourcegraph.com/github.com/sourcegraph/syntect_server) is our syntax highlighting service written in Rust. It is not horizontally scalable so only 1 replica is supported.
- [LSIF server](https://github.com/sourcegraph/sourcegraph/tree/master/lsif/server) provides precise code intelligence based on the LSIF data format. It is written in TypeScript.
- [zoekt-indexserver](https://sourcegraph.com/github.com/sourcegraph/zoekt/-/tree/cmd/zoekt-sourcegraph-indexserver) and [zoekt-webserver](https://sourcegraph.com/github.com/sourcegraph/zoekt/-/tree/cmd/zoekt-webserver) provide indexed search. They are written in Go.
## Infrastructure
- [sourcegraph/infrastructure](https://sourcegraph.com/github.com/sourcegraph/infrastructure) contains Terraform configurations for Cloudflare DNS and Site 24x7 monitoring, as well as build steps for various Docker images. Only private Docker images should be added here; public ones belong in the main repository.
- [sourcegraph/deploy-sourcegraph](https://github.com/sourcegraph/deploy-sourcegraph) contains YAML that can be used by customers to deploy Sourcegraph to a Kubernetes cluster.
- [sourcegraph/deploy-sourcegraph-docker](https://github.com/sourcegraph/deploy-sourcegraph-docker) contains a pure-Docker cluster deployment reference that some one-off customers use to deploy Sourcegraph to a non-Kubernetes cluster.
- [sourcegraph/deploy-sourcegraph-dot-com](https://github.com/sourcegraph/deploy-sourcegraph-dot-com) is a fork of the above that is used to deploy to the Kubernetes cluster that serves https://sourcegraph.com.
## Guides
Here are some guides to help you understand how multiple systems fit together:
- [Life of a search query](life-of-a-search-query.md)
- Future topics we will cover here:
  - Life of a repository (i.e. how does code end up on gitserver?)
  - Sourcegraph extension architecture
  - Web app and browser extension architecture


@ -0,0 +1,32 @@
# Life of a search query
This document describes how our backend systems serve search results to clients. There are multiple kinds of searches (e.g. text, repository, file, symbol, diff, commit), but this document will focus on text searches.
## Clients
There are a few ways to perform a search with Sourcegraph:
1. Typing a query into the search bar of the Sourcegraph web application.
2. Typing a query into your browser's location bar after configuring a [browser search engine shortcut](https://docs.sourcegraph.com/integration/browser_search_engine).
3. Using the [src CLI command](https://github.com/sourcegraph/src-cli).
In all cases, clients use the [search query](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+%5Cbsearch%5C%28+file:schema.graphql) in our GraphQL API that is exposed by our [frontend](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/cmd/frontend) service.
## Frontend
The frontend implements the GraphQL search resolver [here](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+"func+%28r+*schemaResolver%29+Search%28").
First, the frontend [resolves which repositories need to be searched](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+%22func+%28r+*searchResolver%29+resolveRepositories%28%22). It parses the query for any repository filters and then queries the database for the list of repositories that match those filters. If no filters are provided, then all repositories are searched, as long as the number of repositories doesn't exceed the configured limit. Private instances default to an unlimited number of repositories, but sourcegraph.com has a smaller configured limit (`"maxReposToSearch": 400` at the time of writing, but you can check the [site config for the current value](https://sourcegraph.com/site-admin/configuration)) because it isn't cost-effective for us to search/index all open source code on GitHub.
Next, the frontend [determines which repository@revision combinations are indexed by zoekt](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+"zoektIndexedRepos%28"+file:textsearch%5C.go) by [consulting an in-memory cache that is kept up-to-date with regular asynchronous polling](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+"%29+start%28"+file:text.go). It concurrently [queries zoekt](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+%22zoektSearchHEAD%28%22+file:textsearch%5C.go) for indexed repositories and [queries searcher](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+"+searchFilesInRepo%28"+file:textsearch%5C.go) for non-indexed repositories.
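
The split between the two backends can be sketched as below. This is an illustration under stated assumptions, not the frontend's implementation: the `indexed` map stands in for the in-memory cache, and `backend`, `partitionRepos`, and `searchBoth` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// partitionRepos splits repos into those that the (hypothetical) indexed
// set says zoekt can serve and those that must go to searcher.
func partitionRepos(repos []string, indexed map[string]bool) (zoektRepos, searcherRepos []string) {
	for _, r := range repos {
		if indexed[r] {
			zoektRepos = append(zoektRepos, r)
		} else {
			searcherRepos = append(searcherRepos, r)
		}
	}
	return zoektRepos, searcherRepos
}

// backend stands in for a call to zoekt or to searcher.
type backend func(repos []string) []string

// searchBoth queries both backends concurrently and concatenates their
// results, indexed results first.
func searchBoth(repos []string, indexed map[string]bool, zoekt, searcher backend) []string {
	zoektRepos, searcherRepos := partitionRepos(repos, indexed)

	var wg sync.WaitGroup
	var zoektResults, searcherResults []string
	wg.Add(2)
	// Each goroutine writes only its own variable, and both are read
	// only after Wait returns, so no extra locking is needed.
	go func() { defer wg.Done(); zoektResults = zoekt(zoektRepos) }()
	go func() { defer wg.Done(); searcherResults = searcher(searcherRepos) }()
	wg.Wait()

	return append(zoektResults, searcherResults...)
}

func main() {
	indexed := map[string]bool{"repo/a": true}
	tag := func(name string) backend {
		return func(repos []string) []string {
			var out []string
			for _, r := range repos {
				out = append(out, name+":"+r)
			}
			return out
		}
	}
	repos := []string{"repo/a", "repo/b"}
	fmt.Println(searchBoth(repos, indexed, tag("zoekt"), tag("searcher")))
}
```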
## Zoekt (indexed search)
zoekt-webserver [serves search requests](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/zoekt%24+"serveSearchErr%28") by [iterating through matches in the index](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/zoekt%24+"func+%28d+*indexData%29+Search"). It watches the index directory and loads/unloads index files as they come and go.
To decide what to index, [zoekt-sourcegraph-indexserver](https://sourcegraph.com/github.com/sourcegraph/zoekt/-/tree/cmd/zoekt-sourcegraph-indexserver) sends an [HTTP GET request to the frontend internal API](https://sourcegraph.com/search?q=r:github.com/sourcegraph/+%22/repos/list%22+-file:%28test%7Cspec%29+) for a list of repository names to index. For each repository, the indexserver compares what Sourcegraph wants indexed (commit, configuration, etc.) to what is already indexed on disk and starts an index job for anything that is missing. It only maintains an index of the latest commit on the default branch. It fetches git data by calling [another internal frontend API](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/zoekt%24+"func+tarballURL"), which [redirects to the archive on gitserver](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+"func+serveGitTar%28"+).
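
The comparison between what is wanted and what is on disk can be sketched like this. The struct `indexState`, its fields, and the function `missingIndexes` are hypothetical; the real indexserver may track different attributes.

```go
package main

import (
	"fmt"
	"sort"
)

// indexState is a hypothetical summary of what an index was built from.
type indexState struct {
	Commit     string
	ConfigHash string
}

// missingIndexes returns, in sorted order, the repositories whose wanted
// state is absent from disk or differs from what is on disk. Each returned
// name would get a new index job.
func missingIndexes(wanted, onDisk map[string]indexState) []string {
	var out []string
	for repo, want := range wanted {
		have, ok := onDisk[repo]
		if !ok || have != want {
			out = append(out, repo)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	wanted := map[string]indexState{
		"repo/a": {Commit: "c1", ConfigHash: "h1"},
		"repo/b": {Commit: "c9", ConfigHash: "h1"},
		"repo/c": {Commit: "c3", ConfigHash: "h1"},
	}
	onDisk := map[string]indexState{
		"repo/a": {Commit: "c1", ConfigHash: "h1"}, // up to date
		"repo/b": {Commit: "c2", ConfigHash: "h1"}, // stale commit
	}
	fmt.Println(missingIndexes(wanted, onDisk)) // prints [repo/b repo/c]
}
```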
## Searcher (non-indexed search)
Searcher is a horizontally scalable stateless service that performs non-indexed code search. Each request is a search on a single repository (the frontend searches multiple repositories by sending one concurrent request per repository). To serve a [search request](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:search/search.go+"s.search"), it first [fetches a zip archive of the repo at the desired commit](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/cmd/searcher/search/search.go#L190-199) [from gitserver](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+%22FetchTar:%22+file:searcher/main.go) and then [iterates through the files in the archive](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+%22func+concurrentFind%28%22) to perform the actual search.


@ -14,7 +14,7 @@ Sourcegraph development is open source at [github.com/sourcegraph/sourcegraph](h
### Technical
- [Quickstart](local_development.md)
- [Architecture](architecture.md)
- [Architecture](architecture/index.md)
- [Developing the web app](web_app.md)
- [Developing the GraphQL API](graphql_api.md)
- [Using PostgreSQL](postgresql.md)