From 71de68e851a567cc8d171ea5d7e864387ac72eb3 Mon Sep 17 00:00:00 2001
From: Chris Wendt
Date: Fri, 25 Feb 2022 18:28:28 -0700
Subject: [PATCH] Update symbols docs (#31600)

---
 cmd/symbols/README.md                         |  2 +-
 doc/admin/how-to/monorepo-issues.md           |  7 +------
 .../explanations/features.md                  | 18 ++++++++++++++++
 .../search_based_code_intelligence.md         | 21 +++++++++++++++++++
 doc/code_search/explanations/features.md      |  2 +-
 5 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/cmd/symbols/README.md b/cmd/symbols/README.md
index fe0a26db966..5b917df3e2f 100644
--- a/cmd/symbols/README.md
+++ b/cmd/symbols/README.md
@@ -6,4 +6,4 @@ The ctags output is stored in SQLite files on disk (one per repository@commit).

 It is used by [basic-code-intel](https://github.com/sourcegraph/sourcegraph-basic-code-intel) to provide the jump-to-definition feature.

-It supports regex queries, with queries of the form `^foo$` optimized to perform an index lookup (basic-code-intel takes advantage of this).
+It supports regex queries, with prefix queries (`^foo`) and exact match queries (`^foo$`) optimized to perform index lookups. The symbols sidebar and search-based code intel benefit from these optimizations.
diff --git a/doc/admin/how-to/monorepo-issues.md b/doc/admin/how-to/monorepo-issues.md
index ff3c18ce1bc..0a995e25d6e 100644
--- a/doc/admin/how-to/monorepo-issues.md
+++ b/doc/admin/how-to/monorepo-issues.md
@@ -18,12 +18,7 @@ The following bullets provide a general guidline to which service may require mo

 If you are regularly seeing the `Processing symbols is taking longer than expected. Try again in a while` warning in your sidebar, its likely that your symbols and/or gitserver services are underprovisioned and need more CPU/mem resources.

-The [symbols sidebar](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/repo/RepoRevisionSidebarSymbols.tsx?L42) is dependent on the symbols and gitserver services. Upon opening the symbols sidebar, a search query is made to the GraphQL API to retrieve the symbols associated with the current git commit. There are a few different query paths:
-
-- If indexed search is enabled and Zoekt has [an index for the commit](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:%5Einternal/search/symbol/symbol%5C.go+if+branch+:%3D+indexedSymbolsBranch%28&patternType=literal), then the query should be resolved quickly. The high commit frequency of monorepos reduces the likelihood that Zoekt will have an index for any given commit. Zoekt only has one index per branch, and usually only the default branch is indexed. Zoekt **eagerly** indexes the tip, whereas the symbols service **lazily** indexes whichever commit you're on.
-- If the symbols service has already indexed this commit (i.e. someone has visited the commit before) and it is [cached](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24%406f4d327+file:symbols+cache.Open&patternType=literal), then the query should be resolved quickly.
-- If the symbols service has already indexed a **different** commit in the same repository, then it will copy a previous index on disk, run [`git diff --name-status`](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+f:symbols+--name-status&patternType=regexp) to get the list of files that changed, run [`git archive files...`](https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:%5Einternal/gitserver/client%5C.go+func+%28c+*Client%29+Archive%28&patternType=literal) to fetch the file contents, run [ctags](https://github.com/universal-ctags/ctags#readme) on those files, and update the symbols index. The query should be resolved in a few seconds (e.g. 300ms on the Kubernetes repo with 4M LOC and 15K files).
-- If the symbols service has never seen this repository before, then it needs to process all symbols before being able to respond to the query. The query will likely time out and trigger the error in the screenshot. Processing all symbols can take minutes (e.g. 1m8s on the Kubernetes repo with 4M LOC and 15K files).
+The [symbols sidebar](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/blob/client/web/src/repo/RepoRevisionSidebarSymbols.tsx?L42) is dependent on the symbols and gitserver services. Upon opening the symbols sidebar, a search query is made to the GraphQL API to retrieve the symbols associated with the current git commit. You can read more about the [symbol search behavior and performance](../../code_intelligence/explanations/features.md#symbol-search-behavior-and-performance).

 To address this concern, allocate more resources to the symbols service (to provide more processing power for indexing operations) and allocate more resources to the gitserver (to provide for the extra load associated with responding to fetch requests from symbols, and speed up sending the large repo).

diff --git a/doc/code_intelligence/explanations/features.md b/doc/code_intelligence/explanations/features.md
index 5f9a3b0c2c2..c3c576d6ac0 100644
--- a/doc/code_intelligence/explanations/features.md
+++ b/doc/code_intelligence/explanations/features.md
@@ -51,3 +51,21 @@ We use [Ctags](https://github.com/universal-ctags/ctags) to index the symbols of
 We use [Ctags](https://github.com/universal-ctags/ctags) to index the symbols of a repository on-demand. These symbols are also used for the symbol sidebar, which categorizes declarations by type (variable, function, interface, etc). Clicking on a symbol in the sidebar jumps you to the line where it is defined.
+
+### Symbol search behavior and performance
+
+Here is the query path for symbol searches:
+
+- **Zoekt**: if [indexed search](../../admin/search.md#indexed-search) is enabled and the search is for the tip commit of an indexed branch, then Zoekt will service the query and it should respond quickly. Zoekt indexes the default branch (usually `master` or `main`) and can be configured for [multi-branch indexing](https://docs.sourcegraph.com/code_search/explanations/features#multi-branch-indexing-experimental). The high commit frequency of monorepos reduces the likelihood that Zoekt will be able to respond to symbol searches. Zoekt **eagerly** indexes by listening to repository updates, whereas the symbols service **lazily** indexes the commit being searched.
+- **Symbols service with an index for the commit**: if the symbols service has already indexed this commit (i.e. someone has visited the commit before) then the query should be resolved quickly. Indexes are deleted in LRU fashion to remain under the configured maximum disk usage which [defaults to 100GB](./search_based_code_intelligence.md#what-configuration-settings-can-i-apply).
+- **Symbols service with an index for a different commit**: if the symbols service has already indexed a **different** commit in the same repository, then it will make a copy of the previous index on disk then run [ctags](https://github.com/universal-ctags/ctags#readme) on the files that changed between the two commits and update the symbols in the new index. This process takes roughly 20 seconds on a monorepo with 40M LOC and 400K files.
+- **Symbols service without any indexes (cold start)**: if the symbols service has never seen this repository before, then it needs to run ctags on all symbols and construct the index from scratch. This process takes roughly 20 minutes on a monorepo with 40M LOC and 400K files.
+
+Once the symbols service has built an index for a commit, here's the query performance:
+
+- Exact matches `^foo$` are optimized to use an index
+- Prefix matches `^foo` are optimized to use an index
+- General regex queries `foo.*bar` need to scan every symbol
+- Path filtering `file:^cmd/` helps narrow the search space
+
+Search-based code intelligence uses exact matches `^foo$` and the symbols sidebar uses prefix matches on paths `file:^cmd/` to respond quickly.
diff --git a/doc/code_intelligence/explanations/search_based_code_intelligence.md b/doc/code_intelligence/explanations/search_based_code_intelligence.md
index 0fee568bb6f..f33555f8e5b 100644
--- a/doc/code_intelligence/explanations/search_based_code_intelligence.md
+++ b/doc/code_intelligence/explanations/search_based_code_intelligence.md
@@ -25,3 +25,24 @@ Are you using a language we don't support? [File a GitHub issue](https://github.
 Search-based code intelligence uses search-based heuristics, rather than parsing the code into an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) (AST). Incorrect results occur more often for tokens with common names (such as `Get`) than for tokens with more unique names simply because those tokens appear more often in the search index. If you require 100% confidence in accuracy for a definition or reference results for a symbol you hovered over we recommend utilizing precise code intelligence. Scenarios where you may still get search-based code intelligence results even with precision on are described in more detail in the [precise code intelligence docs](./precise_code_intelligence.md).
+
+## Why does it sometimes time out?
+
+The [symbol search performance](./features.md#symbol-search-behavior-and-performance) section describes query paths and performance.
+
+## What configuration settings can I apply?
+
+The symbols container recognizes these environment variables:
+
+- `CTAGS_COMMAND`: defaults to `universal-ctags`, ctags command (should point to universal-ctags executable compiled with JSON and seccomp support)
+- `CTAGS_PATTERN_LENGTH_LIMIT`: defaults to `250`, the maximum length of the patterns output by ctags
+- `LOG_CTAGS_ERRORS`: defaults to `false`, log ctags errors
+- `SANITY_CHECK`: defaults to `false`, check that go-sqlite3 works then exit 0 if it's ok or 1 if not
+- `CACHE_DIR`: defaults to `/tmp/symbols-cache`, directory in which to store cached symbols
+- `SYMBOLS_CACHE_SIZE_MB`: defaults to `100000`, maximum size of the disk cache (in megabytes)
+- `CTAGS_PROCESSES`: defaults to `strconv.Itoa(runtime.GOMAXPROCS(0))`, number of concurrent parser processes to run
+- `REQUEST_BUFFER_SIZE`: defaults to `8192`, maximum size of buffered parser request channel
+- `PROCESSING_TIMEOUT`: defaults to `2h`, maximum time to spend processing a repository
+- `MAX_TOTAL_PATHS_LENGTH`: defaults to `100000`, maximum sum of lengths of all paths in a single call to git archive
+
+The defaults come from [`config.go`](https://github.com/sourcegraph/sourcegraph/blob/eea895ae1a8acef08370a5cc6f24bdc7c66cb4ed/cmd/symbols/config.go#L42-L59).
diff --git a/doc/code_search/explanations/features.md b/doc/code_search/explanations/features.md
index 9a5d15abd6e..fa937b5b448 100644
--- a/doc/code_search/explanations/features.md
+++ b/doc/code_search/explanations/features.md
@@ -28,7 +28,7 @@ See our [query syntax](../reference/queries.md#diff-and-commit-searches-only) do

 ## Symbol search

-Searching for symbols makes it easier to find specific functions, variables, and more. Use the `type:symbol` filter to search for symbol results. Symbol results also appear in typeahead suggestions, so you can jump directly to symbols by name.
+Searching for symbols makes it easier to find specific functions, variables, and more. Use the `type:symbol` filter to search for symbol results. Symbol results also appear in typeahead suggestions, so you can jump directly to symbols by name. When on an [indexed](../../admin/search.md#indexed-search) commit it uses Zoekt, otherwise it uses the [symbols service](../../code_intelligence/explanations/features.md#symbol-search)

 ## Saved searches