Code AI platform with Code Search & Cody
Rok Novosel f77f0272cf
embeddings: searcher and indexer (#48017)
# High-level architecture overview
<img width="2231" alt="Screenshot 2023-02-24 at 15 13 59"
src="https://user-images.githubusercontent.com/6417322/221200130-53c1ff25-4c47-4532-885f-5c4f9dadb05e.png">


# Embeddings

Really quickly: embeddings are a semantic representation of text.
Embeddings are usually floating-point vectors with 256+ elements. The
neat thing about embeddings is that they allow us to search over textual
information using a semantic correlation between the query and the text,
not just syntactic (matching keywords).
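
To make "semantic correlation" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. Cosine similarity is one common choice; this description does not say which measure the service uses, and the 3-element vectors below are made up for illustration (real embeddings have hundreds of elements).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
query = [0.9, 0.1, 0.0]      # "how do access tokens work"
auth_doc = [0.8, 0.2, 0.1]   # a file about authentication
css_doc = [0.0, 0.1, 0.9]    # an unrelated stylesheet

print(cosine_similarity(query, auth_doc) > cosine_similarity(query, css_doc))
```

The query shares no keywords with either document here; the ranking comes purely from how close the vectors point.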

In this PR, we implemented an embedding service that will allow us to do
semantic code search over repositories in Sourcegraph. So, for example,
you'll be able to ask, "how do access tokens work in Sourcegraph", and
it will give you a list of the closest matching code files.

Additionally, we built a context detection service powered by
embeddings. In chat applications, it is important to know whether the
user's message requires additional context. We have to differentiate
between two cases: the user asks a general question about the codebase,
or the user references something in the existing conversation. In the
latter case, including the context would ruin the flow of the
conversation, and the chatbot would most likely return a confusing
answer. We determine whether a query _does not_ require additional
context using two approaches:

1. We check if the query contains well-known phrases that indicate
the user is referencing the existing conversation (e.g., "translate
previous", "change that").
1. We have a static dataset of messages that require context and a
dataset of messages that do not. We embed both datasets, and then, using
embedding similarity, we can check which set is more similar to the
query.
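
The two approaches can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function names are hypothetical, the phrase list contains only the two examples mentioned above, the query vector is passed in (the real service would obtain it from the embedding API), and "more similar" is interpreted here as a higher mean dot product against each dataset.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_similarity(query_vec, dataset_vecs):
    """Average dot-product similarity between the query and a dataset."""
    return sum(dot(query_vec, v) for v in dataset_vecs) / len(dataset_vecs)


# Only the two example phrases from the description; a real list would be longer.
NO_CONTEXT_PHRASES = ("translate previous", "change that")


def is_context_required(query, query_vec, needs_context_vecs, no_context_vecs):
    # Approach 1: well-known phrases that reference the existing conversation.
    lowered = query.lower()
    if any(phrase in lowered for phrase in NO_CONTEXT_PHRASES):
        return False
    # Approach 2: compare similarity against the two embedded datasets.
    return (mean_similarity(query_vec, needs_context_vecs)
            >= mean_similarity(query_vec, no_context_vecs))


# Hypothetical 2-dimensional embeddings of the two static datasets.
needs_context = [[1.0, 0.0], [0.9, 0.1]]
no_context = [[0.0, 1.0], [0.1, 0.9]]

print(is_context_required("how do access tokens work", [1.0, 0.0],
                          needs_context, no_context))
```

Running the phrase check first keeps the cheap test ahead of the comparison that depends on embeddings.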

## GraphQL API

We add four new resolvers to the GraphQL API:

```graphql
extend type Query {
  embeddingsSearch(repo: ID!, query: String!, codeResultsCount: Int!, textResultsCount: Int!): EmbeddingsSearchResults!
  isContextRequiredForQuery(query: String!): Boolean!
}
extend type Mutation {
  scheduleRepositoriesForEmbedding(repoNames: [String!]!): EmptyResponse!
  scheduleContextDetectionForEmbedding: EmptyResponse!
}
```

- `embeddingsSearch` performs embeddings search over the repo embeddings
and returns the specified number of results
- `isContextRequiredForQuery` determines whether the given query
requires additional context
- `scheduleRepositoriesForEmbedding` schedules a repo embedding
background job
- `scheduleContextDetectionForEmbedding` schedules a context detection
embedding background job that embeds a static dataset of messages.

## Repo embedding background job

Embedding a repository is implemented as a background job. The
background job handler receives the repository and the revision to
embed. The handler then gathers a list of files from gitserver and
excludes files >1MB in size. The list of files is split into code and
text files (.md, .txt), and we build a separate embedding index for
each. We split them because, in a combined index, the text files tended
to dominate the top results and left no room for code files. Once we
have the list of files, the procedure is as follows:

- For each file
  - Get file contents from gitserver
  - Check if the file is embeddable (is not autogenerated, is large
    enough, does not have long lines)
  - Split the file into embeddable chunks
  - Embed the file chunks using an external embedding service (defined in
    site config)
  - Add embedded file chunks and metadata to the index
    - Metadata contains the file name, the start line, and the end line of
      the chunk
- Once all files are processed, the index is marshaled into JSON and
  stored in Cloud storage (GCS, S3)
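
The file filtering and chunking steps can be sketched as below. All thresholds other than the 1MB size limit are hypothetical, as are the function names and the fixed-size line chunking; the description does not specify how chunks are cut or how "large enough" and "long lines" are defined. The autogenerated-file check and the call to the embedding service are left out. Line numbers in the metadata are assumed to be zero-based with an exclusive end.

```python
MAX_FILE_SIZE = 1_000_000          # files >1MB are excluded (exact byte count assumed)
TEXT_EXTENSIONS = (".md", ".txt")  # everything else is treated as code
MIN_CHARS = 32                     # hypothetical "large enough" threshold
MAX_LINE_LENGTH = 2048             # hypothetical "long line" threshold
CHUNK_LINES = 3                    # hypothetical chunk size, kept tiny for the demo


def split_code_and_text(files):
    """files: iterable of (name, size_in_bytes). Returns (code_names, text_names)."""
    code, text = [], []
    for name, size in files:
        if size > MAX_FILE_SIZE:
            continue
        target = text if name.lower().endswith(TEXT_EXTENSIONS) else code
        target.append(name)
    return code, text


def is_embeddable(content):
    """Reject files that are too small or contain very long lines."""
    if len(content.strip()) < MIN_CHARS:
        return False
    return all(len(line) <= MAX_LINE_LENGTH for line in content.splitlines())


def chunk_file(file_name, content, chunk_lines=CHUNK_LINES):
    """Split a file into fixed-size line chunks with their metadata."""
    lines = content.splitlines()
    chunks = []
    for start in range(0, len(lines), chunk_lines):
        end = min(start + chunk_lines, len(lines))
        chunks.append({
            "fileName": file_name,
            "startLine": start,
            "endLine": end,
            "text": "\n".join(lines[start:end]),
        })
    return chunks


code_files, text_files = split_code_and_text(
    [("main.go", 10), ("README.md", 10), ("big.bin", 2_000_000)]
)
print(code_files, text_files)
```

Each chunk's `text` would be sent to the embedding service, and the resulting vector stored in the code or text index alongside the `fileName`, `startLine`, and `endLine` metadata.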

### Site config changes

As mentioned, we use a configurable external embedding API that does the
actual text -> vector embedding part. Ideally, this allows us to swap
embedding providers in the future.

```json
"embeddings": {
  "description": "Configuration for embeddings service.",
  "type": "object",
  "required": ["enabled", "dimensions", "model", "accessToken", "url"],
  "properties": {
    "enabled": {
      "description": "Toggles whether embedding service is enabled.",
      "type": "boolean",
      "default": false
    },
    "dimensions": {
      "description": "The dimensionality of the embedding vectors.",
      "type": "integer",
      "minimum": 0
    },
    "model": {
      "description": "The model used for embedding.",
      "type": "string"
    },
    "accessToken": {
      "description": "The access token used to authenticate with the external embedding API service.",
      "type": "string"
    },
    "url": {
      "description": "The url to the external embedding API service.",
      "type": "string",
      "format": "uri"
    }
  }
}
```
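
To show what a filled-in `embeddings` object might look like, the sketch below checks a config against the required fields and types from the schema above. The values in `EXAMPLE_CONFIG` are placeholders (the model name, URL, token, and dimension count are all assumptions, not real provider settings), and the validator is a hand-rolled illustration that does not check the `uri` format.

```python
REQUIRED_FIELDS = {
    "enabled": bool,
    "dimensions": int,
    "model": str,
    "accessToken": str,
    "url": str,
}

# Placeholder values for illustration only.
EXAMPLE_CONFIG = {
    "enabled": True,
    "dimensions": 1536,
    "model": "example-embedding-model",
    "accessToken": "REDACTED",
    "url": "https://embeddings.example.com/v1",
}


def validate_embeddings_config(config):
    """Return a list of problems; an empty list means the config looks valid."""
    errors = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in config:
            errors.append(f"missing required field: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it for integer fields.
        wrong_type = (not isinstance(value, expected)
                      or (expected is int and isinstance(value, bool)))
        if wrong_type:
            errors.append(f"{key} must be of type {expected.__name__}")
    dims = config.get("dimensions")
    if isinstance(dims, int) and not isinstance(dims, bool) and dims < 0:
        errors.append("dimensions must be >= 0")
    return errors


print(validate_embeddings_config(EXAMPLE_CONFIG))
```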

## Repo embeddings search

The repo embeddings search is implemented in its own service. When a
user queries a repo using embeddings search, the following happens:

- Download the repo embedding index from blob storage and cache it in
memory
  - We cache up to 5 embedding indexes in memory
- Embed the query and use the embedded query vector to find similar code
and text file metadata in the embedding index
- Query gitserver for the actual file contents
- Return the results
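
The caching and lookup steps can be sketched as follows. The least-recently-used eviction policy, the dot-product ranking, and all names here are assumptions; the description only states that up to 5 indexes are cached in memory. The `download` callable stands in for the blob storage fetch.

```python
from collections import OrderedDict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class IndexCache:
    """Keeps at most max_entries repo indexes in memory (LRU eviction assumed)."""

    def __init__(self, download, max_entries=5):
        self._download = download
        self._max_entries = max_entries
        self._entries = OrderedDict()

    def get(self, repo):
        if repo in self._entries:
            self._entries.move_to_end(repo)  # mark as most recently used
            return self._entries[repo]
        index = self._download(repo)
        self._entries[repo] = index
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
        return index


def top_k(query_vec, index, k):
    """index: list of (metadata, vector) rows. Returns metadata of the k best rows."""
    ranked = sorted(index, key=lambda row: dot(query_vec, row[1]), reverse=True)
    return [meta for meta, _ in ranked[:k]]


downloads = []


def fake_download(repo):
    # Stand-in for fetching the index from blob storage.
    downloads.append(repo)
    return [
        ({"fileName": repo + "/a.go"}, [1.0, 0.0]),
        ({"fileName": repo + "/b.go"}, [0.0, 1.0]),
    ]


cache = IndexCache(fake_download, max_entries=2)
print(top_k([0.1, 0.9], cache.get("r1"), 1))
```

The returned metadata (file name and line range) is what would then be used to query gitserver for the actual file contents.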

## Interesting files

- [Similarity
search](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-102cc83520004eb0e2795e49bc435c5142ca555189b1db3a52bbf1ffb82fa3c6)
- [Repo embedding job
handler](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-c345f373f426398beb4b9cd5852ba862a2718687882db2a8b2d9c7fbb5f1dc52)
- [External embedding API
client](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-ad1e7956f518e4bcaee17dd9e7ac04a5e090c00d970fcd273919e887e1d2cf8f)
- [Embedding a
repo](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-1f35118727128095b7816791b6f0a2e0e060cddee43d25102859b8159465585c)
- [Embeddings searcher
service](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-5b20f3e7ef87041daeeaef98b58ebf7388519cedcdfc359dc5e6d4e0b021472e)
- [Embeddings
search](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-79f95b9cc3f1ef39c1a0b88015bd9cd6c19c30a8d4c147409f1b8e8cd9462ea1)
- [Repo embedding index cache
management](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-8a41f7dec31054889dbf86e97c52223d5636b4d408c6b375bcfc09160a8b70f8)
- [GraphQL
resolvers](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-9b30a0b5efcb63e2f4611b99ab137fbe09629a769a4f30d10a1b2da41a01d21f)


## Test plan

- Start by filling out the `embeddings` object in the site config (let
me know if you need an API key)
- Start the embeddings service using `sg start embeddings`
- Go to the `/api/console` page and schedule a repo embedding job and a
context detection embedding job:

```graphql
mutation {
  scheduleRepositoriesForEmbedding(repoNames: ["github.com/sourcegraph/handbook"]) {
    __typename
  }
  scheduleContextDetectionForEmbedding {
    __typename
  }
}
```

- Once both are finished, you should be able to query the repo embedding
index and determine whether context is needed for a given query:

```graphql
query {
  isContextRequiredForQuery(query: "how do access tokens work")
  embeddingsSearch(
    repo: "UmVwb3NpdG9yeToy", # github.com/sourcegraph/handbook GQL ID
    query: "how do access tokens work", 
    codeResultsCount: 5,
    textResultsCount: 5) {
    codeResults {
      fileName
      content
    }
    textResults {
      fileName
      content
    }
  }
}
```
2023-03-01 10:50:12 +01:00
.aspect bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
.buildkite bazel: introduce build files for Go (#46770) 2023-01-23 14:00:01 +01:00
.github Fix paths not absolute in test.CODEOWNERS (#48403) 2023-02-28 21:54:49 -05:00
.vscode vscode settings: fix jest extension (#47217) 2023-01-31 09:56:43 -08:00
client web: storm search page implementation (#48262) 2023-03-01 00:43:38 -08:00
cmd embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
dev embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
doc embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
docker-images Highlighting: add tree-sitter support for the most common programming languages (#47571) 2023-02-27 09:34:34 +01:00
enterprise embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
internal embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
lib Executor Job Specific Tokens (#46792) 2023-02-28 18:40:22 +00:00
migrations embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
monitoring alert: mention shard merging in alert's next steps (#47895) 2023-02-24 10:07:23 +01:00
schema embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
third_party bazel: update buildfiles (#47744) 2023-02-16 16:32:59 +01:00
third-party-licenses Consolidate dependencies: remove neelance/parallel (#48159) 2023-02-24 11:24:46 -07:00
ui/assets bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
wolfi-images Add Wolfi base images (#47034) 2023-01-31 12:09:09 +00:00
wolfi-packages Add Wolfi base images (#47034) 2023-01-31 12:09:09 +00:00
.bazelignore bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
.bazeliskrc bazel: add .bazeliskrc with Aspect-CLI (#48411) 2023-03-01 01:27:14 -08:00
.bazelrc Migrate to autogold/v2 (needed by Bazel) (#47891) 2023-02-21 10:37:13 +01:00
.bazelversion build: upgrade bazel to 6.0.0 (#47049) 2023-01-29 17:37:55 +01:00
.browserslistrc web: migrate from yarn to pnpm (#46143) 2023-01-11 19:50:09 -08:00
.dockerignore web: migrate from yarn to pnpm (#46143) 2023-01-11 19:50:09 -08:00
.editorconfig chore: Add .lua to editorconfig. (#44267) 2022-11-11 15:25:32 +08:00
.eslintignore [experiment] Merge SvelteKit prototype into main (#47238) 2023-02-13 17:53:23 +01:00
.eslintrc.js web: remove the remaining uses of useHistory() (#47533) 2023-02-13 03:57:23 -08:00
.gitattributes chore: Fix pattern in .gitattributes to match mock files. (#44331) 2022-11-14 10:45:28 +08:00
.gitignore Sourcegraph App (single-binary branch) (#46547) 2023-01-19 17:35:39 -07:00
.golangci-warn.yml internal/resources-report: remove tool (#46941) 2023-01-25 10:34:59 -08:00
.golangci.yml internal/resources-report: remove tool (#46941) 2023-01-25 10:34:59 -08:00
.graphqlrc.yml Support multiple GraphQL schema files (#20077) 2021-04-19 14:35:49 +02:00
.hadolint.yaml bump comby version to 1.7.1 (#35830) 2022-05-20 20:12:01 -07:00
.mailmap mailmap: update replacements for Joe (#29614) 2022-01-12 10:56:56 +08:00
.mocharc.js bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
.npmrc web: fix pnpm-lock issue (#47478) 2023-02-09 22:04:31 -08:00
.percy.yml Update browser extention installation detection logic on web (#32449) 2022-03-14 23:29:39 +06:00
.prettierignore rework plugin structure and implement frontside blogpost (#46883) 2023-02-15 11:49:51 +02:00
.stylelintignore rework plugin structure and implement frontside blogpost (#46883) 2023-02-15 11:49:51 +02:00
.stylelintrc.json web: drop bootstrap depenedency (#41401) 2022-09-07 03:11:26 -07:00
.tool-versions Bump rust version (#48224) 2023-02-25 16:35:55 +00:00
.trivyignore ci: ignore benign CVE-2021-43816 in prometheus (#31069) 2022-02-11 16:49:10 +00:00
babel.config.jest.js bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
babel.config.js bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
BUILD.bazel bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
CHANGELOG.md blob: make selection-driven nav default (#48066) 2023-02-28 13:00:27 +02:00
CONTRIBUTING.md Docs: Fix docs page link in main CONTRIBUTING.md (#45160) 2022-12-05 14:57:45 +01:00
deps.bzl Consolidate dependencies: remove neelance/parallel (#48159) 2023-02-24 11:24:46 -07:00
doc.go Publish Sourcegraph as open source 🚀 2018-09-30 23:13:36 -07:00
flake.lock nix: use go1.20 (#47541) 2023-02-13 12:19:12 +02:00
flake.nix nix: use go1.20 (#47541) 2023-02-13 12:19:12 +02:00
gen.go chore: Update go-mockgen (#44305) 2022-11-11 19:24:00 +00:00
go.mod embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
go.sum embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
graphql-schema-linter.config.js Support multiple GraphQL schema files (#20077) 2021-04-19 14:35:49 +02:00
gulpfile.js web: drop legacy GraphQL schema generator (#45945) 2022-12-25 18:10:20 -08:00
jest.config.base.js bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
jest.config.js tests: use glob for jest projects field (#29681) 2022-01-13 01:11:52 -08:00
jest.snapshot-resolver.js bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
LICENSE update licensing language (#25620) 2021-10-04 15:40:59 +01:00
LICENSE.apache Move all client code into client/ folder (#14480) 2020-10-07 22:23:53 +02:00
LICENSE.enterprise clarify license (#2543) 2019-03-03 16:39:46 +08:00
lighthouserc.js web: migrate from yarn to pnpm (#46143) 2023-01-11 19:50:09 -08:00
mockgen.temp.yaml embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
mockgen.test.yaml Batch Changes support for Azure DevOps (#47913) 2023-02-27 13:24:36 -05:00
mockgen.yaml mocks: Reorganize mock definitions into multiple files (#36967) 2022-06-27 20:59:16 +00:00
package.json web: storm search page implementation (#48262) 2023-03-01 00:43:38 -08:00
pnpm-lock.yaml bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00
pnpm-workspace.yaml web: sync TS project refenreces (#46407) 2023-01-16 18:55:10 -08:00
postcss.config.js extensibility: add featured extensions to registry (#21665) 2021-06-10 13:55:20 -04:00
prettier.config.js Publish Sourcegraph as open source 🚀 2018-09-30 23:13:36 -07:00
README.md Improve Markdown rendering (#47074) 2023-01-30 13:36:56 -08:00
renovate.json chore: add test plans to bot and release tool PRs (#31351) 2022-02-22 07:53:25 -08:00
SECURITY.md consolidate security policy (#7906) 2020-01-21 10:03:11 -08:00
service-catalog.yaml lib/servicecatalog: init to distribute catalog (#46999) 2023-01-26 17:22:27 -08:00
sg.config.yaml embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
shell.nix nix: fix nodejs version used by pnpm (#47680) 2023-02-15 21:33:31 +02:00
svgo.config.js Performance: Optimize static SVG assets with SVGO (#26285) 2021-10-27 15:27:36 +01:00
tsconfig.all.json web: fix pnpm-lock issue (#47478) 2023-02-09 22:04:31 -08:00
tsconfig.base.json web: fix pnpm-lock issue (#47478) 2023-02-09 22:04:31 -08:00
tsconfig.eslint.json web: fix pnpm-lock issue (#47478) 2023-02-09 22:04:31 -08:00
WORKSPACE bazel: add bazel build,tests for client/* (#46193) 2023-02-28 20:46:03 -08:00

Sourcegraph 4.0

Docs • Contributing • Twitter



Understand, fix, and automate across your codebase with Sourcegraph's code intelligence platform

 


4.0 Features

  • Understand usage and search structure with high-level aggregations of search results
  • A faster, simpler search experience
  • Configure precise code navigation for 9 languages (Ruby, Rust, Go, Java, Scala, Kotlin, Python, TypeScript, JavaScript) in a matter of minutes with auto-indexing
  • Your favorite extensions are now available by default
  • Quickly access answers within your codebase with a revamped reference panel

🏗️ High-leverage ways to improve your entire codebase

  • Make changes across all of your codebase at enterprise scale with server-side Batch Changes (beta)
    • Run large-scale or resource-intensive batch changes without clogging your local machine
    • Run large batch changes quickly by distributing them across an autoscaled pool of compute instances
    • Get a better debugging experience with the streaming of logs directly into Sourcegraph.

☁️ Dedicated Sourcegraph Cloud instances for enterprise

  • Sourcegraph Cloud now offers dedicated, single-tenant instances of Sourcegraph

📈 Advanced admin capabilities

  • Save time upgrading to Sourcegraph 4.0 with multi-version upgrades
  • View usage and measure the value of our platform with new and enhanced in-product analytics
  • Uncover developer time saved using Browser and IDE extensions
  • Easily export traces using OpenTelemetry
  • Quickly see the status on your repository and permissions syncing
  • Measure precise code navigation coverage with an enhanced analytics dashboard

Deploy Sourcegraph

Self-hosted

Local machine

Development

Refer to the Developing Sourcegraph guide to get started.

Documentation

The doc directory has additional documentation for developing and understanding Sourcegraph:

License

This repository contains both OSS-licensed and non-OSS-licensed files. We maintain one repository rather than two separate repositories mainly for development convenience.

All files in the enterprise and client/web/src/enterprise directories fall under LICENSE.enterprise.

The remaining files fall under the Apache 2 license. Sourcegraph OSS is built only from the Apache-licensed files in this repository.