sourcegraph/internal/database
Rok Novosel f77f0272cf
embeddings: searcher and indexer (#48017)
# High-level architecture overview
<img width="2231" alt="Screenshot 2023-02-24 at 15 13 59"
src="https://user-images.githubusercontent.com/6417322/221200130-53c1ff25-4c47-4532-885f-5c4f9dadb05e.png">


# Embeddings

Really quickly: embeddings are a semantic representation of text.
Embeddings are usually floating-point vectors with 256+ elements. The
neat thing about embeddings is that they allow us to search over textual
information using the semantic similarity between the query and the text,
not just syntactic matches (matching keywords).
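
To make "semantic similarity" concrete, here is a minimal sketch that ranks documents by cosine similarity to a query vector. Cosine similarity is an assumption here (the description above does not name the metric), and the file names and 3-element vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings". Real ones have hundreds of elements and
# come from a model, not from hand-picked numbers.
query = [0.9, 0.1, 0.0]
documents = {
    "auth/access_tokens.go": [0.8, 0.2, 0.1],  # hypothetical file
    "README.md": [0.1, 0.9, 0.3],              # hypothetical file
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
```

With these toy vectors, the access token file ranks first because its vector points in nearly the same direction as the query.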

In this PR, we implemented an embedding service that will allow us to do
semantic code search over repositories in Sourcegraph. So, for example,
you'll be able to ask, "how do access tokens work in Sourcegraph", and
it will give you a list of the closest matching code files.

Additionally, we built a context detection service powered by
embeddings. In chat applications, it is important to know whether the
user's message requires additional context. We have to differentiate
between two cases: the user asks a general question about the codebase,
or the user references something in the existing conversation. In the
latter case, including the context would ruin the flow of the
conversation, and the chatbot would most likely return a confusing
answer. We determine whether a query _does not_ require additional
context using two approaches:

1. We check if the query contains well-known phrases that indicate
the user is referencing the existing conversation (e.g., "translate
previous", "change that").
1. We have a static dataset of messages that require context and a
dataset of messages that do not. We embed both datasets, and then, using
embedding similarity, we can check which set is more similar to the
query.
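
The two checks can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: the phrase list contains only the two examples given above, similarity is taken to be cosine similarity, and "more similar" is interpreted as a higher mean similarity to the dataset. All function names are hypothetical.

```python
import math

# Only the two example phrases mentioned in the description; the real list
# is presumably longer.
NO_CONTEXT_PHRASES = ("translate previous", "change that")


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mean_similarity(query_vec, dataset_vecs):
    """Average similarity of the query to every embedded message in a dataset."""
    return sum(cosine_similarity(query_vec, v) for v in dataset_vecs) / len(dataset_vecs)


def is_context_required(query, query_vec, needs_context_vecs, no_context_vecs):
    # Approach 1: a well-known phrase means the user refers to the conversation.
    lowered = query.lower()
    if any(phrase in lowered for phrase in NO_CONTEXT_PHRASES):
        return False
    # Approach 2: pick whichever embedded dataset the query is closer to.
    return mean_similarity(query_vec, needs_context_vecs) >= mean_similarity(
        query_vec, no_context_vecs
    )


# Toy 2-dimensional embeddings standing in for the two static datasets.
needs_context = [[1.0, 0.0], [0.9, 0.1]]
no_context = [[0.0, 1.0], [0.1, 0.9]]
```

Running the cheap phrase check first means a query that matches a known phrase is decided without looking at its embedding at all.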

## GraphQL API

We add four new resolvers to the GraphQL API:

```graphql
extend type Query {
  embeddingsSearch(repo: ID!, query: String!, codeResultsCount: Int!, textResultsCount: Int!): EmbeddingsSearchResults!
  isContextRequiredForQuery(query: String!): Boolean!
}
extend type Mutation {
  scheduleRepositoriesForEmbedding(repoNames: [String!]!): EmptyResponse!
  scheduleContextDetectionForEmbedding: EmptyResponse!
}
```

- `embeddingsSearch` performs an embeddings search over the repo embeddings
and returns the specified number of results
- `isContextRequiredForQuery` determines whether the given query
requires additional context
- `scheduleRepositoriesForEmbedding` schedules a repo embedding
background job
- `scheduleContextDetectionForEmbedding` schedules a context detection
embedding background job that embeds a static dataset of messages

## Repo embedding background job

Embedding a repository is implemented as a background job. The
background job handler receives the repository and the revision to
embed. The handler then gathers a list of files from
gitserver and excludes files larger than 1MB. The list of files is split
into code and text files (.md, .txt), and we build a separate embedding
index for both. We split them because in a combined index, the text
files always tended to feature as top results and didn't leave any room
for code files. Once we have the list of files, the procedure is as
follows:

- For each file
  - Get file contents from gitserver
  - Check if the file is embeddable (is not autogenerated, is large
    enough, does not have long lines)
  - Split the file into embeddable chunks
  - Embed the file chunks using an external embedding service (defined in
    site config)
  - Add embedded file chunks and metadata to the index
    - Metadata contains the file name, the start line, and the end line of
      the chunk
- Once all files are processed, the index is marshaled into JSON and
stored in cloud storage (GCS, S3)
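
The procedure above can be sketched as a small pipeline. Everything beyond the steps themselves is an assumption made for illustration: the thresholds (`MIN_CHARS`, `MAX_LINE_CHARS`, `CHUNK_LINES`), the autogenerated-file heuristic, the fixed-size line chunking, the exact byte limit, and the `fake_embed` stand-in for the external embedding service. The size check is done on the contents here for simplicity, whereas the job filters on the file list.

```python
import json

MAX_FILE_BYTES = 1_000_000          # files larger than roughly 1MB are skipped
TEXT_EXTENSIONS = (".md", ".txt")   # text files go into a separate index
MIN_CHARS = 32                      # assumption: "large enough"
MAX_LINE_CHARS = 2048               # assumption: "does not have long lines"
CHUNK_LINES = 20                    # assumption: fixed-size line chunks


def is_embeddable(content):
    head = content[:200].lower()
    # Assumed heuristic, based on the Go convention for generated files.
    autogenerated = "code generated" in head and "do not edit" in head
    long_lines = any(len(line) > MAX_LINE_CHARS for line in content.splitlines())
    return not autogenerated and len(content) >= MIN_CHARS and not long_lines


def split_into_chunks(file_name, content):
    lines = content.splitlines()
    for start in range(0, len(lines), CHUNK_LINES):
        end = min(start + CHUNK_LINES, len(lines))
        yield {"fileName": file_name, "startLine": start, "endLine": end,
               "text": "\n".join(lines[start:end])}


def build_indexes(files, embed):
    """files maps file name to contents; embed maps a list of texts to vectors."""
    indexes = {kind: {"embeddings": [], "metadata": []} for kind in ("code", "text")}
    for name, content in files.items():
        if len(content.encode("utf-8")) > MAX_FILE_BYTES or not is_embeddable(content):
            continue
        index = indexes["text" if name.endswith(TEXT_EXTENSIONS) else "code"]
        chunks = list(split_into_chunks(name, content))
        for chunk, vector in zip(chunks, embed([c["text"] for c in chunks])):
            index["embeddings"].append(vector)
            index["metadata"].append(
                {key: chunk[key] for key in ("fileName", "startLine", "endLine")}
            )
    return indexes


def fake_embed(texts):
    # Stand-in for the external embedding API; returns 2-element vectors.
    return [[float(len(text)), float(text.count("\n"))] for text in texts]


files = {
    "main.go": "package main\n" * 30,
    "README.md": "# Title\n\nSome documentation text that is long enough.\n",
    "tiny.go": "x",  # too small, gets skipped
}
indexes = build_indexes(files, fake_embed)
serialized = json.dumps(indexes)  # what would be uploaded to blob storage
```

Only the metadata (file name and line range) is stored next to each vector, not the chunk text, so the serialized index stays small.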

### Site config changes

As mentioned, we use a configurable external embedding API that does the
actual text -> vector embedding part. Ideally, this allows us to swap
embedding providers in the future.

```json
"embeddings": {
  "description": "Configuration for embeddings service.",
  "type": "object",
  "required": ["enabled", "dimensions", "model", "accessToken", "url"],
  "properties": {
    "enabled": {
      "description": "Toggles whether embedding service is enabled.",
      "type": "boolean",
      "default": false
    },
    "dimensions": {
      "description": "The dimensionality of the embedding vectors.",
      "type": "integer",
      "minimum": 0
    },
    "model": {
      "description": "The model used for embedding.",
      "type": "string"
    },
    "accessToken": {
      "description": "The access token used to authenticate with the external embedding API service.",
      "type": "string"
    },
    "url": {
      "description": "The url to the external embedding API service.",
      "type": "string",
      "format": "uri"
    }
  }
}
```
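
As a quick illustration of what the schema above requires, the following sketch checks a candidate `embeddings` object by hand. It covers only the required properties, their types, and the `minimum` on `dimensions`, and it is not a full JSON Schema validator. The function name, model name, token, and URL are hypothetical.

```python
# Required properties and their types, taken from the schema above.
REQUIRED_PROPERTIES = {
    "enabled": bool,
    "dimensions": int,
    "model": str,
    "accessToken": str,
    "url": str,
}


def validate_embeddings_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, expected_type in REQUIRED_PROPERTIES.items():
        if key not in config:
            errors.append(f"missing required property: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly
        # where the schema asks for an integer.
        if not isinstance(value, expected_type) or (
            expected_type is int and isinstance(value, bool)
        ):
            errors.append(f"{key} must be of type {expected_type.__name__}")
        elif key == "dimensions" and value < 0:
            errors.append("dimensions must be >= 0")
    return errors


# Placeholder values only; none of these come from the PR.
example_config = {
    "enabled": True,
    "dimensions": 1536,
    "model": "example-embedding-model",
    "accessToken": "REDACTED",
    "url": "https://embeddings.example.com/v1",
}
```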

## Repo embeddings search

The repo embeddings search is implemented in its own service. When a
user queries a repo using embeddings search, the following happens:

- Download the repo embedding index from blob storage and cache it in
memory
  - We cache up to 5 embedding indexes in memory
- Embed the query and use the embedded query vector to find similar code
and text file metadata in the embedding index
- Query gitserver for the actual file contents
- Return the results
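
A rough sketch of the search path, under stated assumptions: the description says up to 5 indexes are cached but not how entries are evicted, so least-recently-used eviction is assumed here, and ranking uses cosine similarity. The `download` callable stands in for the blob storage client. All names are hypothetical.

```python
import math
from collections import OrderedDict

MAX_CACHED_INDEXES = 5  # from the description above


class IndexCache:
    """In-memory cache of repo embedding indexes (LRU eviction is assumed)."""

    def __init__(self, download, capacity=MAX_CACHED_INDEXES):
        self._download = download      # callable: repo name -> index
        self._capacity = capacity
        self._entries = OrderedDict()  # least recently used entry comes first

    def get(self, repo):
        if repo in self._entries:
            self._entries.move_to_end(repo)  # mark as most recently used
            return self._entries[repo]
        index = self._download(repo)
        self._entries[repo] = index
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return index


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def search(index, query_vec, n_results):
    """Return metadata of the n_results chunks most similar to the query."""
    scored = sorted(
        zip(index["embeddings"], index["metadata"]),
        key=lambda pair: cosine_similarity(query_vec, pair[0]),
        reverse=True,
    )
    return [metadata for _, metadata in scored[:n_results]]
```

The search returns only metadata. The file contents for each result would then be fetched from gitserver using the file name and line range.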

## Interesting files

- [Similarity
search](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-102cc83520004eb0e2795e49bc435c5142ca555189b1db3a52bbf1ffb82fa3c6)
- [Repo embedding job
handler](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-c345f373f426398beb4b9cd5852ba862a2718687882db2a8b2d9c7fbb5f1dc52)
- [External embedding api
client](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-ad1e7956f518e4bcaee17dd9e7ac04a5e090c00d970fcd273919e887e1d2cf8f)
- [Embedding a
repo](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-1f35118727128095b7816791b6f0a2e0e060cddee43d25102859b8159465585c)
- [Embeddings searcher
service](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-5b20f3e7ef87041daeeaef98b58ebf7388519cedcdfc359dc5e6d4e0b021472e)
- [Embeddings
search](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-79f95b9cc3f1ef39c1a0b88015bd9cd6c19c30a8d4c147409f1b8e8cd9462ea1)
- [Repo embedding index cache
management](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-8a41f7dec31054889dbf86e97c52223d5636b4d408c6b375bcfc09160a8b70f8)
- [GraphQL
resolvers](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-9b30a0b5efcb63e2f4611b99ab137fbe09629a769a4f30d10a1b2da41a01d21f)


## Test plan

- Start by filling out the `embeddings` object in the site config (let
me know if you need an API key)
- Start the embeddings service using `sg start embeddings`
- Go to the `/api/console` page and schedule a repo embedding job and a
context detection embedding job:

```graphql
mutation {
  scheduleRepositoriesForEmbedding(repoNames: ["github.com/sourcegraph/handbook"]) {
    __typename
  }
  scheduleContextDetectionForEmbedding {
    __typename
  }
}
```

- Once both are finished, you should be able to query the repo embedding
index and determine whether context is needed for a given query:

```graphql
query {
  isContextRequiredForQuery(query: "how do access tokens work")
  embeddingsSearch(
    repo: "UmVwb3NpdG9yeToy", # github.com/sourcegraph/handbook GQL ID
    query: "how do access tokens work", 
    codeResultsCount: 5,
    textResultsCount: 5) {
    codeResults {
      fileName
      content
    }
    textResults {
      fileName
      content
    }
  }
}
```
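
A note on the `repo` argument in the query above: the example value is base64, and decoding it gives `Repository:2`, so the ID will most likely differ on your instance. The sketch below shows the decoding. The helper names are hypothetical, and on a real instance it is safer to look the ID up through the GraphQL API than to construct it by hand.

```python
import base64


def marshal_repo_id(database_id):
    """Encode a numeric repo ID the way the example ID above is encoded."""
    return base64.b64encode(f"Repository:{database_id}".encode()).decode()


def unmarshal_repo_id(graphql_id):
    """Decode an ID like the one in the query into (kind, numeric ID)."""
    kind, _, raw_id = base64.b64decode(graphql_id).decode().partition(":")
    return kind, int(raw_id)


kind, repo_id = unmarshal_repo_id("UmVwb3NpdG9yeToy")
```
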
2023-03-01 10:50:12 +01:00
basestore basestore: rearrange keyed collection scanner type parameters (#47554) 2023-02-13 20:04:25 +00:00
batch Packages: expose listing of package repo information in graphql (#47105) 2023-02-02 18:54:03 +00:00
connections dbtest: Fix missing parameter (#48153) 2023-02-23 21:25:09 +00:00
dbcache bazel: introduce build files for Go (#46770) 2023-01-23 14:00:01 +01:00
dbconn internal/database: avoid potential performance issue in db driver (#48219) 2023-02-28 18:11:31 -08:00
dbtest Housekeeping: Rename variables to avoid collisions with packages (#47179) 2023-01-31 16:28:43 +01:00
dbutil bazel: introduce build files for Go (#46770) 2023-01-23 14:00:01 +01:00
fakedb teams: Adjust schema for better external user mapping (#47143) 2023-02-24 00:37:02 +01:00
locker bazel: introduce build files for Go (#46770) 2023-01-23 14:00:01 +01:00
migration Fix stitched migration (#48058) 2023-02-22 18:00:16 +00:00
postgresdsn Housekeeping: Rename variables to avoid collisions with packages (#47179) 2023-01-31 16:28:43 +01:00
access_requests_test.go Add "request access" experimental feature (#47741) 2023-02-27 13:25:44 +03:00
access_requests.go Add "request access" experimental feature (#47741) 2023-02-27 13:25:44 +03:00
access_tokens_test.go Add security events for access token deletions (#43680) 2022-11-03 09:54:27 +00:00
access_tokens.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
authenticator_test.go Standardize database encryption (#40050) 2022-08-08 20:29:09 +00:00
authenticator.go encryption: Lazily decrypt batch_changes_site_credentials (#40228) 2022-08-15 15:58:13 +00:00
authz.go admin-analytics: add user administration backend API endpoints (#39926) 2022-08-05 17:18:21 +06:00
bitbucket_project_permissions_test.go Move Log Execution out of workerutil (#46742) 2023-01-26 09:58:29 -07:00
bitbucket_project_permissions.go internal/database: Remove unused transact func (#47312) 2023-02-02 11:41:30 +00:00
BUILD.bazel bazel: fix //enterprise/internal/... (#47846) 2023-02-18 12:07:49 -05:00
CODENOTIFY chore: update Codenotify subscriptions for Joe (#46978) 2023-01-26 11:27:00 +00:00
conf_test.go internal/database: Scan redacted_contents from critical_and_site_config (#47079) 2023-01-30 16:16:43 +05:30
conf.go permissions-center: add pagination support for permission sync jobs store. (#47350) 2023-02-03 10:26:37 +04:00
database_test.go add db.WithTransact (#46796) 2023-01-23 16:23:24 -07:00
database.go Add "request access" experimental feature (#47741) 2023-02-27 13:25:44 +03:00
dbstore_db_test.go Remove dbtesting package (#28426) 2021-12-02 08:24:03 -07:00
doc.go db: rename internal/db package to internal/database (#17607) 2021-01-25 20:24:15 +04:00
encryption_tables.go database: support choice between ordered and unordered maps in keyed collection scanner (reducer) (#47029) 2023-01-27 17:49:20 +00:00
encryption_test.go Standardize database encryption (#40050) 2022-08-08 20:29:09 +00:00
encryption_utils.go Standardize database encryption (#40050) 2022-08-08 20:29:09 +00:00
encryption.go Standardize database encryption (#40050) 2022-08-08 20:29:09 +00:00
err_test.go all: use any instead of interface{} (#35102) 2022-05-09 10:59:39 +02:00
event_logs_test.go Make CreateUserAndSave return a user (#47315) 2023-02-03 09:17:51 +01:00
event_logs.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
executor_secret_access_logs_test.go codeintel: first implementation of auto-indexing secrets (#45580) 2022-12-15 22:32:16 +00:00
executor_secret_access_logs.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
executor_secrets_test.go Implement UI components for executor secrets (#43942) 2022-11-10 20:33:42 +01:00
executor_secrets.go Database: replace Transact() with WithTransact() on ExecutorSecretStore (#47070) 2023-01-30 10:24:09 -07:00
executors_test.go executors: Undo service-ification for now (#43001) 2022-10-14 19:00:57 +01:00
executors.go Executor Job Specific Tokens (#46792) 2023-02-28 18:40:22 +00:00
external_accounts_test.go rbac: assign roles when a user is created (#47406) 2023-02-15 16:47:23 +01:00
external_accounts.go rbac: create users and assign role in single transaction (#47700) 2023-02-17 15:07:46 +01:00
external_services_test.go Implement repo syncing for Azure DevOps (#46746) 2023-01-23 19:28:21 +00:00
external_services.go Implement repo syncing for Azure DevOps (#46746) 2023-01-23 19:28:21 +00:00
feature_flags_test.go Feature Flags: reduce DB roundtrips (#45747) 2022-12-15 16:23:21 -07:00
feature_flags.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
gen.go sg migration: Remove schemadoc (#35905) 2022-05-24 16:19:19 +00:00
gen.sh migration: Improve drift schema fetching (#37043) 2022-06-27 16:06:42 -05:00
github_app_helper.go internal/database/github_app_helper.go: Indent code (#43999) 2022-11-07 17:39:09 +05:30
gitserver_localclone_jobs_test.go logging(gitstart): migrate internal/database to lib/log (#36466) 2022-06-20 23:04:01 +00:00
gitserver_localclone_jobs.go dbconn: Modify query text to tag source, other metadata (#42588) 2022-10-13 12:34:37 +00:00
gitserver_repos_test.go repos: log corruption events per repo (#45667) 2023-01-09 17:24:32 +02:00
gitserver_repos.go Housekeeping: Rename variables to avoid collisions with packages (#47179) 2023-01-31 16:28:43 +01:00
global_state_test.go global state: Rewrite store (#38781) 2022-07-15 09:56:53 -05:00
global_state.go dbconn: Modify query text to tag source, other metadata (#42588) 2022-10-13 12:34:37 +00:00
helpers.go chore: clean up pagination related code (#47369) 2023-02-07 15:28:17 +00:00
main_test.go deps: upgrade github.com/sourcegraph/log (#41058) 2022-09-01 08:01:59 -07:00
mockerr.go SCIM: Implement user creation (#47573) 2023-02-15 20:21:28 +01:00
mocks_temp.go permissions-center: expose perms sync jobs filtering. (#48265) 2023-02-27 16:08:29 +01:00
namespace_permissions_test.go rbac: remove action field from namespace_permissions (#47500) 2023-02-16 01:04:09 +01:00
namespace_permissions.go rbac: remove action field from namespace_permissions (#47500) 2023-02-16 01:04:09 +01:00
namespaces_test.go logging(gitstart): migrate internal/database to lib/log (#36466) 2022-06-20 23:04:01 +00:00
namespaces.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
oauth_token_helper_test.go Add refreshable token interface for GitHub and GitLab (#42629) 2022-10-17 16:30:37 +02:00
oauth_token_helper.go Add refreshable token interface for GitHub and GitLab (#42629) 2022-10-17 16:30:37 +02:00
org_invitations_test.go logging(gitstart): migrate internal/database to lib/log (#36466) 2022-06-20 23:04:01 +00:00
org_invitations.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
org_members_db_test.go logging(gitstart): migrate internal/database to lib/log (#36466) 2022-06-20 23:04:01 +00:00
org_members.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
orgs_test.go debt: Remove NamespaceUserID and NamespaceOrgID from external services (#44992) 2022-12-05 12:36:02 +01:00
orgs.go Implement database stores for teams (#46936) 2023-01-27 19:24:15 +00:00
outbound_webhook_jobs_test.go database: add outbound webhook tables (#46007) 2023-01-17 16:08:40 -08:00
outbound_webhook_jobs.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
outbound_webhook_logs_test.go Housekeeping: Rename variables to avoid collisions with packages (#47179) 2023-01-31 16:28:43 +01:00
outbound_webhook_logs.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
outbound_webhooks_test.go database: add outbound webhook tables (#46007) 2023-01-17 16:08:40 -08:00
outbound_webhooks.go database: add outbound webhook tables (#46007) 2023-01-17 16:08:40 -08:00
permission_sync_code_host_state.go permissions-center: use PermissionSyncCodeHostState type from database package. (#47730) 2023-02-17 08:19:41 +01:00
permission_sync_jobs_test.go permissions-center: support sync job search by user name/display name. (#48352) 2023-02-28 14:53:28 +00:00
permission_sync_jobs.go permissions-center: support sync job search by user name/display name. (#48352) 2023-02-28 14:53:28 +00:00
permissions_test.go rbac: make pagination optional for. permissions (#48118) 2023-02-24 13:37:07 +01:00
permissions.go rbac: make pagination optional for. permissions (#48118) 2023-02-24 13:37:07 +01:00
phabricator_test.go Add tests for our Phabricator store (#41570) 2022-09-19 11:26:37 +02:00
phabricator.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
redis_key_value_test.go redispool: use postgres for redispool.Store in App (#47188) 2023-02-03 12:59:05 +00:00
redis_key_value.go redispool: use postgres for redispool.Store in App (#47188) 2023-02-03 12:59:05 +00:00
repo_kvps_test.go Search backend: add support for arbitrary repo key-value pairs (#40112) 2022-08-11 15:17:33 +00:00
repo_kvps.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
repo_statistics_test.go repos: Add corrupted to repos statistics (#46410) 2023-01-16 16:36:23 +02:00
repo_statistics.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
repos_perm_test.go [feat] use unified permissions in authzQuery (#48263) 2023-02-28 09:52:17 +01:00
repos_perm.go [feat] use unified permissions in authzQuery (#48263) 2023-02-28 09:52:17 +01:00
repos_test.go graphql: add var to filter repos by corruption (#46414) 2023-01-16 20:16:43 +02:00
repos.go Improve list user permissions func on perms store (#48339) 2023-02-28 14:04:57 +00:00
role_permissions_test.go rbac: always upsert role assignment and permission assignment (#47780) 2023-02-17 13:07:47 +01:00
role_permissions.go rbac: always upsert role assignment and permission assignment (#47780) 2023-02-17 13:07:47 +01:00
roles_test.go rbac: create users and assign role in single transaction (#47700) 2023-02-17 15:07:46 +01:00
roles.go rbac: remove deleted_at from roles column (#47411) 2023-02-09 04:24:39 +01:00
saved_searches_test.go logging(gitstart): migrate internal/database to lib/log (#36466) 2022-06-20 23:04:01 +00:00
saved_searches.go permissions-center: add pagination support for permission sync jobs store. (#47350) 2023-02-03 10:26:37 +04:00
schema.codeinsights.json insights: allow to save the default number of series samples (#47329) 2023-02-08 12:24:08 +00:00
schema.codeinsights.md insights: allow to save the default number of series samples (#47329) 2023-02-08 12:24:08 +00:00
schema.codeintel.json codeintel: Clean up triggers (#45681) 2022-12-15 08:58:38 -06:00
schema.codeintel.md codeintel: Clean up triggers (#45681) 2022-12-15 08:58:38 -06:00
schema.json embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
schema.md embeddings: searcher and indexer (#48017) 2023-03-01 10:50:12 +01:00
search_contexts_test.go search contexts: allow setting a default context and use it on load (#45387) 2022-12-12 09:39:53 -08:00
search_contexts.go search contexts: allow setting a default context and use it on load (#45387) 2022-12-12 09:39:53 -08:00
security_event_logs_test.go security_event_logs: set AnonymousUserID to internal for internal actors (#45309) 2022-12-07 08:32:46 -08:00
security_event_logs.go azureoauth: Add azure devops oauth provider (#47805) 2023-02-20 07:21:15 +00:00
settings_test.go remove many removed and unused fields from settings and site config schemas (#46045) 2023-01-02 12:12:30 -10:00
settings.go Revive #42039 (#43168) 2022-10-19 17:57:01 +02:00
survey_responses_test.go NPS Survey: Update use cases to be a free-form field (#38387) 2022-07-08 14:13:42 +01:00
survey_responses.go NPS Survey: Update use cases to be a free-form field (#38387) 2022-07-08 14:13:42 +01:00
teams_test.go teams: Adjust schema for better external user mapping (#47143) 2023-02-24 00:37:02 +01:00
teams.go teams: Adjust schema for better external user mapping (#47143) 2023-02-24 00:37:02 +01:00
temporary_settings_test.go logging(gitstart): migrate internal/database to lib/log (#36466) 2022-06-20 23:04:01 +00:00
temporary_settings.go database: remove dbutil-based constructors (#36210) 2022-05-30 23:48:20 +00:00
test_util.go SCIM: Implement user creation (#47573) 2023-02-15 20:21:28 +01:00
testing.go all: use any instead of interface{} (#35102) 2022-05-09 10:59:39 +02:00
user_credentials_test.go Housekeeping: Rename variables to avoid collisions with packages (#47179) 2023-01-31 16:28:43 +01:00
user_credentials.go Database: convert Transact to WithTransact for stores that don't use their transactions (#47119) 2023-01-31 08:45:05 -07:00
user_emails_test.go useremail: Remove dotcom special cases (#44814) 2022-12-06 13:01:17 +01:00
user_emails.go useremail: Remove dotcom special cases (#44814) 2022-12-06 13:01:17 +01:00
user_roles_test.go rbac: create users and assign role in single transaction (#47700) 2023-02-17 15:07:46 +01:00
user_roles.go rbac: always upsert role assignment and permission assignment (#47780) 2023-02-17 13:07:47 +01:00
users_builtin_auth_test.go Allow users signed up with SSO to create password (#47417) 2023-02-09 20:34:58 +06:00
users_test.go teams: Adjust schema for better external user mapping (#47143) 2023-02-24 00:37:02 +01:00
users.go SCIM: Add PATCH and PUT (#48291) 2023-02-28 17:34:46 +01:00
webhook_logs_test.go Database: replace DB.Transact()/DB.Done() with DB.WithTransact() (#47071) 2023-01-30 22:17:13 -07:00
webhook_logs.go Add webhook logging to new webhooks handler (#43446) 2022-10-27 09:58:37 -06:00
webhooks_test.go Update the webhooks update GraphQL endpoint (#45207) 2022-12-07 10:42:27 +02:00
webhooks.go Remove all unused code specified by linter warnings (#47015) 2023-01-27 13:17:36 +02:00
zoekt_repos_test.go Add zoekt_repos table, populate it in worker, add endpoint for Zoekt to update (#43289) 2022-10-27 14:07:19 +02:00
zoekt_repos.go Add zoekt_repos table, populate it in worker, add endpoint for Zoekt to update (#43289) 2022-10-27 14:07:19 +02:00