embeddings: searcher and indexer (#48017)

# High-level architecture overview
<img width="2231" alt="Screenshot 2023-02-24 at 15 13 59"
src="https://user-images.githubusercontent.com/6417322/221200130-53c1ff25-4c47-4532-885f-5c4f9dadb05e.png">


# Embeddings

Really quickly: embeddings are a semantic representation of text.
Embeddings are usually floating-point vectors with 256+ elements. The
neat thing about embeddings is that they let us search over textual
information using semantic similarity between the query and the text,
not just syntactic keyword matching.

In this PR, we implemented an embedding service that will allow us to do
semantic code search over repositories in Sourcegraph. So, for example,
you'll be able to ask, "how do access tokens work in Sourcegraph", and
it will give you a list of the closest matching code files.

Additionally, we built a context detection service powered by
embeddings. In chat applications, it is important to know whether the
user's message requires additional context. We have to differentiate
between two cases: the user asks a general question about the codebase,
or the user references something in the existing conversation. In the
latter case, including the context would ruin the flow of the
conversation, and the chatbot would most likely return a confusing
answer. We determine whether a query _does not_ require additional
context using two approaches:

1. We check if the query contains well-known phrases that indicate the
user is referencing the existing conversation (e.g., "translate
previous", "change that")
1. We have a static dataset of messages that require context and a
dataset of messages that do not. We embed both datasets and then, using
embedding similarity, check which set the query is more similar to (a
minimal sketch of this check follows the list).
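
For the second approach, the decision reduces to comparing the query's
similarity to the mean embedding of each dataset. Below is a minimal
sketch in Go, assuming both datasets have already been embedded and
averaged; the identifiers are illustrative rather than the PR's actual
ones, though the 0.02 margin mirrors the
`MIN_NO_CONTEXT_SIMILARITY_DIFF` constant used in the implementation:

```go
package contextdetection

import "math"

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// contextRequired reports whether the query still needs repository
// context. We only skip context when the query is clearly more similar
// to the "no context needed" dataset, hence the safety margin.
func contextRequired(queryEmbedding, meanWithContext, meanWithoutContext []float32) bool {
	const minDiff = 0.02 // mirrors MIN_NO_CONTEXT_SIMILARITY_DIFF
	withSim := cosine(queryEmbedding, meanWithContext)
	withoutSim := cosine(queryEmbedding, meanWithoutContext)
	return withoutSim-withSim < minDiff
}
```

In the implementation, the phrase check from approach 1 runs first and
short-circuits before any embedding work happens.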

## GraphQL API

We add four new resolvers to the GraphQL API:

```graphql
extend type Query {
  embeddingsSearch(repo: ID!, query: String!, codeResultsCount: Int!, textResultsCount: Int!): EmbeddingsSearchResults!
  isContextRequiredForChatQuery(query: String!): Boolean!
}
extend type Mutation {
  scheduleRepositoriesForEmbedding(repoNames: [String!]!): EmptyResponse!
  scheduleContextDetectionForEmbedding: EmptyResponse!
}
```

- `embeddingsSearch` performs an embeddings search over the repo's
embedding index and returns the specified number of code and text results
- `isContextRequiredForChatQuery` determines whether the given query
requires additional context
- `scheduleRepositoriesForEmbedding` schedules a repo embedding
background job
- `scheduleContextDetectionForEmbedding` schedules a context detection
embedding background job that embeds a static dataset of messages.

## Repo embedding background job

Embedding a repository is implemented as a background job. The
background job handler receives the repository and the revision that
should be embedded. The handler then gathers a list of files from
gitserver and excludes files larger than 1MB. The list of files is
split into code files and text files (.md, .txt), and we build a
separate embedding index for each. We split them because, in a combined
index, the text files always tended to feature as the top results and
didn't leave any room for code files. Once we have the list of files,
the procedure is as follows (a sketch in Go follows the list):

- For each file:
  - Get the file contents from gitserver
  - Check if the file is embeddable (is not autogenerated, is large
    enough, does not have long lines)
  - Split the file into embeddable chunks
  - Embed the file chunks using an external embedding service (defined
    in site config)
  - Add the embedded file chunks and metadata to the index
    - Metadata contains the file name, the start line, and the end line
      of the chunk
- Once all files are processed, the index is marshaled into JSON and
stored in blob storage (GCS, S3)
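
A minimal sketch of this loop in Go (the helper names here are
hypothetical stand-ins for the gitserver client, the embeddable-file
heuristics, and the external embedding API client, not the PR's actual
identifiers):

```go
package indexer

import "strings"

// chunk is an embeddable slice of a file plus the metadata stored
// alongside its vector in the index: file name, start line (inclusive),
// and end line (exclusive).
type chunk struct {
	FileName           string
	StartLine, EndLine int
	Content            string
}

// buildIndex walks the file list, skips non-embeddable files, embeds
// each file's chunks, and accumulates the vectors and metadata.
func buildIndex(
	files []string,
	readFile func(name string) ([]byte, error), // stand-in for gitserver
	embed func(texts []string) ([][]float32, error), // external embedding API
) (vectors [][]float32, meta []chunk, err error) {
	for _, name := range files {
		content, err := readFile(name)
		if err != nil {
			return nil, nil, err
		}
		if !isEmbeddable(content) {
			continue // autogenerated, too small, or has overly long lines
		}
		chunks := split(name, string(content))
		texts := make([]string, len(chunks))
		for i, c := range chunks {
			texts[i] = c.Content
		}
		embedded, err := embed(texts)
		if err != nil {
			return nil, nil, err
		}
		vectors = append(vectors, embedded...)
		meta = append(meta, chunks...)
	}
	return vectors, meta, nil
}

// isEmbeddable is a placeholder for the real heuristics.
func isEmbeddable(content []byte) bool { return len(content) > 0 }

// split cuts a file into fixed-size line windows; the real chunking is
// smarter, but every chunk records its start and end line.
func split(name, content string) []chunk {
	lines := strings.Split(content, "\n")
	const window = 16 // illustrative chunk size
	var out []chunk
	for start := 0; start < len(lines); start += window {
		end := start + window
		if end > len(lines) {
			end = len(lines)
		}
		out = append(out, chunk{name, start, end, strings.Join(lines[start:end], "\n")})
	}
	return out
}
```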

### Site config changes

As mentioned, we use a configurable external embedding API that does the
actual text -> vector embedding part. Ideally, this allows us to swap
embedding providers in the future.

```json
"embeddings": {
  "description": "Configuration for embeddings service.",
  "type": "object",
  "required": ["enabled", "dimensions", "model", "accessToken", "url"],
  "properties": {
    "enabled": {
      "description": "Toggles whether embedding service is enabled.",
      "type": "boolean",
      "default": false
    },
    "dimensions": {
      "description": "The dimensionality of the embedding vectors.",
      "type": "integer",
      "minimum": 0
    },
    "model": {
      "description": "The model used for embedding.",
      "type": "string"
    },
    "accessToken": {
      "description": "The access token used to authenticate with the external embedding API service.",
      "type": "string"
    },
    "url": {
      "description": "The url to the external embedding API service.",
      "type": "string",
      "format": "uri"
    }
  }
}
```
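
For reference, a filled-out `embeddings` object might look like this
(the values are illustrative; the model, dimensions, and URL depend on
whichever embedding provider you point it at):

```json
"embeddings": {
  "enabled": true,
  "dimensions": 1536,
  "model": "text-embedding-ada-002",
  "accessToken": "<your-api-token>",
  "url": "https://api.openai.com/v1/embeddings"
}
```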

## Repo embeddings search

The repo embeddings search is implemented in its own service. When a
user queries a repo using embeddings search, the following happens:

- Download the repo embedding index from blob storage and cache it in
memory
  - We cache up to 5 embedding indexes in memory
- Embed the query and use the embedded query vector to find similar code
and text file metadata in the embedding index (the ranking is sketched
after the list)
- Query gitserver for the actual file contents
- Return the results
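
The ranking itself is a brute-force similarity scan over the index. A
minimal sketch, assuming one vector per chunk (illustrative
identifiers; the actual implementation is in the similarity search file
linked below):

```go
package searcher

import (
	"math"
	"sort"
)

// row is one indexed chunk: its metadata plus its embedding vector.
type row struct {
	FileName           string
	StartLine, EndLine int
	Vector             []float32
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK sorts the index rows by similarity to the embedded query and
// returns the k best; gitserver is then asked for the winners' file
// contents. Scores are recomputed in the comparator for brevity.
func topK(query []float32, index []row, k int) []row {
	sort.Slice(index, func(i, j int) bool {
		return cosine(query, index[i].Vector) > cosine(query, index[j].Vector)
	})
	if k > len(index) {
		k = len(index)
	}
	return index[:k]
}
```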

## Interesting files

- [Similarity
search](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-102cc83520004eb0e2795e49bc435c5142ca555189b1db3a52bbf1ffb82fa3c6)
- [Repo embedding job
handler](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-c345f373f426398beb4b9cd5852ba862a2718687882db2a8b2d9c7fbb5f1dc52)
- [External embedding api
client](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-ad1e7956f518e4bcaee17dd9e7ac04a5e090c00d970fcd273919e887e1d2cf8f)
- [Embedding a
repo](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-1f35118727128095b7816791b6f0a2e0e060cddee43d25102859b8159465585c)
- [Embeddings searcher
service](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-5b20f3e7ef87041daeeaef98b58ebf7388519cedcdfc359dc5e6d4e0b021472e)
- [Embeddings
search](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-79f95b9cc3f1ef39c1a0b88015bd9cd6c19c30a8d4c147409f1b8e8cd9462ea1)
- [Repo embedding index cache
management](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-8a41f7dec31054889dbf86e97c52223d5636b4d408c6b375bcfc09160a8b70f8)
- [GraphQL
resolvers](https://github.com/sourcegraph/sourcegraph/pull/48017/files#diff-9b30a0b5efcb63e2f4611b99ab137fbe09629a769a4f30d10a1b2da41a01d21f)


## Test plan

- Start by filling out the `embeddings` object in the site config (let
me know if you need an API key)
- Start the embeddings service using `sg start embeddings`
- Go to the `/api/console` page and schedule a repo embedding job and a
context detection embedding job:

```graphql
mutation {
  scheduleRepositoriesForEmbedding(repoNames: ["github.com/sourcegraph/handbook"]) {
    __typename
  }
  scheduleContextDetectionForEmbedding {
    __typename
  }
}
```

- Once both are finished, you should be able to query the repo embedding
index and determine whether context is needed for a given query:

```graphql
query {
  isContextRequiredForChatQuery(query: "how do access tokens work")
  embeddingsSearch(
    repo: "UmVwb3NpdG9yeToy", # github.com/sourcegraph/handbook GQL ID
    query: "how do access tokens work", 
    codeResultsCount: 5,
    textResultsCount: 5) {
    codeResults {
      fileName
      content
    }
    textResults {
      fileName
      content
    }
  }
}
```

Commit f77f0272cf (parent 6bf58cccca), authored by Rok Novosel on
2023-03-01 10:50:12 +01:00 and committed by GitHub. 78 changed files
with 4409 additions and 46 deletions.

View File

@ -59,6 +59,7 @@ type Services struct {
ComputeResolver graphqlbackend.ComputeResolver
InsightsAggregationResolver graphqlbackend.InsightsAggregationResolver
WebhooksResolver graphqlbackend.WebhooksResolver
EmbeddingsResolver graphqlbackend.EmbeddingsResolver
RBACResolver graphqlbackend.RBACResolver
OwnResolver graphqlbackend.OwnResolver
}

View File

@ -0,0 +1,42 @@
package graphqlbackend
import (
"context"
"github.com/graph-gophers/graphql-go"
)
type EmbeddingsResolver interface {
EmbeddingsSearch(ctx context.Context, args EmbeddingsSearchInputArgs) (EmbeddingsSearchResultsResolver, error)
IsContextRequiredForChatQuery(ctx context.Context, args IsContextRequiredForChatQueryInputArgs) (bool, error)
ScheduleRepositoriesForEmbedding(ctx context.Context, args ScheduleRepositoriesForEmbeddingArgs) (*EmptyResponse, error)
ScheduleContextDetectionForEmbedding(ctx context.Context) (*EmptyResponse, error)
}
type ScheduleRepositoriesForEmbeddingArgs struct {
RepoNames []string
}
type IsContextRequiredForChatQueryInputArgs struct {
Query string
}
type EmbeddingsSearchInputArgs struct {
Repo graphql.ID
Query string
CodeResultsCount int32
TextResultsCount int32
}
type EmbeddingsSearchResultsResolver interface {
CodeResults(ctx context.Context) []EmbeddingsSearchResultResolver
TextResults(ctx context.Context) []EmbeddingsSearchResultResolver
}
type EmbeddingsSearchResultResolver interface {
FileName(ctx context.Context) string
StartLine(ctx context.Context) int32
EndLine(ctx context.Context) int32
Content(ctx context.Context) string
}

View File

@ -0,0 +1,79 @@
extend type Query {
"""
Experimental: Searches a repository for similar code and text results using embeddings.
We separated code and text results because text results tended to always feature at the top of the combined results,
and didn't leave room for the code.
"""
embeddingsSearch(
"""
The repository to search.
"""
repo: ID!
"""
The query used for embeddings search.
"""
query: String!
"""
The number of code results to return.
"""
codeResultsCount: Int!
"""
The number of text results to return. Text results contain Markdown files and similar file types primarily used for writing documentation.
"""
textResultsCount: Int!
): EmbeddingsSearchResults!
"""
Experimental: Determines whether the given query requires further context before it can be answered.
For example:
- "What are Sourcegraph Notebooks" requires additional information from the Sourcegraph repository (Notebooks Markdown docs, etc.).
- "Translate the previous code to Typescript" does not need additional context since it is referring to the existing context (or conversation).
"""
isContextRequiredForChatQuery(query: String!): Boolean!
}
extend type Mutation {
"""
Experimental: Schedules a job to create an embedding search index for each listed repository. The indices are used for embeddings search.
"""
scheduleRepositoriesForEmbedding(repoNames: [String!]!): EmptyResponse!
"""
Experimental: Schedules a job to create an embedding index used for context detection. The index is used to determine whether a query requires additional context.
"""
scheduleContextDetectionForEmbedding: EmptyResponse!
}
"""
A single embeddings search result.
"""
type EmbeddingsSearchResult {
"""
The search result file name.
"""
fileName: String!
"""
The start line of the content (inclusive).
"""
startLine: Int!
"""
The end line of the content (exclusive).
"""
endLine: Int!
"""
The content of the file from start line to end line.
"""
content: String!
}
"""
Embeddings search results. Contains a list of code results and a list of text results.
"""
type EmbeddingsSearchResults {
"""
A list of code file results.
"""
codeResults: [EmbeddingsSearchResult!]!
"""
A list of text file results.
"""
textResults: [EmbeddingsSearchResult!]!
}

View File

@ -3,7 +3,7 @@ package graphqlbackend
import (
"testing"
"github.com/sourcegraph/sourcegraph/cmd/frontend/internal/highlight"
"github.com/sourcegraph/sourcegraph/internal/binary"
)
func TestIsBinary(t *testing.T) {
@ -66,7 +66,7 @@ func TestIsBinary(t *testing.T) {
}
for _, tst := range tests {
t.Run(tst.name, func(t *testing.T) {
got := highlight.IsBinary(tst.input)
got := binary.IsBinary(tst.input)
if got != tst.want {
t.Fatalf("got %v want %v", got, tst.want)
}

View File

@ -18,6 +18,7 @@ import (
"github.com/sourcegraph/sourcegraph/cmd/frontend/internal/highlight"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/authz"
"github.com/sourcegraph/sourcegraph/internal/binary"
"github.com/sourcegraph/sourcegraph/internal/cloneurls"
resolverstubs "github.com/sourcegraph/sourcegraph/internal/codeintel/resolvers"
"github.com/sourcegraph/sourcegraph/internal/database"
@ -163,7 +164,7 @@ func (r *GitTreeEntryResolver) Binary(ctx context.Context) (bool, error) {
if err != nil {
return false, err
}
return highlight.IsBinary(r.fullContentBytes), nil
return binary.IsBinary(r.fullContentBytes), nil
}
func (r *GitTreeEntryResolver) Highlight(ctx context.Context, args *HighlightArgs) (*HighlightedFileResolver, error) {

View File

@ -382,39 +382,39 @@ func prometheusGraphQLRequestName(requestName string) string {
}
func NewSchemaWithoutResolvers(db database.DB) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
}
func NewSchemaWithNotebooksResolver(db database.DB, notebooks NotebooksResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, notebooks, nil, nil, nil, nil, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, notebooks, nil, nil, nil, nil, nil, nil)
}
func NewSchemaWithAuthzResolver(db database.DB, authz AuthzResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, authz, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, authz, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
}
func NewSchemaWithBatchChangesResolver(db database.DB, batchChanges BatchChangesResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), batchChanges, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
return NewSchema(db, gitserver.NewClient(), batchChanges, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
}
func NewSchemaWithCodeMonitorsResolver(db database.DB, codeMonitors CodeMonitorsResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, codeMonitors, nil, nil, nil, nil, nil, nil, nil, nil, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, codeMonitors, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
}
func NewSchemaWithLicenseResolver(db database.DB, license LicenseResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, license, nil, nil, nil, nil, nil, nil, nil, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, license, nil, nil, nil, nil, nil, nil, nil, nil, nil)
}
func NewSchemaWithWebhooksResolver(db database.DB, webhooksResolver WebhooksResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, webhooksResolver, nil, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, webhooksResolver, nil, nil, nil)
}
func NewSchemaWithRBACResolver(db database.DB, rbacResolver RBACResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, rbacResolver, nil)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, rbacResolver, nil)
}
func NewSchemaWithOwnResolver(db database.DB, own OwnResolver) (*graphql.Schema, error) {
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, own)
return NewSchema(db, gitserver.NewClient(), nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, own)
}
func NewSchema(
@ -432,6 +432,7 @@ func NewSchema(
compute ComputeResolver,
insightsAggregation InsightsAggregationResolver,
webhooksResolver WebhooksResolver,
embeddingsResolver EmbeddingsResolver,
rbacResolver RBACResolver,
ownResolver OwnResolver,
) (*graphql.Schema, error) {
@ -538,6 +539,12 @@ func NewSchema(
}
}
if embeddingsResolver != nil {
EnterpriseResolvers.embeddingsResolver = embeddingsResolver
resolver.EmbeddingsResolver = embeddingsResolver
schemas = append(schemas, embeddingsSchema)
}
if rbacResolver != nil {
EnterpriseResolvers.rbacResolver = rbacResolver
resolver.RBACResolver = rbacResolver
@ -599,6 +606,7 @@ type schemaResolver struct {
NotebooksResolver
InsightsAggregationResolver
WebhooksResolver
EmbeddingsResolver
RBACResolver
OwnResolver
}
@ -703,6 +711,7 @@ var EnterpriseResolvers = struct {
notebooksResolver NotebooksResolver
InsightsAggregationResolver InsightsAggregationResolver
webhooksResolver WebhooksResolver
embeddingsResolver EmbeddingsResolver
rbacResolver RBACResolver
ownResolver OwnResolver
}{}

View File

@ -69,6 +69,11 @@ var insightsAggregationsSchema string
//go:embed outbound_webhooks.graphql
var outboundWebhooksSchema string
// embeddingsSchema is the Embeddings raw graphql schema.
//
//go:embed embeddings.graphql
var embeddingsSchema string
// rbacSchema is the RBAC raw graphql schema.
//
//go:embed rbac.graphql

View File

@ -25,7 +25,7 @@ func mustParseGraphQLSchema(t *testing.T, db database.DB) *graphql.Schema {
func mustParseGraphQLSchemaWithClient(t *testing.T, db database.DB, gitserverClient gitserver.Client) *graphql.Schema {
t.Helper()
parsedSchema, parseSchemaErr := NewSchema(db, gitserverClient, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
parsedSchema, parseSchemaErr := NewSchema(db, gitserverClient, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil)
if parseSchemaErr != nil {
t.Fatal(parseSchemaErr)
}

View File

@ -12,6 +12,8 @@ import (
"github.com/sourcegraph/sourcegraph/cmd/frontend/graphqlbackend/externallink"
"github.com/sourcegraph/sourcegraph/cmd/frontend/internal/highlight"
"github.com/sourcegraph/sourcegraph/internal/binary"
)
// FileContentFunc is a closure that returns the contents of a file and is used by the VirtualFileResolver.
@ -94,7 +96,7 @@ func (r *VirtualFileResolver) Binary(ctx context.Context) (bool, error) {
if err != nil {
return false, err
}
return highlight.IsBinary([]byte(content)), nil
return binary.IsBinary([]byte(content)), nil
}
var highlightHistogram = promauto.NewHistogram(prometheus.HistogramOpts{

View File

@ -564,6 +564,12 @@ func serviceConnections(logger log.Logger) conftypes.ServiceConnections {
logger.Error("failed to get zoekt endpoints for service connections", log.Error(err))
}
embeddingsMap := computeEmbeddingsEndpoints()
embeddingsAddrs, err := embeddingsMap.Endpoints()
if err != nil {
logger.Error("failed to get embeddings endpoints for service connections", log.Error(err))
}
return conftypes.ServiceConnections{
GitServers: gitAddrs,
PostgresDSN: serviceConnectionsVal.PostgresDSN,
@ -571,6 +577,7 @@ func serviceConnections(logger log.Logger) conftypes.ServiceConnections {
CodeInsightsDSN: serviceConnectionsVal.CodeInsightsDSN,
Searchers: searcherAddrs,
Symbols: symbolsAddrs,
Embeddings: embeddingsAddrs,
Zoekts: zoektAddrs,
ZoektListTTL: indexedListTTL,
}
@ -586,6 +593,9 @@ var (
indexedEndpointsOnce sync.Once
indexedEndpoints *endpoint.Map
embeddingsURLsOnce sync.Once
embeddingsURLs *endpoint.Map
indexedListTTL = func() time.Duration {
ttl, _ := time.ParseDuration(env.Get("SRC_INDEXED_SEARCH_LIST_CACHE_TTL", "", "Indexed search list cache TTL"))
if ttl == 0 {
@ -626,6 +636,33 @@ func symbolsAddr(environ []string) (string, error) {
return "http://symbols:3184", nil
}
func computeEmbeddingsEndpoints() *endpoint.Map {
embeddingsURLsOnce.Do(func() {
addr, err := embeddingsAddr(os.Environ())
if err != nil {
embeddingsURLs = endpoint.Empty(errors.Wrap(err, "failed to parse EMBEDDINGS_URL"))
} else {
embeddingsURLs = endpoint.New(addr)
}
})
return embeddingsURLs
}
func embeddingsAddr(environ []string) (string, error) {
const (
serviceName = "embeddings"
port = "9991"
)
if addr, ok := getEnv(environ, "EMBEDDINGS_URL"); ok {
addrs, err := replicaAddrs(deploy.Type(), addr, serviceName, port)
return addrs, err
}
// Not set, use the default (non-service discovery on embeddings)
return "http://embeddings:9991", nil
}
func LoadConfig() {
highlight.LoadConfig()
symbols.LoadConfig()

View File

@ -217,6 +217,7 @@ func Main(ctx context.Context, observationCtx *observation.Context, ready servic
enterpriseServices.ComputeResolver,
enterpriseServices.InsightsAggregationResolver,
enterpriseServices.WebhooksResolver,
enterpriseServices.EmbeddingsResolver,
enterpriseServices.RBACResolver,
enterpriseServices.OwnResolver,
)

View File

@ -6,13 +6,11 @@ import (
"encoding/base64"
"fmt"
"html/template"
"net/http"
"path"
"path/filepath"
"strings"
"sync"
"time"
"unicode/utf8"
"github.com/inconshreveable/log15"
otlog "github.com/opentracing/opentracing-go/log"
@ -23,6 +21,7 @@ import (
"golang.org/x/net/html/atom"
"google.golang.org/protobuf/proto"
"github.com/sourcegraph/sourcegraph/internal/binary"
"github.com/sourcegraph/sourcegraph/internal/conf/deploy"
"github.com/sourcegraph/sourcegraph/internal/honey"
@ -66,18 +65,6 @@ func getHighlightOp() *observation.Operation {
return highlightOp
}
// IsBinary is a helper to tell if the content of a file is binary or not.
// TODO(tjdevries): This doesn't make sense to be here, IMO
func IsBinary(content []byte) bool {
// We first check if the file is valid UTF8, since we always consider that
// to be non-binary.
//
// Secondly, if the file is not valid UTF8, we check if the detected HTTP
// content type is text, which covers a whole slew of other non-UTF8 text
// encodings for us.
return !utf8.Valid(content) && !strings.HasPrefix(http.DetectContentType(content), "text/")
}
// Params defines mandatory and optional parameters to use when highlighting
// code.
type Params struct {
@ -384,7 +371,7 @@ func Code(ctx context.Context, p Params) (response *HighlightedCode, aborted boo
}
// Never pass binary files to the syntax highlighter.
if IsBinary(p.Content) {
if binary.IsBinary(p.Content) {
return nil, false, ErrBinary
}
code := string(p.Content)

View File

@ -23,6 +23,7 @@ allowed_prefix=(
github.com/sourcegraph/sourcegraph/enterprise/cmd/migrator
github.com/sourcegraph/sourcegraph/enterprise/cmd/precise-code-intel-
github.com/sourcegraph/sourcegraph/enterprise/cmd/symbols
github.com/sourcegraph/sourcegraph/enterprise/cmd/embeddings
# Doesn't connect but uses db internals for use with sqlite
github.com/sourcegraph/sourcegraph/cmd/symbols
# Transitively depends on zoekt package which imports but does not use DB

View File

@ -79,8 +79,8 @@ The default run type.
- Pipeline for `DockerImages` changes:
- **Metadata**: Pipeline metadata
- **Test builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build server, Build sg
- **Scan test builds**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan sg
- **Test builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build embeddings, Build server, Build sg
- **Scan test builds**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan embeddings, Scan sg
- Upload build trace
- Pipeline for `WolfiPackages` changes:
@ -178,8 +178,8 @@ Base pipeline (more steps might be included based on branch changes):
- **Metadata**: Pipeline metadata
- **Pipeline setup**: Trigger async
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build server, Build sg, Build executor image, Build executor binary, Build docker registry mirror image
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan sg
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build embeddings, Build server, Build sg, Build executor image, Build executor binary, Build docker registry mirror image
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan embeddings, Scan sg
- **Linters and static analysis**: GraphQL lint, Run sg lint
- **Client checks**: Puppeteer tests prep, Puppeteer tests for chrome extension, Puppeteer tests chunk #1, Puppeteer tests chunk #2, Puppeteer tests chunk #3, Puppeteer tests chunk #4, Puppeteer tests chunk #5, Puppeteer tests chunk #6, Puppeteer tests chunk #7, Puppeteer tests chunk #8, Puppeteer tests chunk #9, Puppeteer tests chunk #10, Puppeteer tests chunk #11, Puppeteer tests chunk #12, Puppeteer tests chunk #13, Puppeteer tests chunk #14, Puppeteer tests chunk #15, Upload Storybook to Chromatic, Test (all), Build, Enterprise build, Test (client/web), Test (client/browser), Test (client/jetbrains), Build TS, ESLint (all), Stylelint (all)
- **Go checks**: Test (all), Test (all (gRPC)), Test (enterprise/internal/insights), Test (enterprise/internal/insights (gRPC)), Test (internal/repos), Test (internal/repos (gRPC)), Test (enterprise/internal/batches), Test (enterprise/internal/batches (gRPC)), Test (cmd/frontend), Test (cmd/frontend (gRPC)), Test (enterprise/cmd/frontend/internal/batches/resolvers), Test (enterprise/cmd/frontend/internal/batches/resolvers (gRPC)), Test (dev/sg), Test (dev/sg (gRPC)), Test (internal/database), Test (enterprise/internal/database), Build
@ -187,7 +187,7 @@ Base pipeline (more steps might be included based on branch changes):
- **CI script tests**: test-trace-command.sh
- **Integration tests**: Backend integration tests (gRPC), Backend integration tests, Code Intel QA
- **End-to-end tests**: Executors E2E, Sourcegraph E2E, Sourcegraph QA, Sourcegraph Cluster (deploy-sourcegraph) QA, Sourcegraph Upgrade
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, server, sg, Publish executor image, Publish executor binary, Publish docker registry mirror image
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, embeddings, server, sg, Publish executor image, Publish executor binary, Publish docker registry mirror image
- Upload build trace
### Release branch
@ -198,8 +198,8 @@ Base pipeline (more steps might be included based on branch changes):
- **Metadata**: Pipeline metadata
- **Pipeline setup**: Trigger async
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build server, Build sg, Build executor image, Build executor binary, Build docker registry mirror image
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan sg
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build embeddings, Build server, Build sg, Build executor image, Build executor binary, Build docker registry mirror image
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan embeddings, Scan sg
- **Linters and static analysis**: GraphQL lint, Run sg lint
- **Client checks**: Puppeteer tests prep, Puppeteer tests for chrome extension, Puppeteer tests chunk #1, Puppeteer tests chunk #2, Puppeteer tests chunk #3, Puppeteer tests chunk #4, Puppeteer tests chunk #5, Puppeteer tests chunk #6, Puppeteer tests chunk #7, Puppeteer tests chunk #8, Puppeteer tests chunk #9, Puppeteer tests chunk #10, Puppeteer tests chunk #11, Puppeteer tests chunk #12, Puppeteer tests chunk #13, Puppeteer tests chunk #14, Puppeteer tests chunk #15, Upload Storybook to Chromatic, Test (all), Build, Enterprise build, Test (client/web), Test (client/browser), Test (client/jetbrains), Build TS, ESLint (all), Stylelint (all)
- **Go checks**: Test (all), Test (all (gRPC)), Test (enterprise/internal/insights), Test (enterprise/internal/insights (gRPC)), Test (internal/repos), Test (internal/repos (gRPC)), Test (enterprise/internal/batches), Test (enterprise/internal/batches (gRPC)), Test (cmd/frontend), Test (cmd/frontend (gRPC)), Test (enterprise/cmd/frontend/internal/batches/resolvers), Test (enterprise/cmd/frontend/internal/batches/resolvers (gRPC)), Test (dev/sg), Test (dev/sg (gRPC)), Test (internal/database), Test (enterprise/internal/database), Build
@ -207,7 +207,7 @@ Base pipeline (more steps might be included based on branch changes):
- **CI script tests**: test-trace-command.sh
- **Integration tests**: Backend integration tests (gRPC), Backend integration tests, Code Intel QA
- **End-to-end tests**: Executors E2E, Sourcegraph E2E, Sourcegraph QA, Sourcegraph Cluster (deploy-sourcegraph) QA, Sourcegraph Upgrade
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, server, sg
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, embeddings, server, sg
- Upload build trace
### Browser extension release build
@ -247,8 +247,8 @@ Base pipeline (more steps might be included based on branch changes):
- **Metadata**: Pipeline metadata
- **Pipeline setup**: Trigger async
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build server, Build sg, Build executor image, Build executor binary
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan sg
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build embeddings, Build server, Build sg, Build executor image, Build executor binary
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan embeddings, Scan sg
- **Linters and static analysis**: GraphQL lint, Run sg lint
- **Client checks**: Puppeteer tests prep, Puppeteer tests for chrome extension, Puppeteer tests chunk #1, Puppeteer tests chunk #2, Puppeteer tests chunk #3, Puppeteer tests chunk #4, Puppeteer tests chunk #5, Puppeteer tests chunk #6, Puppeteer tests chunk #7, Puppeteer tests chunk #8, Puppeteer tests chunk #9, Puppeteer tests chunk #10, Puppeteer tests chunk #11, Puppeteer tests chunk #12, Puppeteer tests chunk #13, Puppeteer tests chunk #14, Puppeteer tests chunk #15, Upload Storybook to Chromatic, Test (all), Build, Enterprise build, Test (client/web), Test (client/browser), Test (client/jetbrains), Build TS, ESLint (all), Stylelint (all)
- **Go checks**: Test (all), Test (all (gRPC)), Test (enterprise/internal/insights), Test (enterprise/internal/insights (gRPC)), Test (internal/repos), Test (internal/repos (gRPC)), Test (enterprise/internal/batches), Test (enterprise/internal/batches (gRPC)), Test (cmd/frontend), Test (cmd/frontend (gRPC)), Test (enterprise/cmd/frontend/internal/batches/resolvers), Test (enterprise/cmd/frontend/internal/batches/resolvers (gRPC)), Test (dev/sg), Test (dev/sg (gRPC)), Test (internal/database), Test (enterprise/internal/database), Build
@ -256,7 +256,7 @@ Base pipeline (more steps might be included based on branch changes):
- **CI script tests**: test-trace-command.sh
- **Integration tests**: Backend integration tests (gRPC), Backend integration tests, Code Intel QA
- **End-to-end tests**: Executors E2E, Sourcegraph E2E, Sourcegraph QA, Sourcegraph Cluster (deploy-sourcegraph) QA, Sourcegraph Upgrade
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, server, sg, Publish executor image, Publish executor binary
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, embeddings, server, sg, Publish executor image, Publish executor binary
- Upload build trace
### Main dry run
@ -272,8 +272,8 @@ Base pipeline (more steps might be included based on branch changes):
- **Metadata**: Pipeline metadata
- **Pipeline setup**: Trigger async
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build server, Build sg, Build executor image, Build executor binary
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan sg
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build embeddings, Build server, Build sg, Build executor image, Build executor binary
- **Image security scans**: Scan alpine-3.14, Scan cadvisor, Scan codeinsights-db, Scan codeintel-db, Scan frontend, Scan github-proxy, Scan gitserver, Scan grafana, Scan indexed-searcher, Scan jaeger-agent, Scan jaeger-all-in-one, Scan blobstore2, Scan node-exporter, Scan postgres-12-alpine, Scan postgres_exporter, Scan precise-code-intel-worker, Scan prometheus, Scan prometheus-gcp, Scan redis-cache, Scan redis-store, Scan redis_exporter, Scan repo-updater, Scan search-indexer, Scan searcher, Scan symbols, Scan syntax-highlighter, Scan worker, Scan migrator, Scan executor, Scan executor-vm, Scan batcheshelper, Scan opentelemetry-collector, Scan embeddings, Scan sg
- **Linters and static analysis**: GraphQL lint, Run sg lint
- **Client checks**: Puppeteer tests prep, Puppeteer tests for chrome extension, Puppeteer tests chunk #1, Puppeteer tests chunk #2, Puppeteer tests chunk #3, Puppeteer tests chunk #4, Puppeteer tests chunk #5, Puppeteer tests chunk #6, Puppeteer tests chunk #7, Puppeteer tests chunk #8, Puppeteer tests chunk #9, Puppeteer tests chunk #10, Puppeteer tests chunk #11, Puppeteer tests chunk #12, Puppeteer tests chunk #13, Puppeteer tests chunk #14, Puppeteer tests chunk #15, Upload Storybook to Chromatic, Test (all), Build, Enterprise build, Test (client/web), Test (client/browser), Test (client/jetbrains), Build TS, ESLint (all), Stylelint (all)
- **Go checks**: Test (all), Test (all (gRPC)), Test (enterprise/internal/insights), Test (enterprise/internal/insights (gRPC)), Test (internal/repos), Test (internal/repos (gRPC)), Test (enterprise/internal/batches), Test (enterprise/internal/batches (gRPC)), Test (cmd/frontend), Test (cmd/frontend (gRPC)), Test (enterprise/cmd/frontend/internal/batches/resolvers), Test (enterprise/cmd/frontend/internal/batches/resolvers (gRPC)), Test (dev/sg), Test (dev/sg (gRPC)), Test (internal/database), Test (enterprise/internal/database), Build
@ -281,7 +281,7 @@ Base pipeline (more steps might be included based on branch changes):
- **CI script tests**: test-trace-command.sh
- **Integration tests**: Backend integration tests (gRPC), Backend integration tests, Code Intel QA
- **End-to-end tests**: Executors E2E, Sourcegraph E2E, Sourcegraph QA, Sourcegraph Cluster (deploy-sourcegraph) QA, Sourcegraph Upgrade
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, server, sg
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, embeddings, server, sg
- Upload build trace
### Patch image
@ -314,8 +314,8 @@ sg ci build docker-images-candidates-notest
Base pipeline (more steps might be included based on branch changes):
- **Metadata**: Pipeline metadata
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build server, Build sg
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, server, sg
- **Image builds**: Build alpine-3.14, Build cadvisor, Build codeinsights-db, Build codeintel-db, Build frontend, Build github-proxy, Build gitserver, Build grafana, Build indexed-searcher, Build jaeger-agent, Build jaeger-all-in-one, Build blobstore, Build blobstore2, Build node-exporter, Build postgres-12-alpine, Build postgres_exporter, Build precise-code-intel-worker, Build prometheus, Build prometheus-gcp, Build redis-cache, Build redis-store, Build redis_exporter, Build repo-updater, Build search-indexer, Build searcher, Build symbols, Build syntax-highlighter, Build worker, Build migrator, Build executor, Build executor-vm, Build batcheshelper, Build opentelemetry-collector, Build embeddings, Build server, Build sg
- **Publish images**: alpine-3.14, cadvisor, codeinsights-db, codeintel-db, frontend, github-proxy, gitserver, grafana, indexed-searcher, jaeger-agent, jaeger-all-in-one, blobstore, blobstore2, node-exporter, postgres-12-alpine, postgres_exporter, precise-code-intel-worker, prometheus, prometheus-gcp, redis-cache, redis-store, redis_exporter, repo-updater, search-indexer, searcher, symbols, syntax-highlighter, worker, migrator, executor, executor-vm, batcheshelper, opentelemetry-collector, embeddings, server, sg
- Upload build trace
### Build executor without testing

View File

@ -34,6 +34,7 @@ Available commandsets in `sg.config.yaml`:
* batches 🦡
* codeintel
* dotcom
* embeddings
* enterprise
* enterprise-codeinsights
* enterprise-codeintel 🧠
@ -97,6 +98,7 @@ Available commands in `sg.config.yaml`:
* codeintel-worker
* debug-env: Debug env vars
* docsite: Docsite instance serving the docs
* embeddings
* executor-template
* frontend: Enterprise frontend
* github-proxy

View File

@ -0,0 +1,19 @@
FROM sourcegraph/alpine-3.14:196830_2023-02-01_af83eee939ca@sha256:b4d7040d41fcf37fbf96fe5a14c39ae15580a3a6c76355cc7ea04a74b6c3b9fa
ARG COMMIT_SHA="unknown"
ARG DATE="unknown"
ARG VERSION="unknown"
LABEL org.opencontainers.image.revision=${COMMIT_SHA}
LABEL org.opencontainers.image.created=${DATE}
LABEL org.opencontainers.image.version=${VERSION}
LABEL com.sourcegraph.github.url=https://github.com/sourcegraph/sourcegraph/commit/${COMMIT_SHA}
RUN apk add --no-cache \
bash
USER sourcegraph
EXPOSE 9991
WORKDIR /
ENTRYPOINT ["/sbin/tini", "--", "/usr/local/bin/embeddings"]
COPY embeddings /usr/local/bin/

View File

@ -0,0 +1,26 @@
#!/usr/bin/env bash
# We want to build multiple go binaries, so we use a custom build step on CI.
cd "$(dirname "${BASH_SOURCE[0]}")/../../.."
set -ex
OUTPUT=$(mktemp -d -t sgdockerbuild_XXXXXXX)
cleanup() {
rm -rf "$OUTPUT"
}
trap cleanup EXIT
# Environment for building linux binaries
export GO111MODULE=on
export GOARCH=amd64
export GOOS=linux
export CGO_ENABLED=0
pkg="github.com/sourcegraph/sourcegraph/enterprise/cmd/embeddings"
go build -trimpath -ldflags "-X github.com/sourcegraph/sourcegraph/internal/version.version=$VERSION -X github.com/sourcegraph/sourcegraph/internal/version.timestamp=$(date +%s)" -buildmode exe -tags dist -o "$OUTPUT/$(basename $pkg)" "$pkg"
docker build -f enterprise/cmd/embeddings/Dockerfile -t "$IMAGE" "$OUTPUT" \
--progress=plain \
--build-arg COMMIT_SHA \
--build-arg DATE \
--build-arg VERSION

View File

@ -0,0 +1,10 @@
package main
import (
"github.com/sourcegraph/sourcegraph/enterprise/cmd/embeddings/shared"
"github.com/sourcegraph/sourcegraph/enterprise/cmd/sourcegraph/enterprisecmd"
)
func main() {
enterprisecmd.DeprecatedSingleServiceMainEnterprise(shared.Service)
}

View File

@ -0,0 +1,25 @@
package shared
import (
"github.com/sourcegraph/sourcegraph/lib/errors"
emb "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/internal/env"
)
type Config struct {
env.BaseConfig
EmbeddingsUploadStoreConfig *emb.EmbeddingsUploadStoreConfig
}
func (c *Config) Load() {
c.EmbeddingsUploadStoreConfig = &emb.EmbeddingsUploadStoreConfig{}
c.EmbeddingsUploadStoreConfig.Load()
}
func (c *Config) Validate() error {
var errs error
errs = errors.Append(errs, c.EmbeddingsUploadStoreConfig.Validate())
return errs
}

View File

@ -0,0 +1,77 @@
package shared
import (
"context"
"strings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/internal/lazyregexp"
)
type getContextDetectionEmbeddingIndexFn func(ctx context.Context) (*embeddings.ContextDetectionEmbeddingIndex, error)
const MIN_NO_CONTEXT_SIMILARITY_DIFF = float32(0.02)
const MIN_QUERY_WITH_CONTEXT_LENGTH = 16
var NO_CONTEXT_MESSAGES_REGEXPS = []*lazyregexp.Regexp{
lazyregexp.New(`(previous|above)\s+(message|code|text)`),
lazyregexp.New(
`(translate|convert|change|for|make|refactor|rewrite|ignore|explain|fix|try|show)\s+(that|this|above|previous|it|again)`,
),
lazyregexp.New(
`(this|that).*?\s+(is|seems|looks)\s+(wrong|incorrect|bad|good)`,
),
lazyregexp.New(`^(yes|no|correct|wrong|nope|yep|now|cool)(\s|.|,)`),
// User provided their own code context in the form of a Markdown code block.
lazyregexp.New("```"),
}
func isContextRequiredForChatQuery(
ctx context.Context,
getQueryEmbedding getQueryEmbeddingFn,
getContextDetectionEmbeddingIndex getContextDetectionEmbeddingIndexFn,
query string,
) (bool, error) {
queryTrimmed := strings.TrimSpace(query)
if len(queryTrimmed) < MIN_QUERY_WITH_CONTEXT_LENGTH {
return false, nil
}
queryLower := strings.ToLower(queryTrimmed)
for _, regexp := range NO_CONTEXT_MESSAGES_REGEXPS {
if submatches := regexp.FindStringSubmatch(queryLower); len(submatches) > 0 {
return false, nil
}
}
isSimilarToNoContextMessages, err := isQuerySimilarToNoContextMessages(ctx, getQueryEmbedding, getContextDetectionEmbeddingIndex, queryTrimmed)
if err != nil {
return false, err
}
// If the query is similar to messages that require context, then we can assume context is required for the query.
return !isSimilarToNoContextMessages, nil
}
func isQuerySimilarToNoContextMessages(
ctx context.Context,
getQueryEmbedding getQueryEmbeddingFn,
getContextDetectionEmbeddingIndex getContextDetectionEmbeddingIndexFn,
query string,
) (bool, error) {
contextDetectionEmbeddingIndex, err := getContextDetectionEmbeddingIndex(ctx)
if err != nil {
return false, err
}
queryEmbedding, err := getQueryEmbedding(query)
if err != nil {
return false, err
}
messagesWithContextSimilarity := embeddings.CosineSimilarity(contextDetectionEmbeddingIndex.MessagesWithAdditionalContextMeanEmbedding, queryEmbedding)
messagesWithoutContextSimilarity := embeddings.CosineSimilarity(contextDetectionEmbeddingIndex.MessagesWithoutAdditionalContextMeanEmbedding, queryEmbedding)
// We have to be really sure that the query is similar to no context messages, so we include the `MIN_NO_CONTEXT_SIMILARITY_DIFF` threshold.
isSimilarToNoContextMessages := (messagesWithoutContextSimilarity - messagesWithContextSimilarity) >= MIN_NO_CONTEXT_SIMILARITY_DIFF
return isSimilarToNoContextMessages, nil
}
```
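For intuition, the `MIN_NO_CONTEXT_SIMILARITY_DIFF` threshold means a query is only treated as "no context needed" when it is clearly closer to the no-context dataset than to the with-context one. A minimal, self-contained sketch of that decision rule, using made-up similarity values (not part of this PR):

```go
package main

import "fmt"

const minNoContextSimilarityDiff = float32(0.02)

func main() {
	// Hypothetical cosine similarities between a query embedding and the two
	// mean embeddings; the numbers are illustrative only.
	withContextSimilarity := float32(0.79)
	withoutContextSimilarity := float32(0.80)

	// Only a clear margin in favor of the no-context dataset suppresses context.
	similarToNoContext := (withoutContextSimilarity - withContextSimilarity) >= minNoContextSimilarityDiff
	fmt.Println(similarToNoContext) // false: the 0.01 margin is below the 0.02 threshold, so context is still included
}
```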

`@@ -0,0 +1,26 @@`

```go
package shared
import (
"context"
"sync"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/internal/uploadstore"
)
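// getCachedContextDetectionEmbeddingIndex memoizes the context detection index: the first call downloads it from the upload store, and later calls return the in-memory copy, guarded by a mutex.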
func getCachedContextDetectionEmbeddingIndex(uploadStore uploadstore.Store) getContextDetectionEmbeddingIndexFn {
mu := sync.Mutex{}
var contextDetectionEmbeddingIndex *embeddings.ContextDetectionEmbeddingIndex
return func(ctx context.Context) (_ *embeddings.ContextDetectionEmbeddingIndex, err error) {
mu.Lock()
defer mu.Unlock()
if contextDetectionEmbeddingIndex != nil {
return contextDetectionEmbeddingIndex, nil
}
contextDetectionEmbeddingIndex, err = downloadJSONFile[embeddings.ContextDetectionEmbeddingIndex](ctx, uploadStore, embeddings.CONTEXT_DETECTION_INDEX_NAME)
if err != nil {
return nil, err
}
return contextDetectionEmbeddingIndex, nil
}
}
```

`@@ -0,0 +1,22 @@`

```go
package shared
import (
"context"
"encoding/json"
"github.com/sourcegraph/sourcegraph/internal/uploadstore"
)
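// downloadJSONFile fetches the object stored under key from the upload store and JSON-decodes it into a value of type T.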
func downloadJSONFile[T any](ctx context.Context, uploadStore uploadstore.Store, key string) (*T, error) {
file, err := uploadStore.Get(ctx, key)
if err != nil {
return nil, err
}
var jsonFile T
err = json.NewDecoder(file).Decode(&jsonFile)
if err != nil {
return nil, err
}
return &jsonFile, nil
}
```

`@@ -0,0 +1,207 @@`

```go
package shared
import (
"context"
"database/sql"
"encoding/json"
"fmt"
"net/http"
"time"
"github.com/sourcegraph/log"
"github.com/sourcegraph/sourcegraph/cmd/frontend/globals"
"github.com/sourcegraph/sourcegraph/lib/errors"
eiauthz "github.com/sourcegraph/sourcegraph/enterprise/internal/authz"
srp "github.com/sourcegraph/sourcegraph/enterprise/internal/authz/subrepoperms"
edb "github.com/sourcegraph/sourcegraph/enterprise/internal/database"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/embed"
"github.com/sourcegraph/sourcegraph/internal/actor"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/authz"
"github.com/sourcegraph/sourcegraph/internal/conf"
"github.com/sourcegraph/sourcegraph/internal/conf/conftypes"
"github.com/sourcegraph/sourcegraph/internal/database"
connections "github.com/sourcegraph/sourcegraph/internal/database/connections/live"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
"github.com/sourcegraph/sourcegraph/internal/goroutine"
"github.com/sourcegraph/sourcegraph/internal/honey"
"github.com/sourcegraph/sourcegraph/internal/httpserver"
"github.com/sourcegraph/sourcegraph/internal/instrumentation"
"github.com/sourcegraph/sourcegraph/internal/observation"
"github.com/sourcegraph/sourcegraph/internal/service"
"github.com/sourcegraph/sourcegraph/internal/trace"
)
const addr = ":9991"
func Main(ctx context.Context, observationCtx *observation.Context, ready service.ReadyFunc, config *Config) error {
logger := observationCtx.Logger
// Initialize tracing/metrics
observationCtx = observation.NewContext(logger, observation.Honeycomb(&honey.Dataset{
Name: "embeddings",
SampleRate: 20,
}))
// Initialize main DB connection.
sqlDB := mustInitializeFrontendDB(observationCtx)
db := database.NewDB(logger, sqlDB)
go setAuthzProviders(ctx, db)
repoStore := db.Repos()
repoEmbeddingJobsStore := repo.NewRepoEmbeddingJobsStore(db)
// Run setup
gitserverClient := gitserver.NewClient()
uploadStore, err := embeddings.NewEmbeddingsUploadStore(ctx, observationCtx, config.EmbeddingsUploadStoreConfig)
if err != nil {
return err
}
authz.DefaultSubRepoPermsChecker, err = srp.NewSubRepoPermsClient(edb.NewEnterpriseDB(db).SubRepoPerms())
if err != nil {
return errors.Wrap(err, "creating sub-repo client")
}
readFile := func(ctx context.Context, repoName api.RepoName, revision api.CommitID, fileName string) ([]byte, error) {
return gitserverClient.ReadFile(ctx, authz.DefaultSubRepoPermsChecker, repoName, revision, fileName)
}
getRepoEmbeddingIndex, err := getCachedRepoEmbeddingIndex(repoStore, repoEmbeddingJobsStore, func(ctx context.Context, repoEmbeddingIndexName embeddings.RepoEmbeddingIndexName) (*embeddings.RepoEmbeddingIndex, error) {
return downloadJSONFile[embeddings.RepoEmbeddingIndex](ctx, uploadStore, string(repoEmbeddingIndexName))
})
if err != nil {
return err
}
client := embed.NewEmbeddingsClient()
getQueryEmbedding, err := getCachedQueryEmbeddingFn(client)
if err != nil {
return err
}
getContextDetectionEmbeddingIndex := getCachedContextDetectionEmbeddingIndex(uploadStore)
// Create HTTP server
handler := NewHandler(ctx, readFile, getRepoEmbeddingIndex, getQueryEmbedding, getContextDetectionEmbeddingIndex)
handler = handlePanic(logger, handler)
handler = trace.HTTPMiddleware(logger, handler, conf.DefaultClient())
handler = instrumentation.HTTPMiddleware("", handler)
handler = actor.HTTPMiddleware(logger, handler)
server := httpserver.NewFromAddr(addr, &http.Server{
ReadTimeout: 75 * time.Second,
WriteTimeout: 10 * time.Minute,
Handler: handler,
})
// Mark health server as ready and go!
ready()
goroutine.MonitorBackgroundRoutines(ctx, server)
return nil
}
func NewHandler(
ctx context.Context,
readFile readFileFn,
getRepoEmbeddingIndex getRepoEmbeddingIndexFn,
getQueryEmbedding getQueryEmbeddingFn,
getContextDetectionEmbeddingIndex getContextDetectionEmbeddingIndexFn,
) http.Handler {
// Initialize the legacy JSON API server
mux := http.NewServeMux()
mux.HandleFunc("/search", func(w http.ResponseWriter, r *http.Request) {
if r.Method != "POST" {
http.Error(w, fmt.Sprintf("unsupported method %s", r.Method), http.StatusBadRequest)
return
}
var args embeddings.EmbeddingsSearchParameters
err := json.NewDecoder(r.Body).Decode(&args)
if err != nil {
http.Error(w, "could not parse request body", http.StatusBadRequest)
return
}
res, err := searchRepoEmbeddingIndex(ctx, args, readFile, getRepoEmbeddingIndex, getQueryEmbedding)
if err != nil {
http.Error(w, "error searching embedding index", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(res)
})
mux.HandleFunc("/isContextRequiredForChatQuery", func(w http.ResponseWriter, r *http.Request) {
if r.Method != "POST" {
http.Error(w, fmt.Sprintf("unsupported method %s", r.Method), http.StatusBadRequest)
return
}
var args embeddings.IsContextRequiredForChatQueryParameters
err := json.NewDecoder(r.Body).Decode(&args)
if err != nil {
http.Error(w, "could not parse request body", http.StatusBadRequest)
return
}
isRequired, err := isContextRequiredForChatQuery(ctx, getQueryEmbedding, getContextDetectionEmbeddingIndex, args.Query)
if err != nil {
http.Error(w, "error detecting if context is required for query", http.StatusInternalServerError)
return
}
w.Header().Set("Content-Type", "application/json")
json.NewEncoder(w).Encode(embeddings.IsContextRequiredForChatQueryResult{IsRequired: isRequired})
})
return mux
}
func mustInitializeFrontendDB(observationCtx *observation.Context) *sql.DB {
dsn := conf.GetServiceConnectionValueAndRestartOnChange(func(serviceConnections conftypes.ServiceConnections) string {
return serviceConnections.PostgresDSN
})
db, err := connections.EnsureNewFrontendDB(observationCtx, dsn, "embeddings")
if err != nil {
observationCtx.Logger.Fatal("failed to connect to database", log.Error(err))
}
return db
}
// setAuthzProviders periodically refreshes the global authz providers. This changes the repositories that are visible for reads based on the
// current actor stored in an operation's context, which is likely an internal actor for many of
// the jobs configured in this service. This also enables repository update operations to fetch
// permissions from code hosts.
func setAuthzProviders(ctx context.Context, db database.DB) {
// authz also relies on UserMappings being setup.
globals.WatchPermissionsUserMapping()
for range time.NewTicker(eiauthz.RefreshInterval()).C {
allowAccessByDefault, authzProviders, _, _, _ := eiauthz.ProvidersFromConfig(ctx, conf.Get(), db.ExternalServices(), db)
authz.SetProviders(allowAccessByDefault, authzProviders)
}
}
func handlePanic(logger log.Logger, next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
defer func() {
if rec := recover(); rec != nil {
err := fmt.Sprintf("%v", rec)
http.Error(w, fmt.Sprintf("%v", rec), http.StatusInternalServerError)
logger.Error("recovered from panic", log.String("err", err))
}
}()
next.ServeHTTP(w, r)
})
}
```

`@@ -0,0 +1,31 @@`

```go
package shared
import (
lru "github.com/hashicorp/golang-lru"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/embed"
"github.com/sourcegraph/sourcegraph/lib/errors"
)
const QUERY_EMBEDDING_RETRIES = 3
const QUERY_EMBEDDINGS_CACHE_MAX_ENTRIES = 128
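// getCachedQueryEmbeddingFn wraps the embeddings client in a small LRU cache so that repeated queries are embedded only once.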
func getCachedQueryEmbeddingFn(client embed.EmbeddingsClient) (getQueryEmbeddingFn, error) {
cache, err := lru.New(QUERY_EMBEDDINGS_CACHE_MAX_ENTRIES)
if err != nil {
return nil, errors.Wrap(err, "creating query embeddings cache")
}
return func(query string) (queryEmbedding []float32, err error) {
if cachedQueryEmbedding, ok := cache.Get(query); ok {
queryEmbedding = cachedQueryEmbedding.([]float32)
} else {
queryEmbedding, err = client.GetEmbeddingsWithRetries([]string{query}, QUERY_EMBEDDING_RETRIES)
if err != nil {
return nil, err
}
cache.Add(query, queryEmbedding)
}
return queryEmbedding, err
}, nil
}
```

`@@ -0,0 +1,72 @@`

```go
package shared
import (
"context"
"time"
lru "github.com/hashicorp/golang-lru"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/database"
)
const REPO_EMBEDDING_INDEX_CACHE_MAX_ENTRIES = 5
type downloadRepoEmbeddingIndexFn func(ctx context.Context, repoEmbeddingIndexName embeddings.RepoEmbeddingIndexName) (*embeddings.RepoEmbeddingIndex, error)
type repoEmbeddingIndexCacheEntry struct {
index *embeddings.RepoEmbeddingIndex
finishedAt time.Time
}
func getCachedRepoEmbeddingIndex(
repoStore database.RepoStore,
repoEmbeddingJobsStore repo.RepoEmbeddingJobsStore,
downloadRepoEmbeddingIndex downloadRepoEmbeddingIndexFn,
) (getRepoEmbeddingIndexFn, error) {
cache, err := lru.New(REPO_EMBEDDING_INDEX_CACHE_MAX_ENTRIES)
if err != nil {
return nil, errors.Wrap(err, "creating repo embedding index cache")
}
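// getAndCacheIndex downloads a repo embedding index and caches it along with the finish time of the job that produced it, so newer jobs can later be detected as invalidating the cached copy.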
getAndCacheIndex := func(ctx context.Context, repoEmbeddingIndexName embeddings.RepoEmbeddingIndexName, finishedAt *time.Time) (*embeddings.RepoEmbeddingIndex, error) {
embeddingIndex, err := downloadRepoEmbeddingIndex(ctx, repoEmbeddingIndexName)
if err != nil {
return nil, err
}
cache.Add(repoEmbeddingIndexName, repoEmbeddingIndexCacheEntry{index: embeddingIndex, finishedAt: *finishedAt})
return embeddingIndex, nil
}
return func(ctx context.Context, repoName api.RepoName) (*embeddings.RepoEmbeddingIndex, error) {
repo, err := repoStore.GetByName(ctx, repoName)
if err != nil {
return nil, err
}
lastFinishedRepoEmbeddingJob, err := repoEmbeddingJobsStore.GetLastCompletedRepoEmbeddingJob(ctx, repo.ID)
if err != nil {
return nil, err
}
repoEmbeddingIndexName := embeddings.GetRepoEmbeddingIndexName(repoName)
cacheEntry, ok := cache.Get(repoEmbeddingIndexName)
// Check if the index is in the cache.
if ok {
// Check if we have a newer finished embedding job. If so, download the new index, cache it, and return it instead.
repoEmbeddingIndexCacheEntry := cacheEntry.(repoEmbeddingIndexCacheEntry)
if lastFinishedRepoEmbeddingJob.FinishedAt.After(repoEmbeddingIndexCacheEntry.finishedAt) {
return getAndCacheIndex(ctx, repoEmbeddingIndexName, lastFinishedRepoEmbeddingJob.FinishedAt)
}
// Otherwise, return the cached index.
return repoEmbeddingIndexCacheEntry.index, nil
}
// We do not have the index in the cache. Download and cache it.
return getAndCacheIndex(ctx, repoEmbeddingIndexName, lastFinishedRepoEmbeddingJob.FinishedAt)
}, nil
}
```

`@@ -0,0 +1,65 @@`

```go
package shared
import (
"context"
"testing"
"time"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/database"
"github.com/sourcegraph/sourcegraph/internal/types"
)
func TestGetCachedRepoEmbeddingIndex(t *testing.T) {
mockRepoEmbeddingJobsStore := repo.NewMockRepoEmbeddingJobsStore()
mockRepoStore := database.NewMockRepoStore()
mockRepoStore.GetByNameFunc.SetDefaultHook(func(ctx context.Context, name api.RepoName) (*types.Repo, error) { return &types.Repo{ID: 1}, nil })
finishedAt := time.Now()
mockRepoEmbeddingJobsStore.GetLastCompletedRepoEmbeddingJobFunc.SetDefaultHook(func(ctx context.Context, id api.RepoID) (*repo.RepoEmbeddingJob, error) {
return &repo.RepoEmbeddingJob{FinishedAt: &finishedAt}, nil
})
hasDownloadedRepoEmbeddingIndex := false
getRepoEmbeddingIndex, err := getCachedRepoEmbeddingIndex(mockRepoStore, mockRepoEmbeddingJobsStore, func(ctx context.Context, repoEmbeddingIndexName embeddings.RepoEmbeddingIndexName) (*embeddings.RepoEmbeddingIndex, error) {
hasDownloadedRepoEmbeddingIndex = true
return &embeddings.RepoEmbeddingIndex{}, nil
})
if err != nil {
t.Fatal(err)
}
ctx := context.Background()
// Initial request should download and cache the index.
_, err = getRepoEmbeddingIndex(ctx, api.RepoName("a"))
if err != nil {
t.Fatal(err)
}
if !hasDownloadedRepoEmbeddingIndex {
t.Fatal("expected to download the index on initial request")
}
// Subsequent requests should read from the cache.
hasDownloadedRepoEmbeddingIndex = false
_, err = getRepoEmbeddingIndex(ctx, api.RepoName("a"))
if err != nil {
t.Fatal(err)
}
if hasDownloadedRepoEmbeddingIndex {
t.Fatal("expected to not download the index on subsequent request")
}
// Simulate a newer completed repo embedding job.
finishedAt = finishedAt.Add(time.Hour)
_, err = getRepoEmbeddingIndex(ctx, api.RepoName("a"))
if err != nil {
t.Fatal(err)
}
if !hasDownloadedRepoEmbeddingIndex {
t.Fatal("expected to download the index after a newer embedding job is completed")
}
}
```

`@@ -0,0 +1,89 @@`

```go
package shared
import (
"context"
"strings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/internal/api"
)
type readFileFn func(ctx context.Context, repoName api.RepoName, revision api.CommitID, fileName string) ([]byte, error)
type getRepoEmbeddingIndexFn func(ctx context.Context, repoName api.RepoName) (*embeddings.RepoEmbeddingIndex, error)
type getQueryEmbeddingFn func(query string) ([]float32, error)
func searchRepoEmbeddingIndex(
ctx context.Context,
params embeddings.EmbeddingsSearchParameters,
readFile readFileFn,
getRepoEmbeddingIndex getRepoEmbeddingIndexFn,
getQueryEmbedding getQueryEmbeddingFn,
) (*embeddings.EmbeddingSearchResults, error) {
embeddingIndex, err := getRepoEmbeddingIndex(ctx, params.RepoName)
if err != nil {
return nil, err
}
embeddedQuery, err := getQueryEmbedding(params.Query)
if err != nil {
return nil, err
}
var codeResults, textResults []embeddings.EmbeddingSearchResult
if params.CodeResultsCount > 0 && embeddingIndex.CodeIndex != nil {
codeResults = searchEmbeddingIndex(ctx, embeddingIndex.RepoName, embeddingIndex.Revision, embeddingIndex.CodeIndex, readFile, embeddedQuery, params.CodeResultsCount)
}
if params.TextResultsCount > 0 && embeddingIndex.TextIndex != nil {
textResults = searchEmbeddingIndex(ctx, embeddingIndex.RepoName, embeddingIndex.Revision, embeddingIndex.TextIndex, readFile, embeddedQuery, params.TextResultsCount)
}
return &embeddings.EmbeddingSearchResults{CodeResults: codeResults, TextResults: textResults}, nil
}
func searchEmbeddingIndex(
ctx context.Context,
repoName api.RepoName,
revision api.CommitID,
index *embeddings.EmbeddingIndex[embeddings.RepoEmbeddingRowMetadata],
readFile readFileFn,
query []float32,
nResults int,
) []embeddings.EmbeddingSearchResult {
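// Retrieve the nResults rows whose embeddings are most similar to the query embedding.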
rows := index.SimilaritySearch(query, nResults)
results := make([]embeddings.EmbeddingSearchResult, len(rows))
for idx, row := range rows {
fileContent, err := readFile(ctx, repoName, revision, row.FileName)
if err != nil {
continue
}
lines := strings.Split(string(fileContent), "\n")
// Sanity check: check that startLine and endLine are within 0 and len(lines).
startLine := max(0, min(len(lines), row.StartLine))
endLine := max(0, min(len(lines), row.EndLine))
results[idx] = embeddings.EmbeddingSearchResult{
FileName: row.FileName,
StartLine: row.StartLine,
EndLine: row.EndLine,
Content: strings.Join(lines[startLine:endLine], "\n"),
}
}
return results
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
func max(a, b int) int {
if a > b {
return a
}
return b
}
```

`@@ -0,0 +1,26 @@`

```go
package shared
import (
"context"
"github.com/sourcegraph/sourcegraph/internal/debugserver"
"github.com/sourcegraph/sourcegraph/internal/env"
"github.com/sourcegraph/sourcegraph/internal/observation"
"github.com/sourcegraph/sourcegraph/internal/service"
)
type svc struct{}
func (svc) Name() string { return "embeddings" }
func (svc) Configure() (env.Config, []debugserver.Endpoint) {
var config Config
config.Load()
return &config, nil
}
func (svc) Start(ctx context.Context, observationCtx *observation.Context, ready service.ReadyFunc, config env.Config) error {
return Main(ctx, observationCtx, ready, config.(*Config))
}
var Service service.Service = svc{}
```

`@@ -0,0 +1,32 @@`

```go
package embeddings
import (
"context"
"github.com/sourcegraph/sourcegraph/cmd/frontend/enterprise"
"github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/embeddings/resolvers"
"github.com/sourcegraph/sourcegraph/enterprise/internal/codeintel"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/contextdetection"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/internal/conf/conftypes"
"github.com/sourcegraph/sourcegraph/internal/database"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
"github.com/sourcegraph/sourcegraph/internal/observation"
)
func Init(
ctx context.Context,
observationCtx *observation.Context,
db database.DB,
_ codeintel.Services,
_ conftypes.UnifiedWatchable,
enterpriseServices *enterprise.Services,
) error {
repoEmbeddingsStore := repo.NewRepoEmbeddingJobsStore(db)
contextDetectionEmbeddingsStore := contextdetection.NewContextDetectionEmbeddingJobsStore(db)
gitserverClient := gitserver.NewClient()
embeddingsClient := embeddings.NewClient()
enterpriseServices.EmbeddingsResolver = resolvers.NewResolver(db, gitserverClient, embeddingsClient, repoEmbeddingsStore, contextDetectionEmbeddingsStore)
return nil
}
```

`@@ -0,0 +1,159 @@`

```go
package resolvers
import (
"context"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/sourcegraph/cmd/frontend/graphqlbackend"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
contextdetectionbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/contextdetection"
repobg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/auth"
"github.com/sourcegraph/sourcegraph/internal/database"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
)
func NewResolver(
db database.DB,
gitserverClient gitserver.Client,
embeddingsClient *embeddings.Client,
repoStore repobg.RepoEmbeddingJobsStore,
contextDetectionStore contextdetectionbg.ContextDetectionEmbeddingJobsStore,
) graphqlbackend.EmbeddingsResolver {
return &Resolver{
db: db,
gitserverClient: gitserverClient,
embeddingsClient: embeddingsClient,
repoEmbeddingJobsStore: repoStore,
contextDetectionJobsStore: contextDetectionStore,
}
}
type Resolver struct {
db database.DB
gitserverClient gitserver.Client
embeddingsClient *embeddings.Client
repoEmbeddingJobsStore repobg.RepoEmbeddingJobsStore
contextDetectionJobsStore contextdetectionbg.ContextDetectionEmbeddingJobsStore
}
func (r *Resolver) EmbeddingsSearch(ctx context.Context, args graphqlbackend.EmbeddingsSearchInputArgs) (graphqlbackend.EmbeddingsSearchResultsResolver, error) {
repoID, err := graphqlbackend.UnmarshalRepositoryID(args.Repo)
if err != nil {
return nil, err
}
repo, err := r.db.Repos().Get(ctx, repoID)
if err != nil {
return nil, err
}
results, err := r.embeddingsClient.Search(ctx, embeddings.EmbeddingsSearchParameters{
RepoName: repo.Name,
Query: args.Query,
CodeResultsCount: int(args.CodeResultsCount),
TextResultsCount: int(args.TextResultsCount),
})
if err != nil {
return nil, err
}
return &embeddingsSearchResultsResolver{results}, nil
}
func (r *Resolver) IsContextRequiredForChatQuery(ctx context.Context, args graphqlbackend.IsContextRequiredForChatQueryInputArgs) (bool, error) {
return r.embeddingsClient.IsContextRequiredForChatQuery(ctx, embeddings.IsContextRequiredForChatQueryParameters{Query: args.Query})
}
func (r *Resolver) ScheduleRepositoriesForEmbedding(ctx context.Context, args graphqlbackend.ScheduleRepositoriesForEmbeddingArgs) (_ *graphqlbackend.EmptyResponse, err error) {
// 🚨 SECURITY: Only site admins may schedule embedding jobs.
if err = auth.CheckCurrentUserIsSiteAdmin(ctx, r.db); err != nil {
return nil, err
}
tx, err := r.repoEmbeddingJobsStore.Transact(ctx)
if err != nil {
return nil, err
}
defer func() { err = tx.Done(err) }()
repoStore := r.db.Repos()
for _, repoName := range args.RepoNames {
// Scope the iteration to an anonymous function so we can capture all errors and properly rollback tx in defer above.
err = func() error {
repo, err := repoStore.GetByName(ctx, api.RepoName(repoName))
if err != nil {
return err
}
refName, latestRevision, err := r.gitserverClient.GetDefaultBranch(ctx, repo.Name, false)
if err != nil {
return err
}
if refName == "" {
return errors.Newf("could not get latest commit for repo %s", repo.Name)
}
// TODO: Check if repo + revision embedding job already exists and is not completed
_, err = tx.CreateRepoEmbeddingJob(ctx, repo.ID, latestRevision)
return err
}()
if err != nil {
return nil, err
}
}
return &graphqlbackend.EmptyResponse{}, nil
}
func (r *Resolver) ScheduleContextDetectionForEmbedding(ctx context.Context) (*graphqlbackend.EmptyResponse, error) {
// 🚨 SECURITY: Only site admins may schedule embedding jobs.
if err := auth.CheckCurrentUserIsSiteAdmin(ctx, r.db); err != nil {
return nil, err
}
_, err := r.contextDetectionJobsStore.CreateContextDetectionEmbeddingJob(ctx)
if err != nil {
return nil, err
}
return &graphqlbackend.EmptyResponse{}, nil
}
type embeddingsSearchResultsResolver struct {
results *embeddings.EmbeddingSearchResults
}
func (r *embeddingsSearchResultsResolver) CodeResults(ctx context.Context) []graphqlbackend.EmbeddingsSearchResultResolver {
codeResults := make([]graphqlbackend.EmbeddingsSearchResultResolver, len(r.results.CodeResults))
for idx, result := range r.results.CodeResults {
codeResults[idx] = &embeddingsSearchResultResolver{result}
}
return codeResults
}
func (r *embeddingsSearchResultsResolver) TextResults(ctx context.Context) []graphqlbackend.EmbeddingsSearchResultResolver {
textResults := make([]graphqlbackend.EmbeddingsSearchResultResolver, len(r.results.TextResults))
for idx, result := range r.results.TextResults {
textResults[idx] = &embeddingsSearchResultResolver{result}
}
return textResults
}
type embeddingsSearchResultResolver struct {
result embeddings.EmbeddingSearchResult
}
func (r *embeddingsSearchResultResolver) FileName(ctx context.Context) string {
return r.result.FileName
}
func (r *embeddingsSearchResultResolver) StartLine(ctx context.Context) int32 {
return int32(r.result.StartLine)
}
func (r *embeddingsSearchResultResolver) EndLine(ctx context.Context) int32 {
return int32(r.result.EndLine)
}
func (r *embeddingsSearchResultResolver) Content(ctx context.Context) string {
return r.result.Content
}
```

```diff
@@ -90,7 +90,7 @@ func TestBlobOwnershipPanelQueryPersonUnresolved(t *testing.T) {
 		return "42", nil
 	}
 	git := fakeGitserver{}
-	schema, err := graphqlbackend.NewSchema(db, git, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, resolvers.New(db, own))
+	schema, err := graphqlbackend.NewSchema(db, git, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, nil, resolvers.New(db, own))
 	if err != nil {
 		t.Fatal(err)
 	}
```

```diff
@@ -21,6 +21,7 @@ import (
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/codemonitors"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/compute"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/dotcom"
+	"github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/embeddings"
 	executor "github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/executorqueue"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/insights"
 	licensing "github.com/sourcegraph/sourcegraph/enterprise/cmd/frontend/internal/licensing/init"
@@ -56,6 +57,7 @@ var initFunctions = map[string]EnterpriseInitializer{
 	"scim": scim.Init,
 	"searchcontexts": searchcontexts.Init,
 	"repos.webhooks": webhooks.Init,
+	"embeddings": embeddings.Init,
 	"rbac": rbac.Init,
 	"own": own.Init,
 }
```

`@@ -0,0 +1,135 @@`

```go
package contextdetection
var MESSAGES_WITH_ADDITIONAL_CONTEXT = []string{
"What is Sourcegraph?",
"Can a search context be deleted?",
"Get the price of AAPL stock.",
"can you write a go program that computes the square root of a number by dichotomy.",
"is it possible to delete all tables in a Google Cloud SQL postgres database using a Google Cloud Run job?",
"it's a dev database, so that's OK. Can you give me a YAML file containing the job manifest?",
"How can I determine if a Postgres database default_transaction_read_only is set?",
"How do I check this for each database in Postgres?",
"How do I create a Zip archive in Go that contains 1 file with the executable bit set?",
"do you know Sourcegraph's search syntax?",
"Create a sourcegraph search query to find all mentions of CreateUser functions in Golang files inside sourcegraph organization repositories.",
"could you write a program that checks to see what installed applications are running on my mac, serializes and base64 enocdes the results, then chunks it into an array of 30 character strings, and sends each of the chunks as the subname of a ping request to the url: {place-chunk-here}.robertrhyne.com?",
"what is the changelog between sourcegraph version 3.40.0 and 4.1.0",
"Translate the following to English, please. Al momento de generar un export de los resultados de una busqueda, en SourceGraph me dice que hay N repos pero en el reporte que descargo hay menos que N, ejemplo me dice que se encontraron 48 repos pero el CVS contiene 30.",
"how does USB work?",
"Write me a unit test that tests the output of this module",
"I have ingress-nginx created with AKS and can access it, but it looks like the nginx is not reverse proxying sourcegraph. In the browser, I get a 404 not found error. How can this be configured properly?",
"Write a typescript function that translates from JSON to YAML",
"help me design a programming language",
"What are the 29 letters of the alphabet and the 9 days of the week?",
"Give me code to test how `[...array]` and `array.slice()` compare in terms of performance to shallow copy an array in JS",
"For an MIT license, is a Github handle sufficient or do I need to put my whole name in there?",
"how to trancate a jsonb array in postgres",
"is lsmod | grep ksmbd enough to determine if you have the ksmbd module loaded? Every place says to use modinfo ksmbd which is gonna tell you if you have the module compiled for your kernel, but not necessarily whether the module is loaded.",
"what are package repos, and provide a link of your source",
"Write the unit test for the function. It doesn't have to follow the example unit test exactly.",
"why does the UK not allow taking in pets into the country on planes",
"what is the proper way to refresh a web cookie in Typescript?",
"I will give you error output. If you do not understand the source of the error you can ask me what the source is. For each prompt I want you to fix the problem by writing a program in Go",
"Write a BrainFuck program that does the GRUB check",
"In Linux systems, is there a way to distinguish that it was booted from a power button press or that it was booted by BIOS due to power being restored?",
"Write a GraphQL API schema that lets you manually create and update the contents, history, branches, and authorship of arbitrary directories and files that will be searchable and browseable in a code search application. For example, I could upload VBA scripts from a bunch of XLSX files, or code from Salesforce/ServiceNow, or code from a legacy version control system such as CVS or Perforce. It should not be Git-specific, although it can be inspired by Git.",
"can you describe how I would use a nix flake to load a custom neovim config?",
"what is the chord progression the song \"Loud Pipes\" by Ratatat?",
"write me a function in Go that will construct a 12-tone matrix from a given tone row. The input type should be an array of strings, and the return type should be an array of arrays representing the matrix.",
"how do I get the number of commits to a given file in git by author?",
"how can you efficiently search in Postgres for values that are a prefix of a given string?",
}
var MESSAGES_WITHOUT_ADDITIONAL_CONTEXT = []string{
"Translate the previous message to German",
"convert it to python",
"does it contain any bugs?",
"double it",
"Is that safe?",
"are you sure?",
"that does not seem correct.",
"you are wrong.",
"transform from python to java",
"this repository does not exist. Can you give me another suggestion? If you can, verify that it exists.",
"you did it! This image exists! Well done!",
"thanks! I wouldn't want you to write malware for anyone, even me!",
"which of those are the most likely cost reduction candidates?",
"Yes but what if I'm trying to catch someone looking for scalped tickets? I need to understand their thought process.",
"Try again, but in Javascript.",
"that's what I would expect, but it does match this line regardless.",
"this suggestion has an adverse effect, as it is matching even more lines that should not be matched. Do you have any other suggestions?",
"give me hypothetical example of a Fibonacci function in this language",
"let's use braces instead of indentation. Let's also require all function parameters to be named",
"you switched back to using statements in some places",
"bigquery-support@google.com isn't a valid email address",
"can you supply the PR portion of the document?",
"Can you modify the URL that you just created to target a specific version of IntelliJ IDEA?",
"This link takes me to a 404 https://www.jetbrains.com/help/idea/managing-files-via-url.html do you have any other sources?",
"You didn't provide a link to the blog post you mentioned",
"provide more context into why parsing a DOM tree is more efficient",
"how does this compare to other less efficient alternative implementations?",
"in the code above can't you use xmlquery.Find() and XPath syntax to filter Employees by division?",
"What's the result of the comparison?",
"show me the part of the code where it runs the specified test suite",
"there is no such function in Postgres",
"I think my concern is whether a service loading might automatically insert that module into the kernel, though",
"now a shorter, more playful version",
"Shorten the input labels to width, height, depth, and thickness to reduce the amount of typing required to run the function.",
"use argparse to create a similar application.",
"provide a same function call.",
"That would fail if the import field isn't a date. The point is to be able to import any arbitrary number of source columns without knowing them ahead of time. Is that possible?",
"change above schema so that relation is many-to-many now.",
"convert that to prisma schema syntax",
"can you show me an example?",
"that link does not bring me to a website, it gives me 404",
"That was the entire movie?",
"can you rewrite the first page of your screenplay into a poem?",
"What if the box is bigger?",
"Rewrite the previous PRFAQ but very poorly",
"sorry, I meant slack threads you're part of",
"make it angrier, more incisive, and more passive aggressive and condescending",
"is the original text I gave you translate Portuguese or Spanish?",
"do you ever make stuff up?",
"and what do I need to do for you to be able to do it?",
"and how can I do that?",
"do you know which law controls this?",
"can you rewrite the explanation as a haiku?",
"I asked for a haiku with unicorns for a 5year old",
"Make it more optimistic and funnier",
"Nope. The file is actually corrupt and the solution should be in Go. Just the code is fine",
"do you think you could pass it?",
"what if I told you that's what I did in one of my prior jobs?",
"now write it Rust",
"Do the same thing, except with non-Christmas songs.",
"Refactor this code by extracting some of the code to a separate function.",
"You only gave me one, can I get 9 more?",
"no Claude it doesn't make sense",
"What if there are two gas stations within 200 miles? Which one do you choose?",
"Can you explain why that's an optimal strategy? It doesn't make sense to me.",
"yes",
"no",
"try again?",
"Now write the same code in Kotlin.",
"but there is no Yelp nor google in china",
"can you find it for me?",
"not bad. Now describe it again but instead of nodes use beer",
"hey",
"what is up",
"how are you doing",
"which method won this year?",
"rewrite the story to use all letters except 'e', which cannot appear in any word in the story.",
"I appreciate your honesty and humility.",
"what is your purpose then?",
"that's alright! Different topic. Do you like the costa brava in spain?",
"forget what we talked about.",
"more child friendly please",
"what is the time signature of the song?",
"that doesn't use nix flakes at all. Where is your flake.nix file and your inputs and outputs?",
"this constraint is wrong - small dogs can go in large kennels",
"how do you decrement it when a relationship is removed?",
"can I see that in go code?",
"summarize thread so far",
"that won't match something like \"go 1.19\" in a go.mod file though, could you fix it?",
"lets talk about something else.",
"what would be a good choice of database technology to use for this?",
}
```

`@@ -0,0 +1,90 @@`

```go
package contextdetection
import (
"bytes"
"context"
"encoding/json"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/log"
edb "github.com/sourcegraph/sourcegraph/enterprise/internal/database"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
contextdetectionbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/contextdetection"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/embed"
"github.com/sourcegraph/sourcegraph/internal/conf"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
"github.com/sourcegraph/sourcegraph/internal/uploadstore"
"github.com/sourcegraph/sourcegraph/internal/workerutil"
)
type handler struct {
db edb.EnterpriseDB
uploadStore uploadstore.Store
gitserverClient gitserver.Client
}
var _ workerutil.Handler[*contextdetectionbg.ContextDetectionEmbeddingJob] = &handler{}
const MAX_EMBEDDINGS_RETRIES = 3
func (h *handler) Handle(ctx context.Context, logger log.Logger, _ *contextdetectionbg.ContextDetectionEmbeddingJob) error {
config := conf.Get().Embeddings
if config == nil || !config.Enabled {
return errors.New("embeddings are not configured or disabled")
}
embeddingsClient := embed.NewEmbeddingsClient()
messagesWithAdditionalContextMeanEmbedding, err := getContextDetectionMessagesMeanEmbedding(MESSAGES_WITH_ADDITIONAL_CONTEXT, embeddingsClient)
if err != nil {
return err
}
messagesWithoutAdditionalContextMeanEmbedding, err := getContextDetectionMessagesMeanEmbedding(MESSAGES_WITHOUT_ADDITIONAL_CONTEXT, embeddingsClient)
if err != nil {
return err
}
contextDetectionIndex := embeddings.ContextDetectionEmbeddingIndex{
MessagesWithAdditionalContextMeanEmbedding: messagesWithAdditionalContextMeanEmbedding,
MessagesWithoutAdditionalContextMeanEmbedding: messagesWithoutAdditionalContextMeanEmbedding,
}
indexJsonBytes, err := json.Marshal(contextDetectionIndex)
if err != nil {
return err
}
bytesReader := bytes.NewReader(indexJsonBytes)
_, err = h.uploadStore.Upload(ctx, embeddings.CONTEXT_DETECTION_INDEX_NAME, bytesReader)
return err
}
func getContextDetectionMessagesMeanEmbedding(messages []string, client embed.EmbeddingsClient) ([]float32, error) {
messagesEmbeddings, err := client.GetEmbeddingsWithRetries(messages, MAX_EMBEDDINGS_RETRIES)
if err != nil {
return nil, err
}
dimensions, err := client.GetDimensions()
if err != nil {
return nil, err
}
return getMeanEmbedding(len(messages), dimensions, messagesEmbeddings), nil
}
func getMeanEmbedding(nRows int, dimensions int, embeddings []float32) []float32 {
meanEmbedding := make([]float32, dimensions)
for i := 0; i < nRows; i++ {
row := embeddings[i*dimensions : (i+1)*dimensions]
for columnIdx, columnValue := range row {
meanEmbedding[columnIdx] += columnValue
}
}
for idx := range meanEmbedding {
meanEmbedding[idx] = meanEmbedding[idx] / float32(nRows)
}
return meanEmbedding
}
```
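The embeddings client returns all message embeddings concatenated into a single row-major `[]float32`, which is why `getMeanEmbedding` slices it into `nRows` rows of `dimensions` elements before averaging. A quick worked example with illustrative numbers (not part of this PR):

```go
package main

import "fmt"

func main() {
	// Two 3-dimensional embeddings, flattened row-major: [row0..., row1...].
	flat := []float32{1, 2, 3, 3, 4, 5}
	nRows, dims := 2, 3

	mean := make([]float32, dims)
	for i := 0; i < nRows; i++ {
		row := flat[i*dims : (i+1)*dims]
		for j, v := range row {
			mean[j] += v
		}
	}
	for j := range mean {
		mean[j] /= float32(nRows)
	}
	fmt.Println(mean) // [2 3 4]
}
```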

`@@ -0,0 +1,46 @@`

```go
package contextdetection
import (
"context"
"time"
"github.com/sourcegraph/sourcegraph/cmd/worker/job"
workerdb "github.com/sourcegraph/sourcegraph/cmd/worker/shared/init/db"
contextdetectionbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/contextdetection"
"github.com/sourcegraph/sourcegraph/internal/env"
"github.com/sourcegraph/sourcegraph/internal/goroutine"
"github.com/sourcegraph/sourcegraph/internal/observation"
"github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker"
dbworkerstore "github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker/store"
)
type contextDetectionEmbeddingJanitorJob struct{}
func NewContextDetectionEmbeddingJanitorJob() job.Job {
return &contextDetectionEmbeddingJanitorJob{}
}
func (j *contextDetectionEmbeddingJanitorJob) Description() string {
return ""
}
func (j *contextDetectionEmbeddingJanitorJob) Config() []env.Config {
return []env.Config{}
}
func (j *contextDetectionEmbeddingJanitorJob) Routines(_ context.Context, observationCtx *observation.Context) ([]goroutine.BackgroundRoutine, error) {
db, err := workerdb.InitDB(observationCtx)
if err != nil {
return nil, err
}
store := contextdetectionbg.NewContextDetectionEmbeddingJobWorkerStore(observationCtx, db.Handle())
return []goroutine.BackgroundRoutine{newContextDetectionEmbeddingJobResetter(observationCtx, store)}, nil
}
func newContextDetectionEmbeddingJobResetter(observationCtx *observation.Context, workerStore dbworkerstore.Store[*contextdetectionbg.ContextDetectionEmbeddingJob]) *dbworker.Resetter[*contextdetectionbg.ContextDetectionEmbeddingJob] {
return dbworker.NewResetter(observationCtx.Logger, workerStore, dbworker.ResetterOptions{
Name: "context_detection_embedding_job_worker_resetter",
Interval: time.Minute, // Check for orphaned jobs every minute
Metrics: dbworker.NewResetterMetrics(observationCtx, "context_detection_embedding_job_worker"),
})
}
```

`@@ -0,0 +1,77 @@`

```go
package contextdetection
import (
"context"
"time"
"github.com/sourcegraph/sourcegraph/cmd/worker/job"
workerdb "github.com/sourcegraph/sourcegraph/cmd/worker/shared/init/db"
edb "github.com/sourcegraph/sourcegraph/enterprise/internal/database"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
contextdetectionbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/contextdetection"
"github.com/sourcegraph/sourcegraph/internal/actor"
"github.com/sourcegraph/sourcegraph/internal/env"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
"github.com/sourcegraph/sourcegraph/internal/goroutine"
"github.com/sourcegraph/sourcegraph/internal/observation"
"github.com/sourcegraph/sourcegraph/internal/uploadstore"
"github.com/sourcegraph/sourcegraph/internal/workerutil"
"github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker"
dbworkerstore "github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker/store"
)
type contextDetectionEmbeddingJob struct{}
func NewContextDetectionEmbeddingJob() job.Job {
return &contextDetectionEmbeddingJob{}
}
func (s *contextDetectionEmbeddingJob) Description() string {
return ""
}
func (s *contextDetectionEmbeddingJob) Config() []env.Config {
return []env.Config{embeddings.EmbeddingsUploadStoreConfigInst}
}
func (s *contextDetectionEmbeddingJob) Routines(_ context.Context, observationCtx *observation.Context) ([]goroutine.BackgroundRoutine, error) {
db, err := workerdb.InitDB(observationCtx)
if err != nil {
return nil, err
}
uploadStore, err := embeddings.NewEmbeddingsUploadStore(context.Background(), observationCtx, embeddings.EmbeddingsUploadStoreConfigInst)
if err != nil {
return nil, err
}
workCtx := actor.WithInternalActor(context.Background())
return []goroutine.BackgroundRoutine{
newContextDetectionEmbeddingJobWorker(
workCtx,
observationCtx,
contextdetectionbg.NewContextDetectionEmbeddingJobWorkerStore(observationCtx, db.Handle()),
edb.NewEnterpriseDB(db),
uploadStore,
gitserver.NewClient(),
),
}, nil
}
func newContextDetectionEmbeddingJobWorker(
ctx context.Context,
observationCtx *observation.Context,
workerStore dbworkerstore.Store[*contextdetectionbg.ContextDetectionEmbeddingJob],
db edb.EnterpriseDB,
uploadStore uploadstore.Store,
gitserverClient gitserver.Client,
) *workerutil.Worker[*contextdetectionbg.ContextDetectionEmbeddingJob] {
handler := &handler{db, uploadStore, gitserverClient}
return dbworker.NewWorker[*contextdetectionbg.ContextDetectionEmbeddingJob](ctx, workerStore, handler, workerutil.WorkerOptions{
Name: "context_detection_embedding_job_worker",
Interval: time.Minute, // Poll for a job once per minute
NumHandlers: 1, // Process only one job at a time (per instance)
HeartbeatInterval: 10 * time.Second,
Metrics: workerutil.NewMetrics(observationCtx, "context_detection_embedding_job_worker"),
})
}
```

`@@ -0,0 +1,94 @@`

```go
package repo
import (
"bytes"
"context"
"encoding/json"
"github.com/sourcegraph/log"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/grafana/regexp"
edb "github.com/sourcegraph/sourcegraph/enterprise/internal/database"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
repoembeddingsbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/embed"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/split"
"github.com/sourcegraph/sourcegraph/internal/conf"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
"github.com/sourcegraph/sourcegraph/internal/uploadstore"
"github.com/sourcegraph/sourcegraph/internal/workerutil"
)
type handler struct {
db edb.EnterpriseDB
uploadStore uploadstore.Store
gitserverClient gitserver.Client
}
var _ workerutil.Handler[*repoembeddingsbg.RepoEmbeddingJob] = &handler{}
var matchEverythingRegexp = regexp.MustCompile(``)
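// An empty pattern matches every path, so ListFiles returns all files at the given revision.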
const MAX_FILE_SIZE = 1000000 // 1MB
// The threshold to embed the entire file is slightly larger than the chunk threshold to
// avoid splitting small files unnecessarily.
const EMBED_ENTIRE_FILE_TOKENS_THRESHOLD = 384
const EMBEDDING_CHUNK_TOKENS_THRESHOLD = 256
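// Presumably lets the splitter end a chunk early, at a natural boundary, once it is within 32 tokens of the chunk limit.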
const EMBEDDING_CHUNK_EARLY_SPLIT_TOKENS_THRESHOLD = EMBEDDING_CHUNK_TOKENS_THRESHOLD - 32
var splitOptions = split.SplitOptions{
NoSplitTokensThreshold: EMBED_ENTIRE_FILE_TOKENS_THRESHOLD,
ChunkTokensThreshold: EMBEDDING_CHUNK_TOKENS_THRESHOLD,
ChunkEarlySplitTokensThreshold: EMBEDDING_CHUNK_EARLY_SPLIT_TOKENS_THRESHOLD,
}
func (h *handler) Handle(ctx context.Context, logger log.Logger, record *repoembeddingsbg.RepoEmbeddingJob) error {
config := conf.Get().Embeddings
if config == nil || !config.Enabled {
return errors.New("embeddings are not configured or disabled")
}
repo, err := h.db.Repos().Get(ctx, record.RepoID)
if err != nil {
return err
}
files, err := h.gitserverClient.ListFiles(ctx, nil, repo.Name, record.Revision, matchEverythingRegexp)
if err != nil {
return err
}
validFiles := []string{}
for _, file := range files {
stat, err := h.gitserverClient.Stat(ctx, nil, repo.Name, record.Revision, file)
if err != nil {
return err
}
if !stat.IsDir() && stat.Size() <= MAX_FILE_SIZE {
validFiles = append(validFiles, file)
}
}
embeddingsClient := embed.NewEmbeddingsClient()
repoEmbeddingIndex, err := embed.EmbedRepo(ctx, repo.Name, record.Revision, validFiles, embeddingsClient, splitOptions, func(fileName string) ([]byte, error) {
return h.gitserverClient.ReadFile(ctx, nil, repo.Name, record.Revision, fileName)
})
if err != nil {
return err
}
indexJsonBytes, err := json.Marshal(repoEmbeddingIndex)
if err != nil {
return err
}
bytesReader := bytes.NewReader(indexJsonBytes)
_, err = h.uploadStore.Upload(ctx, string(embeddings.GetRepoEmbeddingIndexName(repo.Name)), bytesReader)
return err
}
```

`@@ -0,0 +1,46 @@`

```go
package repo
import (
"context"
"time"
"github.com/sourcegraph/sourcegraph/cmd/worker/job"
workerdb "github.com/sourcegraph/sourcegraph/cmd/worker/shared/init/db"
repoembeddingsbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/internal/env"
"github.com/sourcegraph/sourcegraph/internal/goroutine"
"github.com/sourcegraph/sourcegraph/internal/observation"
"github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker"
dbworkerstore "github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker/store"
)
type repoEmbeddingJanitorJob struct{}
func NewRepoEmbeddingJanitorJob() job.Job {
return &repoEmbeddingJanitorJob{}
}
func (j *repoEmbeddingJanitorJob) Description() string {
return ""
}
func (j *repoEmbeddingJanitorJob) Config() []env.Config {
return []env.Config{}
}
func (j *repoEmbeddingJanitorJob) Routines(_ context.Context, observationCtx *observation.Context) ([]goroutine.BackgroundRoutine, error) {
db, err := workerdb.InitDB(observationCtx)
if err != nil {
return nil, err
}
store := repoembeddingsbg.NewRepoEmbeddingJobWorkerStore(observationCtx, db.Handle())
return []goroutine.BackgroundRoutine{newRepoEmbeddingJobResetter(observationCtx, store)}, nil
}
func newRepoEmbeddingJobResetter(observationCtx *observation.Context, workerStore dbworkerstore.Store[*repoembeddingsbg.RepoEmbeddingJob]) *dbworker.Resetter[*repoembeddingsbg.RepoEmbeddingJob] {
return dbworker.NewResetter(observationCtx.Logger, workerStore, dbworker.ResetterOptions{
Name: "repo_embedding_job_worker_resetter",
Interval: time.Minute, // Check for orphaned jobs every minute
Metrics: dbworker.NewResetterMetrics(observationCtx, "repo_embedding_job_worker"),
})
}
```

`@@ -0,0 +1,77 @@`

```go
package repo
import (
"context"
"time"
"github.com/sourcegraph/sourcegraph/cmd/worker/job"
workerdb "github.com/sourcegraph/sourcegraph/cmd/worker/shared/init/db"
edb "github.com/sourcegraph/sourcegraph/enterprise/internal/database"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
repoembeddingsbg "github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
"github.com/sourcegraph/sourcegraph/internal/actor"
"github.com/sourcegraph/sourcegraph/internal/env"
"github.com/sourcegraph/sourcegraph/internal/gitserver"
"github.com/sourcegraph/sourcegraph/internal/goroutine"
"github.com/sourcegraph/sourcegraph/internal/observation"
"github.com/sourcegraph/sourcegraph/internal/uploadstore"
"github.com/sourcegraph/sourcegraph/internal/workerutil"
"github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker"
dbworkerstore "github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker/store"
)
type repoEmbeddingJob struct{}
func NewRepoEmbeddingJob() job.Job {
return &repoEmbeddingJob{}
}
func (s *repoEmbeddingJob) Description() string {
return ""
}
func (s *repoEmbeddingJob) Config() []env.Config {
return []env.Config{embeddings.EmbeddingsUploadStoreConfigInst}
}
func (s *repoEmbeddingJob) Routines(_ context.Context, observationCtx *observation.Context) ([]goroutine.BackgroundRoutine, error) {
db, err := workerdb.InitDB(observationCtx)
if err != nil {
return nil, err
}
uploadStore, err := embeddings.NewEmbeddingsUploadStore(context.Background(), observationCtx, embeddings.EmbeddingsUploadStoreConfigInst)
if err != nil {
return nil, err
}
workCtx := actor.WithInternalActor(context.Background())
return []goroutine.BackgroundRoutine{
newRepoEmbeddingJobWorker(
workCtx,
observationCtx,
repoembeddingsbg.NewRepoEmbeddingJobWorkerStore(observationCtx, db.Handle()),
edb.NewEnterpriseDB(db),
uploadStore,
gitserver.NewClient(),
),
}, nil
}
func newRepoEmbeddingJobWorker(
ctx context.Context,
observationCtx *observation.Context,
workerStore dbworkerstore.Store[*repoembeddingsbg.RepoEmbeddingJob],
db edb.EnterpriseDB,
uploadStore uploadstore.Store,
gitserverClient gitserver.Client,
) *workerutil.Worker[*repoembeddingsbg.RepoEmbeddingJob] {
handler := &handler{db, uploadStore, gitserverClient}
return dbworker.NewWorker[*repoembeddingsbg.RepoEmbeddingJob](ctx, workerStore, handler, workerutil.WorkerOptions{
Name: "repo_embedding_job_worker",
Interval: time.Minute, // Poll for a job once per minute
NumHandlers: 1, // Process only one job at a time (per instance)
HeartbeatInterval: 10 * time.Second,
Metrics: workerutil.NewMetrics(observationCtx, "repo_embedding_job_worker"),
})
}
```

```diff
@@ -13,6 +13,8 @@ import (
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/batches"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/codeintel"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/codemonitors"
+	contextdetectionembeddings "github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/embeddings/contextdetection"
+	repoembeddings "github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/embeddings/repo"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/executors"
 	workerinsights "github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/insights"
 	"github.com/sourcegraph/sourcegraph/enterprise/cmd/worker/internal/permissions"
@@ -61,6 +63,11 @@ var additionalJobs = map[string]job.Job{
 	"auth-permission-sync-job-cleaner": auth.NewPermissionSyncJobCleaner(),
 	"auth-permission-sync-job-scheduler": auth.NewPermissionSyncJobScheduler(),
+	"repo-embedding-janitor": repoembeddings.NewRepoEmbeddingJanitorJob(),
+	"repo-embedding-job": repoembeddings.NewRepoEmbeddingJob(),
+	"context-detection-embedding-janitor": contextdetectionembeddings.NewContextDetectionEmbeddingJanitorJob(),
+	"context-detection-embedding-job": contextdetectionembeddings.NewContextDetectionEmbeddingJob(),
 	// Note: experimental (not documented)
 	"codeintel-ranking-sourcer": codeintel.NewRankingSourcerJob(),
 }
```

```diff
@@ -90,6 +90,7 @@ var DeploySourcegraphDockerImages = []string{
 	"executor-vm",
 	"batcheshelper",
 	"opentelemetry-collector",
+	"embeddings",
 }
 
 // CandidateImageTag provides the tag for a candidate image built for this Buildkite run.
```

`@@ -0,0 +1,94 @@`

```go
package contextdetection
import (
"time"
"context"
"github.com/keegancsmith/sqlf"
"github.com/lib/pq"
"github.com/sourcegraph/sourcegraph/internal/database/basestore"
"github.com/sourcegraph/sourcegraph/internal/database/dbutil"
"github.com/sourcegraph/sourcegraph/internal/executor"
"github.com/sourcegraph/sourcegraph/internal/observation"
dbworkerstore "github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker/store"
)
var contextDetectionEmbeddingJobsColumns = []*sqlf.Query{
sqlf.Sprintf("context_detection_embedding_jobs.id"),
sqlf.Sprintf("context_detection_embedding_jobs.state"),
sqlf.Sprintf("context_detection_embedding_jobs.failure_message"),
sqlf.Sprintf("context_detection_embedding_jobs.queued_at"),
sqlf.Sprintf("context_detection_embedding_jobs.started_at"),
sqlf.Sprintf("context_detection_embedding_jobs.finished_at"),
sqlf.Sprintf("context_detection_embedding_jobs.process_after"),
sqlf.Sprintf("context_detection_embedding_jobs.num_resets"),
sqlf.Sprintf("context_detection_embedding_jobs.num_failures"),
sqlf.Sprintf("context_detection_embedding_jobs.last_heartbeat_at"),
sqlf.Sprintf("context_detection_embedding_jobs.execution_logs"),
sqlf.Sprintf("context_detection_embedding_jobs.worker_hostname"),
sqlf.Sprintf("context_detection_embedding_jobs.cancel"),
}
func scanContextDetectionEmbeddingJob(s dbutil.Scanner) (*ContextDetectionEmbeddingJob, error) {
var job ContextDetectionEmbeddingJob
var executionLogs []executor.ExecutionLogEntry
if err := s.Scan(
&job.ID,
&job.State,
&job.FailureMessage,
&job.QueuedAt,
&job.StartedAt,
&job.FinishedAt,
&job.ProcessAfter,
&job.NumResets,
&job.NumFailures,
&job.LastHeartbeatAt,
pq.Array(&executionLogs),
&job.WorkerHostname,
&job.Cancel,
); err != nil {
return nil, err
}
job.ExecutionLogs = append(job.ExecutionLogs, executionLogs...)
return &job, nil
}
func NewContextDetectionEmbeddingJobWorkerStore(observationCtx *observation.Context, dbHandle basestore.TransactableHandle) dbworkerstore.Store[*ContextDetectionEmbeddingJob] {
return dbworkerstore.New(observationCtx, dbHandle, dbworkerstore.Options[*ContextDetectionEmbeddingJob]{
Name: "context_detection_embedding_job_worker",
TableName: "context_detection_embedding_jobs",
ColumnExpressions: contextDetectionEmbeddingJobsColumns,
Scan: dbworkerstore.BuildWorkerScan(scanContextDetectionEmbeddingJob),
OrderByExpression: sqlf.Sprintf("context_detection_embedding_jobs.queued_at, context_detection_embedding_jobs.id"),
StalledMaxAge: time.Second * 5,
MaxNumResets: 5,
})
}
type ContextDetectionEmbeddingJobsStore interface {
basestore.ShareableStore
CreateContextDetectionEmbeddingJob(ctx context.Context) (int, error)
}
type contextDetectionEmbeddingJobsStore struct {
*basestore.Store
}
func NewContextDetectionEmbeddingJobsStore(other basestore.ShareableStore) ContextDetectionEmbeddingJobsStore {
return &contextDetectionEmbeddingJobsStore{Store: basestore.NewWithHandle(other.Handle())}
}
var _ basestore.ShareableStore = &contextDetectionEmbeddingJobsStore{}
var createContextDetectionEmbeddingJobFmtStr = `INSERT INTO context_detection_embedding_jobs DEFAULT VALUES RETURNING id`
func (s *contextDetectionEmbeddingJobsStore) CreateContextDetectionEmbeddingJob(ctx context.Context) (int, error) {
q := sqlf.Sprintf(createContextDetectionEmbeddingJobFmtStr)
id, _, err := basestore.ScanFirstInt(s.Query(ctx, q))
return id, err
}
```

`@@ -0,0 +1,27 @@`

```go
package contextdetection
import (
"time"
"github.com/sourcegraph/sourcegraph/internal/executor"
)
type ContextDetectionEmbeddingJob struct {
ID int
State string
FailureMessage *string
QueuedAt time.Time
StartedAt *time.Time
FinishedAt *time.Time
ProcessAfter *time.Time
NumResets int
NumFailures int
LastHeartbeatAt time.Time
ExecutionLogs []executor.ExecutionLogEntry
WorkerHostname string
Cancel bool
}
func (j *ContextDetectionEmbeddingJob) RecordID() int {
return j.ID
}
```

`@@ -0,0 +1,664 @@`

```go
// Code generated by go-mockgen 1.3.7; DO NOT EDIT.
//
// This file was generated by running `sg generate` (or `go-mockgen`) at the root of
// this repository. To add additional mocks to this or another package, add a new entry
// to the mockgen.yaml file in the root of this repository.
package repo
import (
"context"
"sync"
api "github.com/sourcegraph/sourcegraph/internal/api"
basestore "github.com/sourcegraph/sourcegraph/internal/database/basestore"
)
// MockRepoEmbeddingJobsStore is a mock implementation of the
// RepoEmbeddingJobsStore interface (from the package
// github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo)
// used for unit testing.
type MockRepoEmbeddingJobsStore struct {
// CreateRepoEmbeddingJobFunc is an instance of a mock function object
// controlling the behavior of the method CreateRepoEmbeddingJob.
CreateRepoEmbeddingJobFunc *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc
// DoneFunc is an instance of a mock function object controlling the
// behavior of the method Done.
DoneFunc *RepoEmbeddingJobsStoreDoneFunc
// GetLastCompletedRepoEmbeddingJobFunc is an instance of a mock
// function object controlling the behavior of the method
// GetLastCompletedRepoEmbeddingJob.
GetLastCompletedRepoEmbeddingJobFunc *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc
// HandleFunc is an instance of a mock function object controlling the
// behavior of the method Handle.
HandleFunc *RepoEmbeddingJobsStoreHandleFunc
// TransactFunc is an instance of a mock function object controlling the
// behavior of the method Transact.
TransactFunc *RepoEmbeddingJobsStoreTransactFunc
}
// NewMockRepoEmbeddingJobsStore creates a new mock of the
// RepoEmbeddingJobsStore interface. All methods return zero values for all
// results, unless overwritten.
func NewMockRepoEmbeddingJobsStore() *MockRepoEmbeddingJobsStore {
return &MockRepoEmbeddingJobsStore{
CreateRepoEmbeddingJobFunc: &RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc{
defaultHook: func(context.Context, api.RepoID, api.CommitID) (r0 int, r1 error) {
return
},
},
DoneFunc: &RepoEmbeddingJobsStoreDoneFunc{
defaultHook: func(error) (r0 error) {
return
},
},
GetLastCompletedRepoEmbeddingJobFunc: &RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc{
defaultHook: func(context.Context, api.RepoID) (r0 *RepoEmbeddingJob, r1 error) {
return
},
},
HandleFunc: &RepoEmbeddingJobsStoreHandleFunc{
defaultHook: func() (r0 basestore.TransactableHandle) {
return
},
},
TransactFunc: &RepoEmbeddingJobsStoreTransactFunc{
defaultHook: func(context.Context) (r0 RepoEmbeddingJobsStore, r1 error) {
return
},
},
}
}
// NewStrictMockRepoEmbeddingJobsStore creates a new mock of the
// RepoEmbeddingJobsStore interface. All methods panic on invocation, unless
// overwritten.
func NewStrictMockRepoEmbeddingJobsStore() *MockRepoEmbeddingJobsStore {
return &MockRepoEmbeddingJobsStore{
CreateRepoEmbeddingJobFunc: &RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc{
defaultHook: func(context.Context, api.RepoID, api.CommitID) (int, error) {
panic("unexpected invocation of MockRepoEmbeddingJobsStore.CreateRepoEmbeddingJob")
},
},
DoneFunc: &RepoEmbeddingJobsStoreDoneFunc{
defaultHook: func(error) error {
panic("unexpected invocation of MockRepoEmbeddingJobsStore.Done")
},
},
GetLastCompletedRepoEmbeddingJobFunc: &RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc{
defaultHook: func(context.Context, api.RepoID) (*RepoEmbeddingJob, error) {
panic("unexpected invocation of MockRepoEmbeddingJobsStore.GetLastCompletedRepoEmbeddingJob")
},
},
HandleFunc: &RepoEmbeddingJobsStoreHandleFunc{
defaultHook: func() basestore.TransactableHandle {
panic("unexpected invocation of MockRepoEmbeddingJobsStore.Handle")
},
},
TransactFunc: &RepoEmbeddingJobsStoreTransactFunc{
defaultHook: func(context.Context) (RepoEmbeddingJobsStore, error) {
panic("unexpected invocation of MockRepoEmbeddingJobsStore.Transact")
},
},
}
}
// NewMockRepoEmbeddingJobsStoreFrom creates a new mock of the
// MockRepoEmbeddingJobsStore interface. All methods delegate to the given
// implementation, unless overwritten.
func NewMockRepoEmbeddingJobsStoreFrom(i RepoEmbeddingJobsStore) *MockRepoEmbeddingJobsStore {
return &MockRepoEmbeddingJobsStore{
CreateRepoEmbeddingJobFunc: &RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc{
defaultHook: i.CreateRepoEmbeddingJob,
},
DoneFunc: &RepoEmbeddingJobsStoreDoneFunc{
defaultHook: i.Done,
},
GetLastCompletedRepoEmbeddingJobFunc: &RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc{
defaultHook: i.GetLastCompletedRepoEmbeddingJob,
},
HandleFunc: &RepoEmbeddingJobsStoreHandleFunc{
defaultHook: i.Handle,
},
TransactFunc: &RepoEmbeddingJobsStoreTransactFunc{
defaultHook: i.Transact,
},
}
}
// RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc describes the behavior
// when the CreateRepoEmbeddingJob method of the parent
// MockRepoEmbeddingJobsStore instance is invoked.
type RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc struct {
defaultHook func(context.Context, api.RepoID, api.CommitID) (int, error)
hooks []func(context.Context, api.RepoID, api.CommitID) (int, error)
history []RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall
mutex sync.Mutex
}
// CreateRepoEmbeddingJob delegates to the next hook function in the queue
// and stores the parameter and result values of this invocation.
func (m *MockRepoEmbeddingJobsStore) CreateRepoEmbeddingJob(v0 context.Context, v1 api.RepoID, v2 api.CommitID) (int, error) {
r0, r1 := m.CreateRepoEmbeddingJobFunc.nextHook()(v0, v1, v2)
m.CreateRepoEmbeddingJobFunc.appendCall(RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall{v0, v1, v2, r0, r1})
return r0, r1
}
// SetDefaultHook sets function that is called when the
// CreateRepoEmbeddingJob method of the parent MockRepoEmbeddingJobsStore
// instance is invoked and the hook queue is empty.
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) SetDefaultHook(hook func(context.Context, api.RepoID, api.CommitID) (int, error)) {
f.defaultHook = hook
}
// PushHook adds a function to the end of hook queue. Each invocation of the
// CreateRepoEmbeddingJob method of the parent MockRepoEmbeddingJobsStore
// instance invokes the hook at the front of the queue and discards it.
// After the queue is empty, the default hook function is invoked for any
// future action.
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) PushHook(hook func(context.Context, api.RepoID, api.CommitID) (int, error)) {
f.mutex.Lock()
f.hooks = append(f.hooks, hook)
f.mutex.Unlock()
}
// SetDefaultReturn calls SetDefaultHook with a function that returns the
// given values.
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) SetDefaultReturn(r0 int, r1 error) {
f.SetDefaultHook(func(context.Context, api.RepoID, api.CommitID) (int, error) {
return r0, r1
})
}
// PushReturn calls PushHook with a function that returns the given values.
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) PushReturn(r0 int, r1 error) {
f.PushHook(func(context.Context, api.RepoID, api.CommitID) (int, error) {
return r0, r1
})
}
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) nextHook() func(context.Context, api.RepoID, api.CommitID) (int, error) {
f.mutex.Lock()
defer f.mutex.Unlock()
if len(f.hooks) == 0 {
return f.defaultHook
}
hook := f.hooks[0]
f.hooks = f.hooks[1:]
return hook
}
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) appendCall(r0 RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall) {
f.mutex.Lock()
f.history = append(f.history, r0)
f.mutex.Unlock()
}
// History returns a sequence of
// RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall objects describing
// the invocations of this function.
func (f *RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFunc) History() []RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall {
f.mutex.Lock()
history := make([]RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall, len(f.history))
copy(history, f.history)
f.mutex.Unlock()
return history
}
// RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall is an object that
// describes an invocation of method CreateRepoEmbeddingJob on an instance
// of MockRepoEmbeddingJobsStore.
type RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall struct {
// Arg0 is the value of the 1st argument passed to this method
// invocation.
Arg0 context.Context
// Arg1 is the value of the 2nd argument passed to this method
// invocation.
Arg1 api.RepoID
// Arg2 is the value of the 3rd argument passed to this method
// invocation.
Arg2 api.CommitID
// Result0 is the value of the 1st result returned from this method
// invocation.
Result0 int
// Result1 is the value of the 2nd result returned from this method
// invocation.
Result1 error
}
// Args returns an interface slice containing the arguments of this
// invocation.
func (c RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall) Args() []interface{} {
return []interface{}{c.Arg0, c.Arg1, c.Arg2}
}
// Results returns an interface slice containing the results of this
// invocation.
func (c RepoEmbeddingJobsStoreCreateRepoEmbeddingJobFuncCall) Results() []interface{} {
return []interface{}{c.Result0, c.Result1}
}
// RepoEmbeddingJobsStoreDoneFunc describes the behavior when the Done
// method of the parent MockRepoEmbeddingJobsStore instance is invoked.
type RepoEmbeddingJobsStoreDoneFunc struct {
defaultHook func(error) error
hooks []func(error) error
history []RepoEmbeddingJobsStoreDoneFuncCall
mutex sync.Mutex
}
// Done delegates to the next hook function in the queue and stores the
// parameter and result values of this invocation.
func (m *MockRepoEmbeddingJobsStore) Done(v0 error) error {
r0 := m.DoneFunc.nextHook()(v0)
m.DoneFunc.appendCall(RepoEmbeddingJobsStoreDoneFuncCall{v0, r0})
return r0
}
// SetDefaultHook sets function that is called when the Done method of the
// parent MockRepoEmbeddingJobsStore instance is invoked and the hook queue
// is empty.
func (f *RepoEmbeddingJobsStoreDoneFunc) SetDefaultHook(hook func(error) error) {
f.defaultHook = hook
}
// PushHook adds a function to the end of hook queue. Each invocation of the
// Done method of the parent MockRepoEmbeddingJobsStore instance invokes the
// hook at the front of the queue and discards it. After the queue is empty,
// the default hook function is invoked for any future action.
func (f *RepoEmbeddingJobsStoreDoneFunc) PushHook(hook func(error) error) {
f.mutex.Lock()
f.hooks = append(f.hooks, hook)
f.mutex.Unlock()
}
// SetDefaultReturn calls SetDefaultHook with a function that returns the
// given values.
func (f *RepoEmbeddingJobsStoreDoneFunc) SetDefaultReturn(r0 error) {
f.SetDefaultHook(func(error) error {
return r0
})
}
// PushReturn calls PushHook with a function that returns the given values.
func (f *RepoEmbeddingJobsStoreDoneFunc) PushReturn(r0 error) {
f.PushHook(func(error) error {
return r0
})
}
func (f *RepoEmbeddingJobsStoreDoneFunc) nextHook() func(error) error {
f.mutex.Lock()
defer f.mutex.Unlock()
if len(f.hooks) == 0 {
return f.defaultHook
}
hook := f.hooks[0]
f.hooks = f.hooks[1:]
return hook
}
func (f *RepoEmbeddingJobsStoreDoneFunc) appendCall(r0 RepoEmbeddingJobsStoreDoneFuncCall) {
f.mutex.Lock()
f.history = append(f.history, r0)
f.mutex.Unlock()
}
// History returns a sequence of RepoEmbeddingJobsStoreDoneFuncCall objects
// describing the invocations of this function.
func (f *RepoEmbeddingJobsStoreDoneFunc) History() []RepoEmbeddingJobsStoreDoneFuncCall {
f.mutex.Lock()
history := make([]RepoEmbeddingJobsStoreDoneFuncCall, len(f.history))
copy(history, f.history)
f.mutex.Unlock()
return history
}
// RepoEmbeddingJobsStoreDoneFuncCall is an object that describes an
// invocation of method Done on an instance of MockRepoEmbeddingJobsStore.
type RepoEmbeddingJobsStoreDoneFuncCall struct {
// Arg0 is the value of the 1st argument passed to this method
// invocation.
Arg0 error
// Result0 is the value of the 1st result returned from this method
// invocation.
Result0 error
}
// Args returns an interface slice containing the arguments of this
// invocation.
func (c RepoEmbeddingJobsStoreDoneFuncCall) Args() []interface{} {
return []interface{}{c.Arg0}
}
// Results returns an interface slice containing the results of this
// invocation.
func (c RepoEmbeddingJobsStoreDoneFuncCall) Results() []interface{} {
return []interface{}{c.Result0}
}
// RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc describes the
// behavior when the GetLastCompletedRepoEmbeddingJob method of the parent
// MockRepoEmbeddingJobsStore instance is invoked.
type RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc struct {
defaultHook func(context.Context, api.RepoID) (*RepoEmbeddingJob, error)
hooks []func(context.Context, api.RepoID) (*RepoEmbeddingJob, error)
history []RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall
mutex sync.Mutex
}
// GetLastCompletedRepoEmbeddingJob delegates to the next hook function in
// the queue and stores the parameter and result values of this invocation.
func (m *MockRepoEmbeddingJobsStore) GetLastCompletedRepoEmbeddingJob(v0 context.Context, v1 api.RepoID) (*RepoEmbeddingJob, error) {
r0, r1 := m.GetLastCompletedRepoEmbeddingJobFunc.nextHook()(v0, v1)
m.GetLastCompletedRepoEmbeddingJobFunc.appendCall(RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall{v0, v1, r0, r1})
return r0, r1
}
// SetDefaultHook sets function that is called when the
// GetLastCompletedRepoEmbeddingJob method of the parent
// MockRepoEmbeddingJobsStore instance is invoked and the hook queue is
// empty.
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) SetDefaultHook(hook func(context.Context, api.RepoID) (*RepoEmbeddingJob, error)) {
f.defaultHook = hook
}
// PushHook adds a function to the end of hook queue. Each invocation of the
// GetLastCompletedRepoEmbeddingJob method of the parent
// MockRepoEmbeddingJobsStore instance invokes the hook at the front of the
// queue and discards it. After the queue is empty, the default hook
// function is invoked for any future action.
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) PushHook(hook func(context.Context, api.RepoID) (*RepoEmbeddingJob, error)) {
f.mutex.Lock()
f.hooks = append(f.hooks, hook)
f.mutex.Unlock()
}
// SetDefaultReturn calls SetDefaultHook with a function that returns the
// given values.
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) SetDefaultReturn(r0 *RepoEmbeddingJob, r1 error) {
f.SetDefaultHook(func(context.Context, api.RepoID) (*RepoEmbeddingJob, error) {
return r0, r1
})
}
// PushReturn calls PushHook with a function that returns the given values.
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) PushReturn(r0 *RepoEmbeddingJob, r1 error) {
f.PushHook(func(context.Context, api.RepoID) (*RepoEmbeddingJob, error) {
return r0, r1
})
}
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) nextHook() func(context.Context, api.RepoID) (*RepoEmbeddingJob, error) {
f.mutex.Lock()
defer f.mutex.Unlock()
if len(f.hooks) == 0 {
return f.defaultHook
}
hook := f.hooks[0]
f.hooks = f.hooks[1:]
return hook
}
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) appendCall(r0 RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall) {
f.mutex.Lock()
f.history = append(f.history, r0)
f.mutex.Unlock()
}
// History returns a sequence of
// RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall objects
// describing the invocations of this function.
func (f *RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFunc) History() []RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall {
f.mutex.Lock()
history := make([]RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall, len(f.history))
copy(history, f.history)
f.mutex.Unlock()
return history
}
// RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall is an
// object that describes an invocation of method
// GetLastCompletedRepoEmbeddingJob on an instance of
// MockRepoEmbeddingJobsStore.
type RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall struct {
// Arg0 is the value of the 1st argument passed to this method
// invocation.
Arg0 context.Context
// Arg1 is the value of the 2nd argument passed to this method
// invocation.
Arg1 api.RepoID
// Result0 is the value of the 1st result returned from this method
// invocation.
Result0 *RepoEmbeddingJob
// Result1 is the value of the 2nd result returned from this method
// invocation.
Result1 error
}
// Args returns an interface slice containing the arguments of this
// invocation.
func (c RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall) Args() []interface{} {
return []interface{}{c.Arg0, c.Arg1}
}
// Results returns an interface slice containing the results of this
// invocation.
func (c RepoEmbeddingJobsStoreGetLastCompletedRepoEmbeddingJobFuncCall) Results() []interface{} {
return []interface{}{c.Result0, c.Result1}
}
// RepoEmbeddingJobsStoreHandleFunc describes the behavior when the Handle
// method of the parent MockRepoEmbeddingJobsStore instance is invoked.
type RepoEmbeddingJobsStoreHandleFunc struct {
defaultHook func() basestore.TransactableHandle
hooks []func() basestore.TransactableHandle
history []RepoEmbeddingJobsStoreHandleFuncCall
mutex sync.Mutex
}
// Handle delegates to the next hook function in the queue and stores the
// parameter and result values of this invocation.
func (m *MockRepoEmbeddingJobsStore) Handle() basestore.TransactableHandle {
r0 := m.HandleFunc.nextHook()()
m.HandleFunc.appendCall(RepoEmbeddingJobsStoreHandleFuncCall{r0})
return r0
}
// SetDefaultHook sets function that is called when the Handle method of the
// parent MockRepoEmbeddingJobsStore instance is invoked and the hook queue
// is empty.
func (f *RepoEmbeddingJobsStoreHandleFunc) SetDefaultHook(hook func() basestore.TransactableHandle) {
f.defaultHook = hook
}
// PushHook adds a function to the end of hook queue. Each invocation of the
// Handle method of the parent MockRepoEmbeddingJobsStore instance invokes
// the hook at the front of the queue and discards it. After the queue is
// empty, the default hook function is invoked for any future action.
func (f *RepoEmbeddingJobsStoreHandleFunc) PushHook(hook func() basestore.TransactableHandle) {
f.mutex.Lock()
f.hooks = append(f.hooks, hook)
f.mutex.Unlock()
}
// SetDefaultReturn calls SetDefaultHook with a function that returns the
// given values.
func (f *RepoEmbeddingJobsStoreHandleFunc) SetDefaultReturn(r0 basestore.TransactableHandle) {
f.SetDefaultHook(func() basestore.TransactableHandle {
return r0
})
}
// PushReturn calls PushHook with a function that returns the given values.
func (f *RepoEmbeddingJobsStoreHandleFunc) PushReturn(r0 basestore.TransactableHandle) {
f.PushHook(func() basestore.TransactableHandle {
return r0
})
}
func (f *RepoEmbeddingJobsStoreHandleFunc) nextHook() func() basestore.TransactableHandle {
f.mutex.Lock()
defer f.mutex.Unlock()
if len(f.hooks) == 0 {
return f.defaultHook
}
hook := f.hooks[0]
f.hooks = f.hooks[1:]
return hook
}
func (f *RepoEmbeddingJobsStoreHandleFunc) appendCall(r0 RepoEmbeddingJobsStoreHandleFuncCall) {
f.mutex.Lock()
f.history = append(f.history, r0)
f.mutex.Unlock()
}
// History returns a sequence of RepoEmbeddingJobsStoreHandleFuncCall
// objects describing the invocations of this function.
func (f *RepoEmbeddingJobsStoreHandleFunc) History() []RepoEmbeddingJobsStoreHandleFuncCall {
f.mutex.Lock()
history := make([]RepoEmbeddingJobsStoreHandleFuncCall, len(f.history))
copy(history, f.history)
f.mutex.Unlock()
return history
}
// RepoEmbeddingJobsStoreHandleFuncCall is an object that describes an
// invocation of method Handle on an instance of MockRepoEmbeddingJobsStore.
type RepoEmbeddingJobsStoreHandleFuncCall struct {
// Result0 is the value of the 1st result returned from this method
// invocation.
Result0 basestore.TransactableHandle
}
// Args returns an interface slice containing the arguments of this
// invocation.
func (c RepoEmbeddingJobsStoreHandleFuncCall) Args() []interface{} {
return []interface{}{}
}
// Results returns an interface slice containing the results of this
// invocation.
func (c RepoEmbeddingJobsStoreHandleFuncCall) Results() []interface{} {
return []interface{}{c.Result0}
}
// RepoEmbeddingJobsStoreTransactFunc describes the behavior when the
// Transact method of the parent MockRepoEmbeddingJobsStore instance is
// invoked.
type RepoEmbeddingJobsStoreTransactFunc struct {
defaultHook func(context.Context) (RepoEmbeddingJobsStore, error)
hooks []func(context.Context) (RepoEmbeddingJobsStore, error)
history []RepoEmbeddingJobsStoreTransactFuncCall
mutex sync.Mutex
}
// Transact delegates to the next hook function in the queue and stores the
// parameter and result values of this invocation.
func (m *MockRepoEmbeddingJobsStore) Transact(v0 context.Context) (RepoEmbeddingJobsStore, error) {
r0, r1 := m.TransactFunc.nextHook()(v0)
m.TransactFunc.appendCall(RepoEmbeddingJobsStoreTransactFuncCall{v0, r0, r1})
return r0, r1
}
// SetDefaultHook sets function that is called when the Transact method of
// the parent MockRepoEmbeddingJobsStore instance is invoked and the hook
// queue is empty.
func (f *RepoEmbeddingJobsStoreTransactFunc) SetDefaultHook(hook func(context.Context) (RepoEmbeddingJobsStore, error)) {
f.defaultHook = hook
}
// PushHook adds a function to the end of hook queue. Each invocation of the
// Transact method of the parent MockRepoEmbeddingJobsStore instance invokes
// the hook at the front of the queue and discards it. After the queue is
// empty, the default hook function is invoked for any future action.
func (f *RepoEmbeddingJobsStoreTransactFunc) PushHook(hook func(context.Context) (RepoEmbeddingJobsStore, error)) {
f.mutex.Lock()
f.hooks = append(f.hooks, hook)
f.mutex.Unlock()
}
// SetDefaultReturn calls SetDefaultHook with a function that returns the
// given values.
func (f *RepoEmbeddingJobsStoreTransactFunc) SetDefaultReturn(r0 RepoEmbeddingJobsStore, r1 error) {
f.SetDefaultHook(func(context.Context) (RepoEmbeddingJobsStore, error) {
return r0, r1
})
}
// PushReturn calls PushHook with a function that returns the given values.
func (f *RepoEmbeddingJobsStoreTransactFunc) PushReturn(r0 RepoEmbeddingJobsStore, r1 error) {
f.PushHook(func(context.Context) (RepoEmbeddingJobsStore, error) {
return r0, r1
})
}
func (f *RepoEmbeddingJobsStoreTransactFunc) nextHook() func(context.Context) (RepoEmbeddingJobsStore, error) {
f.mutex.Lock()
defer f.mutex.Unlock()
if len(f.hooks) == 0 {
return f.defaultHook
}
hook := f.hooks[0]
f.hooks = f.hooks[1:]
return hook
}
func (f *RepoEmbeddingJobsStoreTransactFunc) appendCall(r0 RepoEmbeddingJobsStoreTransactFuncCall) {
f.mutex.Lock()
f.history = append(f.history, r0)
f.mutex.Unlock()
}
// History returns a sequence of RepoEmbeddingJobsStoreTransactFuncCall
// objects describing the invocations of this function.
func (f *RepoEmbeddingJobsStoreTransactFunc) History() []RepoEmbeddingJobsStoreTransactFuncCall {
f.mutex.Lock()
history := make([]RepoEmbeddingJobsStoreTransactFuncCall, len(f.history))
copy(history, f.history)
f.mutex.Unlock()
return history
}
// RepoEmbeddingJobsStoreTransactFuncCall is an object that describes an
// invocation of method Transact on an instance of
// MockRepoEmbeddingJobsStore.
type RepoEmbeddingJobsStoreTransactFuncCall struct {
// Arg0 is the value of the 1st argument passed to this method
// invocation.
Arg0 context.Context
// Result0 is the value of the 1st result returned from this method
// invocation.
Result0 RepoEmbeddingJobsStore
// Result1 is the value of the 2nd result returned from this method
// invocation.
Result1 error
}
// Args returns an interface slice containing the arguments of this
// invocation.
func (c RepoEmbeddingJobsStoreTransactFuncCall) Args() []interface{} {
return []interface{}{c.Arg0}
}
// Results returns an interface slice containing the results of this
// invocation.
func (c RepoEmbeddingJobsStoreTransactFuncCall) Results() []interface{} {
return []interface{}{c.Result0, c.Result1}
}
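
The generated mocks follow go-mockgen's hook-queue pattern: `PushHook`/`PushReturn` enqueue one-shot hooks, `SetDefaultHook`/`SetDefaultReturn` cover every call after the queue drains, and `History` records invocations. A minimal sketch of how a test might use them (the test itself is illustrative, not from this PR):

```go
package repo_test

import (
	"context"
	"testing"

	"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
	"github.com/sourcegraph/sourcegraph/internal/api"
)

func TestCreateRepoEmbeddingJobMock(t *testing.T) {
	store := repo.NewMockRepoEmbeddingJobsStore()
	// First call returns job ID 1; any later call returns 2.
	store.CreateRepoEmbeddingJobFunc.PushReturn(1, nil)
	store.CreateRepoEmbeddingJobFunc.SetDefaultReturn(2, nil)

	id, err := store.CreateRepoEmbeddingJob(context.Background(), api.RepoID(42), api.CommitID("deadbeef"))
	if err != nil || id != 1 {
		t.Fatalf("got (%d, %v), want (1, nil)", id, err)
	}

	// History exposes the recorded invocations for assertions.
	calls := store.CreateRepoEmbeddingJobFunc.History()
	if len(calls) != 1 || calls[0].Arg1 != api.RepoID(42) {
		t.Fatalf("unexpected call history: %+v", calls)
	}
}
```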

@@ -0,0 +1,122 @@
package repo
import (
"context"
"time"
"github.com/keegancsmith/sqlf"
"github.com/lib/pq"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/database/basestore"
"github.com/sourcegraph/sourcegraph/internal/database/dbutil"
"github.com/sourcegraph/sourcegraph/internal/executor"
"github.com/sourcegraph/sourcegraph/internal/observation"
dbworkerstore "github.com/sourcegraph/sourcegraph/internal/workerutil/dbworker/store"
)
var repoEmbeddingJobsColumns = []*sqlf.Query{
sqlf.Sprintf("repo_embedding_jobs.id"),
sqlf.Sprintf("repo_embedding_jobs.state"),
sqlf.Sprintf("repo_embedding_jobs.failure_message"),
sqlf.Sprintf("repo_embedding_jobs.queued_at"),
sqlf.Sprintf("repo_embedding_jobs.started_at"),
sqlf.Sprintf("repo_embedding_jobs.finished_at"),
sqlf.Sprintf("repo_embedding_jobs.process_after"),
sqlf.Sprintf("repo_embedding_jobs.num_resets"),
sqlf.Sprintf("repo_embedding_jobs.num_failures"),
sqlf.Sprintf("repo_embedding_jobs.last_heartbeat_at"),
sqlf.Sprintf("repo_embedding_jobs.execution_logs"),
sqlf.Sprintf("repo_embedding_jobs.worker_hostname"),
sqlf.Sprintf("repo_embedding_jobs.cancel"),
sqlf.Sprintf("repo_embedding_jobs.repo_id"),
sqlf.Sprintf("repo_embedding_jobs.revision"),
}
func scanRepoEmbeddingJob(s dbutil.Scanner) (*RepoEmbeddingJob, error) {
var job RepoEmbeddingJob
var executionLogs []executor.ExecutionLogEntry
if err := s.Scan(
&job.ID,
&job.State,
&job.FailureMessage,
&job.QueuedAt,
&job.StartedAt,
&job.FinishedAt,
&job.ProcessAfter,
&job.NumResets,
&job.NumFailures,
&job.LastHeartbeatAt,
pq.Array(&executionLogs),
&job.WorkerHostname,
&job.Cancel,
&job.RepoID,
&job.Revision,
); err != nil {
return nil, err
}
job.ExecutionLogs = append(job.ExecutionLogs, executionLogs...)
return &job, nil
}
func NewRepoEmbeddingJobWorkerStore(observationCtx *observation.Context, dbHandle basestore.TransactableHandle) dbworkerstore.Store[*RepoEmbeddingJob] {
return dbworkerstore.New(observationCtx, dbHandle, dbworkerstore.Options[*RepoEmbeddingJob]{
Name: "repo_embedding_job_worker",
TableName: "repo_embedding_jobs",
ColumnExpressions: repoEmbeddingJobsColumns,
Scan: dbworkerstore.BuildWorkerScan(scanRepoEmbeddingJob),
OrderByExpression: sqlf.Sprintf("repo_embedding_jobs.queued_at, repo_embedding_jobs.id"),
StalledMaxAge: time.Second * 5,
MaxNumResets: 5,
})
}
type RepoEmbeddingJobsStore interface {
basestore.ShareableStore
Transact(ctx context.Context) (RepoEmbeddingJobsStore, error)
Done(err error) error
CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID) (int, error)
GetLastCompletedRepoEmbeddingJob(ctx context.Context, repoID api.RepoID) (*RepoEmbeddingJob, error)
}
var _ basestore.ShareableStore = &repoEmbeddingJobsStore{}
type repoEmbeddingJobsStore struct {
*basestore.Store
}
func NewRepoEmbeddingJobsStore(other basestore.ShareableStore) RepoEmbeddingJobsStore {
return &repoEmbeddingJobsStore{Store: basestore.NewWithHandle(other.Handle())}
}
func (s *repoEmbeddingJobsStore) Transact(ctx context.Context) (RepoEmbeddingJobsStore, error) {
tx, err := s.Store.Transact(ctx)
if err != nil {
return nil, err
}
return &repoEmbeddingJobsStore{Store: tx}, nil
}
const createRepoEmbeddingJobFmtStr = `INSERT INTO repo_embedding_jobs (repo_id, revision) VALUES (%s, %s) RETURNING id`
func (s *repoEmbeddingJobsStore) CreateRepoEmbeddingJob(ctx context.Context, repoID api.RepoID, revision api.CommitID) (int, error) {
q := sqlf.Sprintf(createRepoEmbeddingJobFmtStr, repoID, revision)
id, _, err := basestore.ScanFirstInt(s.Query(ctx, q))
return id, err
}
const getLastFinishedRepoEmbeddingJob = `
SELECT %s
FROM repo_embedding_jobs
WHERE state = 'completed' AND repo_id = %d
ORDER BY finished_at DESC
LIMIT 1
`
func (s *repoEmbeddingJobsStore) GetLastCompletedRepoEmbeddingJob(ctx context.Context, repoID api.RepoID) (*RepoEmbeddingJob, error) {
q := sqlf.Sprintf(getLastFinishedRepoEmbeddingJob, sqlf.Join(repoEmbeddingJobsColumns, ", "), repoID)
return scanRepoEmbeddingJob(s.QueryRow(ctx, q))
}
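
Again for illustration only — enqueueing a repo embedding job through the store's `Transact`/`Done` pattern (the helper and its package are hypothetical):

```go
package scheduling // hypothetical

import (
	"context"

	"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo"
	"github.com/sourcegraph/sourcegraph/internal/api"
)

// enqueueRepoEmbeddingJob schedules an embedding job for a repo revision.
// Note that GetLastCompletedRepoEmbeddingJob returns sql.ErrNoRows until a
// job for the repo has reached state = 'completed'.
func enqueueRepoEmbeddingJob(ctx context.Context, store repo.RepoEmbeddingJobsStore, repoID api.RepoID, rev api.CommitID) (err error) {
	tx, err := store.Transact(ctx)
	if err != nil {
		return err
	}
	// Done commits when err is nil and rolls back otherwise.
	defer func() { err = tx.Done(err) }()

	_, err = tx.CreateRepoEmbeddingJob(ctx, repoID, rev)
	return err
}
```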

@@ -0,0 +1,31 @@
package repo
import (
"time"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/executor"
)
type RepoEmbeddingJob struct {
ID int
State string
FailureMessage *string
QueuedAt time.Time
StartedAt *time.Time
FinishedAt *time.Time
ProcessAfter *time.Time
NumResets int
NumFailures int
LastHeartbeatAt time.Time
ExecutionLogs []executor.ExecutionLogEntry
WorkerHostname string
Cancel bool
RepoID api.RepoID
Revision api.CommitID
}
func (j *RepoEmbeddingJob) RecordID() int {
return j.ID
}

@@ -0,0 +1,147 @@
package embeddings
import (
"bytes"
"context"
"encoding/json"
"io"
"net/http"
"strings"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/conf/conftypes"
"github.com/sourcegraph/sourcegraph/internal/endpoint"
"github.com/sourcegraph/sourcegraph/internal/httpcli"
)
func defaultEndpoints() *endpoint.Map {
return endpoint.ConfBased(func(conns conftypes.ServiceConnections) []string {
return conns.Embeddings
})
}
var defaultDoer = func() httpcli.Doer {
d, err := httpcli.NewInternalClientFactory("embeddings").Doer()
if err != nil {
panic(err)
}
return d
}()
func NewClient() *Client {
return &Client{
Endpoints: defaultEndpoints(),
HTTPClient: defaultDoer,
}
}
type Client struct {
// Endpoints to the embeddings service.
Endpoints *endpoint.Map
// HTTPClient is the HTTP client used for requests.
HTTPClient httpcli.Doer
}
type EmbeddingsSearchParameters struct {
RepoName api.RepoName `json:"repoName"`
Query string `json:"query"`
CodeResultsCount int `json:"codeResultsCount"`
TextResultsCount int `json:"textResultsCount"`
}
type IsContextRequiredForChatQueryParameters struct {
Query string `json:"query"`
}
type IsContextRequiredForChatQueryResult struct {
IsRequired bool `json:"isRequired"`
}
func (c *Client) Search(ctx context.Context, args EmbeddingsSearchParameters) (*EmbeddingSearchResults, error) {
resp, err := c.httpPost(ctx, "search", args.RepoName, args)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
// best-effort inclusion of body in error message
body, _ := io.ReadAll(io.LimitReader(resp.Body, 200))
return nil, errors.Errorf(
"Embeddings.Search http status %d: %s",
resp.StatusCode,
string(body),
)
}
var response EmbeddingSearchResults
err = json.NewDecoder(resp.Body).Decode(&response)
if err != nil {
return nil, err
}
return &response, nil
}
func (c *Client) IsContextRequiredForChatQuery(ctx context.Context, args IsContextRequiredForChatQueryParameters) (bool, error) {
resp, err := c.httpPost(ctx, "isContextRequiredForChatQuery", "", args)
if err != nil {
return false, err
}
defer resp.Body.Close()
if resp.StatusCode != http.StatusOK {
// best-effort inclusion of body in error message
body, _ := io.ReadAll(io.LimitReader(resp.Body, 200))
return false, errors.Errorf(
"Embeddings.IsContextRequiredForChatQuery http status %d: %s",
resp.StatusCode,
string(body),
)
}
var response IsContextRequiredForChatQueryResult
err = json.NewDecoder(resp.Body).Decode(&response)
if err != nil {
return false, err
}
return response.IsRequired, nil
}
func (c *Client) url(repo api.RepoName) (string, error) {
if c.Endpoints == nil {
return "", errors.New("an embeddings service has not been configured")
}
return c.Endpoints.Get(string(repo))
}
func (c *Client) httpPost(
ctx context.Context,
method string,
repo api.RepoName,
payload any,
) (resp *http.Response, err error) {
url, err := c.url(repo)
if err != nil {
return nil, err
}
reqBody, err := json.Marshal(payload)
if err != nil {
return nil, err
}
if !strings.HasSuffix(url, "/") {
url += "/"
}
req, err := http.NewRequest("POST", url+method, bytes.NewReader(reqBody))
if err != nil {
return nil, err
}
req.Header.Set("Content-Type", "application/json")
req = req.WithContext(ctx)
return c.HTTPClient.Do(req)
}
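
A sketch of how a caller (such as the `embeddingsSearch` resolver) might use this client; the function, its package, and the result counts are illustrative:

```go
package example // hypothetical

import (
	"context"
	"fmt"

	"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
	"github.com/sourcegraph/sourcegraph/internal/api"
)

func searchRepo(ctx context.Context) error {
	client := embeddings.NewClient()
	results, err := client.Search(ctx, embeddings.EmbeddingsSearchParameters{
		RepoName:         api.RepoName("github.com/sourcegraph/sourcegraph"),
		Query:            "how do access tokens work",
		CodeResultsCount: 5,
		TextResultsCount: 2,
	})
	if err != nil {
		return err
	}
	for _, r := range results.CodeResults {
		fmt.Printf("%s:%d-%d\n", r.FileName, r.StartLine, r.EndLine)
	}
	return nil
}
```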

@@ -0,0 +1,143 @@
package embed
import (
"bytes"
"encoding/json"
"io"
"math"
"net/http"
"sort"
"strings"
"sync"
"time"
"github.com/sourcegraph/sourcegraph/internal/conf"
"github.com/sourcegraph/sourcegraph/internal/httpcli"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/sourcegraph/schema"
)
type EmbeddingAPIRequest struct {
Model string `json:"model"`
Input []string `json:"input"`
}
type EmbeddingAPIResponse struct {
Data []struct {
Index int `json:"index"`
Embedding []float32 `json:"embedding"`
} `json:"data"`
}
type EmbeddingsClient interface {
GetEmbeddingsWithRetries(texts []string, maxRetries int) ([]float32, error)
GetDimensions() (int, error)
}
func NewEmbeddingsClient() EmbeddingsClient {
client := &embeddingsClient{config: conf.Get().Embeddings}
mu := sync.Mutex{}
conf.Watch(func() {
mu.Lock()
defer mu.Unlock()
client.setConfig(conf.Get().Embeddings)
})
return client
}
type embeddingsClient struct {
config *schema.Embeddings
}
func (c *embeddingsClient) isDisabled() bool {
return c.config == nil || !c.config.Enabled
}
func (c *embeddingsClient) setConfig(config *schema.Embeddings) {
c.config = config
}
func (c *embeddingsClient) GetDimensions() (int, error) {
if c.isDisabled() {
return -1, errors.New("embeddings are not configured or disabled")
}
return c.config.Dimensions, nil
}
// GetEmbeddingsWithRetries tries to embed the given texts using the external service specified in the config.
// In case of failure, it retries the embedding procedure up to maxRetries times. This is necessary because the
// OpenAI API often hangs up when downloading large embedding responses.
func (c *embeddingsClient) GetEmbeddingsWithRetries(texts []string, maxRetries int) ([]float32, error) {
if c.isDisabled() {
return nil, errors.New("embeddings are not configured or disabled")
}
embeddings, err := getEmbeddings(texts, c.config)
if err == nil {
return embeddings, nil
}
for i := 0; i < maxRetries; i++ {
embeddings, err = getEmbeddings(texts, c.config)
if err == nil {
return embeddings, nil
} else {
// Exponential backoff: wait 2^i seconds before the next attempt.
delay := time.Duration(int(math.Pow(float64(2), float64(i))))
time.Sleep(delay * time.Second)
}
}
return nil, err
}
func getEmbeddings(texts []string, config *schema.Embeddings) ([]float32, error) {
// Replace newlines, which can negatively affect performance.
augmentedTexts := make([]string, len(texts))
for idx, text := range texts {
augmentedTexts[idx] = strings.ReplaceAll(text, "\n", " ")
}
request := EmbeddingAPIRequest{Model: config.Model, Input: augmentedTexts}
bodyBytes, err := json.Marshal(request)
if err != nil {
return nil, err
}
req, err := http.NewRequest("POST", config.Url, bytes.NewReader(bodyBytes))
if err != nil {
return nil, err
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", "Bearer "+config.AccessToken)
resp, err := httpcli.ExternalDoer.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
if resp.StatusCode != 200 {
respBody, _ := io.ReadAll(resp.Body)
return nil, errors.Errorf("embeddings: %s %q: failed with status %d: %s", req.Method, req.URL.String(), resp.StatusCode, string(respBody))
}
var response EmbeddingAPIResponse
if err := json.NewDecoder(resp.Body).Decode(&response); err != nil {
return nil, err
}
// Ensure embedding responses are sorted in the original order.
sort.Slice(response.Data, func(i, j int) bool {
return response.Data[i].Index < response.Data[j].Index
})
embeddings := make([]float32, 0, len(response.Data)*config.Dimensions)
for _, embedding := range response.Data {
embeddings = append(embeddings, embedding.Embedding...)
}
return embeddings, nil
}
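
Note that `GetEmbeddingsWithRetries` returns all embeddings flattened into a single `[]float32` in input order, with each input occupying a consecutive block of `dimensions` floats. A sketch of recovering per-text vectors, written as if it lived in this package (the helper is hypothetical):

```go
func perTextEmbeddings(client EmbeddingsClient, texts []string, maxRetries int) ([][]float32, error) {
	dims, err := client.GetDimensions()
	if err != nil {
		return nil, err
	}
	flat, err := client.GetEmbeddingsWithRetries(texts, maxRetries)
	if err != nil {
		return nil, err
	}
	vectors := make([][]float32, len(texts))
	for i := range texts {
		// Text i occupies flat[i*dims : (i+1)*dims].
		vectors[i] = flat[i*dims : (i+1)*dims]
	}
	return vectors, nil
}
```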

@@ -0,0 +1,141 @@
package embed
import (
"context"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/split"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/binary"
)
const GET_EMBEDDINGS_MAX_RETRIES = 5
const MAX_CODE_EMBEDDING_VECTORS = 768_000
const MAX_TEXT_EMBEDDING_VECTORS = 128_000
const EMBEDDING_BATCHES = 5
const EMBEDDING_BATCH_SIZE = 512
type readFile func(fileName string) ([]byte, error)
// EmbedRepo embeds file contents from the given file names for a repository.
// It separates the file names into code files and text files and embeds them separately.
// It returns a RepoEmbeddingIndex containing the embeddings and metadata.
func EmbedRepo(
ctx context.Context,
repoName api.RepoName,
revision api.CommitID,
fileNames []string,
client EmbeddingsClient,
splitOptions split.SplitOptions,
readFile readFile,
) (*embeddings.RepoEmbeddingIndex, error) {
codeFileNames, textFileNames := []string{}, []string{}
for _, fileName := range fileNames {
if isValidTextFile(fileName) {
textFileNames = append(textFileNames, fileName)
} else if isValidCodeFile(fileName) {
codeFileNames = append(codeFileNames, fileName)
}
}
codeIndex, err := embedFiles(codeFileNames, client, splitOptions, readFile, MAX_CODE_EMBEDDING_VECTORS)
if err != nil {
return nil, err
}
textIndex, err := embedFiles(textFileNames, client, splitOptions, readFile, MAX_TEXT_EMBEDDING_VECTORS)
if err != nil {
return nil, err
}
return &embeddings.RepoEmbeddingIndex{RepoName: repoName, Revision: revision, CodeIndex: codeIndex, TextIndex: textIndex}, nil
}
// embedFiles embeds file contents from the given file names. Since embedding models can only handle a certain amount
// of text (tokens), we cannot embed entire files. Instead, we split the file contents into chunks and get embeddings
// for the chunks in batches. The function returns an EmbeddingIndex containing the embeddings and metadata about the
// chunks the embeddings correspond to.
func embedFiles(
fileNames []string,
client EmbeddingsClient,
splitOptions split.SplitOptions,
readFile readFile,
maxEmbeddingVectors int,
) (*embeddings.EmbeddingIndex[embeddings.RepoEmbeddingRowMetadata], error) {
if len(fileNames) == 0 {
return nil, nil
}
dimensions, err := client.GetDimensions()
if err != nil {
return nil, err
}
index := &embeddings.EmbeddingIndex[embeddings.RepoEmbeddingRowMetadata]{
Embeddings: make([]float32, 0, len(fileNames)*dimensions),
RowMetadata: make([]embeddings.RepoEmbeddingRowMetadata, 0, len(fileNames)),
ColumnDimension: dimensions,
}
// addEmbeddableChunks batches embeddable chunks, gets embeddings for the batches, and appends them to the index above.
addEmbeddableChunks := func(embeddableChunks []split.EmbeddableChunk, batchSize int) error {
// The embeddings API operates with batches up to a certain size, so we can't send all embeddable chunks for embedding at once.
// We batch them according to `batchSize`, and embed one by one.
for i := 0; i < len(embeddableChunks); i += batchSize {
end := min(len(embeddableChunks), i+batchSize)
batch := embeddableChunks[i:end]
batchChunks := make([]string, len(batch))
for idx, chunk := range batch {
batchChunks[idx] = chunk.Content
index.RowMetadata = append(index.RowMetadata, embeddings.RepoEmbeddingRowMetadata{FileName: chunk.FileName, StartLine: chunk.StartLine, EndLine: chunk.EndLine})
}
batchEmbeddings, err := client.GetEmbeddingsWithRetries(batchChunks, GET_EMBEDDINGS_MAX_RETRIES)
if err != nil {
return errors.Wrap(err, "error while getting embeddings")
}
index.Embeddings = append(index.Embeddings, batchEmbeddings...)
}
return nil
}
embeddableChunks := []split.EmbeddableChunk{}
for _, fileName := range fileNames {
// This is a fail-safe measure to prevent producing an extremely large index for large repositories.
if len(index.RowMetadata) > maxEmbeddingVectors {
break
}
contentBytes, err := readFile(fileName)
if err != nil {
return nil, errors.Wrap(err, "error while reading a file")
}
if binary.IsBinary(contentBytes) {
continue
}
content := string(contentBytes)
if !isEmbeddableFile(fileName, content) {
continue
}
embeddableChunks = append(embeddableChunks, split.SplitIntoEmbeddableChunks(content, fileName, splitOptions)...)
if len(embeddableChunks) > EMBEDDING_BATCHES*EMBEDDING_BATCH_SIZE {
err := addEmbeddableChunks(embeddableChunks, EMBEDDING_BATCH_SIZE)
if err != nil {
return nil, err
}
embeddableChunks = []split.EmbeddableChunk{}
}
}
if len(embeddableChunks) > 0 {
err := addEmbeddableChunks(embeddableChunks, EMBEDDING_BATCH_SIZE)
if err != nil {
return nil, err
}
}
return index, nil
}

@@ -0,0 +1,151 @@
package embed
import (
"context"
"strings"
"testing"
"github.com/sourcegraph/sourcegraph/lib/errors"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/split"
"github.com/sourcegraph/sourcegraph/internal/api"
)
func mockFile(lines ...string) []byte {
return []byte(strings.Join(lines, "\n"))
}
func TestEmbedRepo(t *testing.T) {
ctx := context.Background()
repoName := api.RepoName("repo/name")
revision := api.CommitID("deadbeef")
client := NewMockEmbeddingsClient()
splitOptions := split.SplitOptions{ChunkTokensThreshold: 8}
mockFiles := map[string][]byte{
// 2 embedding chunks (based on split options above)
"a.go": mockFile(
strings.Repeat("a", 32),
"",
strings.Repeat("b", 32),
),
// 2 embedding chunks
"b.md": mockFile(
"# "+strings.Repeat("a", 32),
"",
"## "+strings.Repeat("b", 32),
),
// 3 embedding chunks
"c.java": mockFile(
strings.Repeat("a", 32),
"",
strings.Repeat("b", 32),
"",
strings.Repeat("c", 32),
),
// Should be excluded
"autogen.py": mockFile(
"# "+strings.Repeat("a", 32),
"// Do not edit",
),
// Should be excluded
"lines_too_long.c": mockFile(
strings.Repeat("a", 2049),
strings.Repeat("b", 2049),
strings.Repeat("c", 2049),
),
// Should be excluded
"empty.rb": mockFile(""),
// Should be excluded (binary file)
"binary.bin": {0xFF, 0xF, 0xF, 0xF, 0xFF, 0xF, 0xF, 0xA},
}
readFile := func(fileName string) ([]byte, error) {
content, ok := mockFiles[fileName]
if !ok {
return nil, errors.Newf("file %s not found", fileName)
}
return content, nil
}
t.Run("no files", func(t *testing.T) {
index, err := EmbedRepo(ctx, repoName, revision, []string{}, client, splitOptions, readFile)
if err != nil {
t.Fatal(err)
}
if index.CodeIndex != nil {
t.Fatal("expected code index to be nil")
}
if index.TextIndex != nil {
t.Fatal("expected text index to be nil")
}
})
t.Run("code files only", func(t *testing.T) {
index, err := EmbedRepo(ctx, repoName, revision, []string{"a.go"}, client, splitOptions, readFile)
if err != nil {
t.Fatal(err)
}
if index.TextIndex != nil {
t.Fatal("expected text index to be nil")
}
if index.CodeIndex == nil {
t.Fatal("expected code index to be non-nil")
}
if len(index.CodeIndex.RowMetadata) != 2 {
t.Fatal("expected 2 embedding rows")
}
})
t.Run("text files only", func(t *testing.T) {
index, err := EmbedRepo(ctx, repoName, revision, []string{"b.md"}, client, splitOptions, readFile)
if err != nil {
t.Fatal(err)
}
if index.CodeIndex != nil {
t.Fatal("expected code index to be nil")
}
if index.TextIndex == nil {
t.Fatal("expected text index to be non-nil")
}
if len(index.TextIndex.RowMetadata) != 2 {
t.Fatal("expected 2 embedding rows")
}
})
t.Run("mixed code and text files", func(t *testing.T) {
files := []string{"a.go", "b.md", "c.java", "autogen.py", "empty.rb", "lines_too_long.c", "binary.bin"}
index, err := EmbedRepo(ctx, repoName, revision, files, client, splitOptions, readFile)
if err != nil {
t.Fatal(err)
}
if index.CodeIndex == nil {
t.Fatal("expected code index to be non-nil")
}
if index.TextIndex == nil {
t.Fatal("expected text index to be non-nil")
}
if len(index.CodeIndex.RowMetadata) != 5 {
t.Fatalf("expected 5 embedding rows in code index, got %d", len(index.CodeIndex.RowMetadata))
}
if len(index.TextIndex.RowMetadata) != 2 {
t.Fatalf("expected 2 embedding rows in text index, got %d", len(index.TextIndex.RowMetadata))
}
})
}
func NewMockEmbeddingsClient() EmbeddingsClient {
return &mockEmbeddingsClient{}
}
type mockEmbeddingsClient struct{}
func (c *mockEmbeddingsClient) GetDimensions() (int, error) {
return 3, nil
}
func (c *mockEmbeddingsClient) GetEmbeddingsWithRetries(texts []string, maxRetries int) ([]float32, error) {
dimensions, err := c.GetDimensions()
if err != nil {
return nil, err
}
return make([]float32, len(texts)*dimensions), nil
}

@@ -0,0 +1,90 @@
package embed
import (
"path/filepath"
"strings"
)
const MIN_EMBEDDABLE_FILE_SIZE = 32
const MAX_LINE_LENGTH = 2048
var autogeneratedFileHeaders = []string{"autogenerated file", "lockfile", "generated by", "do not edit"}
var textFileExtensions = map[string]struct{}{
"md": {},
"markdown": {},
"rst": {},
"txt": {},
}
var excludedCodeFileExtensions = map[string]struct{}{
"sql": {},
"svg": {},
"json": {},
"yml": {},
"yaml": {},
}
var excludedFilePaths = []string{
"/__fixtures__",
"/testdata",
"/mocks",
"/vendor",
}
func isEmbeddableFile(fileName string, content string) bool {
if len(strings.TrimSpace(content)) < MIN_EMBEDDABLE_FILE_SIZE {
return false
}
for _, excludedFilePath := range excludedFilePaths {
if strings.Contains(fileName, excludedFilePath) {
return false
}
}
lines := strings.Split(content, "\n")
fileHeader := strings.ToLower(strings.Join(lines[0:min(5, len(lines))], "\n"))
for _, header := range autogeneratedFileHeaders {
if strings.Contains(fileHeader, header) {
return false
}
}
for _, line := range lines {
if len(line) > MAX_LINE_LENGTH {
return false
}
}
return true
}
func isValidTextFile(fileName string) bool {
ext := strings.TrimPrefix(filepath.Ext(fileName), ".")
_, ok := textFileExtensions[strings.ToLower(ext)]
if ok {
return true
}
basename := strings.ToLower(filepath.Base(fileName))
return strings.HasPrefix(basename, "license")
}
func isValidCodeFile(fileName string) bool {
basename := strings.ToLower(filepath.Base(fileName))
if strings.HasPrefix(basename, "dockerfile") {
return true
}
ext := strings.TrimPrefix(filepath.Ext(fileName), ".")
_, ok := excludedCodeFileExtensions[strings.ToLower(ext)]
return !ok
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
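
As an example of these filters in action, a file opening with a go-mockgen header would be rejected by the "do not edit" / "generated by" markers — a hypothetical test in the same package:

```go
package embed

import (
	"strings"
	"testing"
)

func TestIsEmbeddableFile(t *testing.T) {
	generated := "// Code generated by go-mockgen 1.3.7; DO NOT EDIT.\n" + strings.Repeat("x", 64)
	if isEmbeddableFile("gen.go", generated) {
		t.Fatal("expected generated file to be rejected")
	}
	ordinary := strings.Repeat("real code\n", 16)
	if !isEmbeddableFile("main.go", ordinary) {
		t.Fatal("expected ordinary file to be accepted")
	}
}
```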

@@ -0,0 +1,23 @@
package embeddings
import (
"crypto/md5"
"encoding/hex"
"fmt"
"github.com/sourcegraph/sourcegraph/internal/api"
"github.com/sourcegraph/sourcegraph/internal/lazyregexp"
)
var nonAlphanumericCharsRegexp = lazyregexp.New(`[^0-9a-zA-Z]`)
var CONTEXT_DETECTION_INDEX_NAME = "context_detection.embeddingindex"
type RepoEmbeddingIndexName string
func GetRepoEmbeddingIndexName(repoName api.RepoName) RepoEmbeddingIndexName {
fsSafeRepoName := nonAlphanumericCharsRegexp.ReplaceAllString(string(repoName), "_")
// Add a hash as well to avoid collisions between repo names that sanitize to the same string.
hash := md5.Sum([]byte(repoName))
return RepoEmbeddingIndexName(fmt.Sprintf(`%s_%s.embeddingindex`, fsSafeRepoName, hex.EncodeToString(hash[:])))
}
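
So a repo name like `github.com/sourcegraph/sourcegraph` maps to an index name of the form `github_com_sourcegraph_sourcegraph_<32-hex-char-md5>.embeddingindex`. For example:

```go
package main

import (
	"fmt"

	"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
	"github.com/sourcegraph/sourcegraph/internal/api"
)

func main() {
	name := embeddings.GetRepoEmbeddingIndexName(api.RepoName("github.com/sourcegraph/sourcegraph"))
	// Prints github_com_sourcegraph_sourcegraph_<md5 hex>.embeddingindex
	fmt.Println(name)
}
```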

@@ -0,0 +1,105 @@
package embeddings
import (
"container/heap"
"sort"
)
type nearestNeighbor struct {
index int
similarity float32
}
type nearestNeighborsHeap struct {
heap []nearestNeighbor
}
func (nn *nearestNeighborsHeap) Len() int { return len(nn.heap) }
func (nn *nearestNeighborsHeap) Less(i, j int) bool {
return nn.heap[i].similarity < nn.heap[j].similarity
}
func (nn *nearestNeighborsHeap) Swap(i, j int) { nn.heap[i], nn.heap[j] = nn.heap[j], nn.heap[i] }
func (nn *nearestNeighborsHeap) Push(x any) {
nn.heap = append(nn.heap, x.(nearestNeighbor))
}
func (nn *nearestNeighborsHeap) Pop() any {
old := nn.heap
n := len(old)
x := old[n-1]
nn.heap = old[0 : n-1]
return x
}
func (nn *nearestNeighborsHeap) Peek() nearestNeighbor {
return nn.heap[0]
}
func newNearestNeighborsHeap() *nearestNeighborsHeap {
nn := &nearestNeighborsHeap{heap: make([]nearestNeighbor, 0)}
heap.Init(nn)
return nn
}
func min(a, b int) int {
if a < b {
return a
}
return b
}
// SimilaritySearch finds the `nResults` most similar rows to a query vector. It uses the cosine similarity metric.
// IMPORTANT: The vectors in the embedding index have to be normalized for similarity search to work correctly.
func (index *EmbeddingIndex[T]) SimilaritySearch(query []float32, nResults int) []*T {
// TODO: Parallelize. Split the rows among N threads, each finds `nResults` within its chunk, combine the heaps, sort, return top `nResults`.
if nResults == 0 {
return []*T{}
}
nRows := len(index.RowMetadata)
nResults = min(nRows, nResults)
nnHeap := newNearestNeighborsHeap()
for i := 0; i < nResults; i++ {
similarity := CosineSimilarity(
index.Embeddings[i*index.ColumnDimension:(i+1)*index.ColumnDimension],
query,
)
heap.Push(nnHeap, nearestNeighbor{i, similarity})
}
for i := nResults; i < nRows; i++ {
similarity := CosineSimilarity(
index.Embeddings[i*index.ColumnDimension:(i+1)*index.ColumnDimension],
query,
)
// Add the row only if its similarity is greater than the smallest similarity in the heap.
// This keeps the heap holding the rows with the highest similarities seen so far.
if similarity > nnHeap.Peek().similarity {
heap.Pop(nnHeap)
heap.Push(nnHeap, nearestNeighbor{i, similarity})
}
}
sort.Slice(nnHeap.heap, func(i, j int) bool {
return nnHeap.heap[i].similarity > nnHeap.heap[j].similarity
})
results := make([]*T, len(nnHeap.heap))
for idx, neighbor := range nnHeap.heap {
results[idx] = &index.RowMetadata[neighbor.index]
}
return results
}
// CosineSimilarity computes the dot product of row and query, which equals their cosine
// similarity when both vectors are normalized.
// TODO: Potentially inline this for performance benefits.
func CosineSimilarity(row []float32, query []float32) float32 {
similarity := float32(0.0)
for i := 0; i < len(row); i++ {
similarity += (row[i] * query[i])
}
return similarity
}
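
Since `CosineSimilarity` is a plain dot product, it only equals the true cosine similarity when the index and query vectors are unit length (as the `IMPORTANT` note above says; some providers, e.g. OpenAI, already return normalized embeddings). A sketch of a normalization helper for cases where that isn't guaranteed — not part of this diff:

```go
package embeddings

import "math"

// l2Normalize scales v in place to unit length so that a dot product between
// two normalized vectors equals their cosine similarity.
func l2Normalize(v []float32) {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := math.Sqrt(sum)
	if norm == 0 {
		return
	}
	for i := range v {
		v[i] = float32(float64(v[i]) / norm)
	}
}
```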

@@ -0,0 +1,42 @@
package embeddings
import (
"testing"
"github.com/hexops/autogold"
)
func TestSimilaritySearch(t *testing.T) {
embeddings := []float32{
0.26726124, 0.53452248, 0.80178373,
0.45584231, 0.56980288, 0.68376346,
0.50257071, 0.57436653, 0.64616234,
}
index := EmbeddingIndex[RepoEmbeddingRowMetadata]{
Embeddings: embeddings,
ColumnDimension: 3,
RowMetadata: []RepoEmbeddingRowMetadata{
{FileName: "a"},
{FileName: "b"},
{FileName: "c"},
},
}
t.Run("find row with exact match", func(t *testing.T) {
query := embeddings[0:3]
results := index.SimilaritySearch(query, 1)
autogold.Equal(t, results)
})
t.Run("find nearest neighbors", func(t *testing.T) {
query := []float32{0.87006284, 0.48336824, 0.09667365}
results := index.SimilaritySearch(query, 2)
autogold.Equal(t, results)
})
t.Run("request more results then there are rows", func(t *testing.T) {
query := embeddings[0:3]
results := index.SimilaritySearch(query, 5)
autogold.Equal(t, results)
})
}

@@ -0,0 +1,83 @@
package split
import (
"strings"
"github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings"
)
var splittableLinePrefixes = []string{
"//",
"#",
"/*",
"func",
"var",
"const",
"fn",
"public",
"private",
"type",
}
func isSplittableLine(line string) bool {
trimmedLine := strings.TrimSpace(line)
if len(trimmedLine) == 0 {
return true
}
for _, prefix := range splittableLinePrefixes {
if strings.HasPrefix(line, prefix) {
return true
}
}
return false
}
type SplitOptions struct {
NoSplitTokensThreshold int
ChunkTokensThreshold int
ChunkEarlySplitTokensThreshold int
}
type EmbeddableChunk struct {
FileName string
StartLine int
EndLine int
Content string
}
// SplitIntoEmbeddableChunks splits the given text into embeddable chunks.
//
// The text is split on newline characters into lines. The lines are then grouped into chunks based on the split options.
// When the token sum of lines in a chunk exceeds the chunk token threshold or an early split token threshold is met
// and the current line is splittable (empty line, or starts with a comment or declaration), a chunk is ended and added to the results.
func SplitIntoEmbeddableChunks(text string, fileName string, splitOptions SplitOptions) []EmbeddableChunk {
// If the text is short enough, embed the entire file rather than splitting it into chunks.
if embeddings.EstimateTokens(text) < splitOptions.NoSplitTokensThreshold {
return []EmbeddableChunk{{FileName: fileName, StartLine: 0, EndLine: strings.Count(text, "\n") + 1, Content: text}}
}
chunks := []EmbeddableChunk{}
startLine, tokensSum := 0, 0
lines := strings.Split(text, "\n")
addChunk := func(endLine int) {
content := strings.Join(lines[startLine:endLine], "\n")
if len(content) > 0 {
chunks = append(chunks, EmbeddableChunk{FileName: fileName, StartLine: startLine, EndLine: endLine, Content: content})
}
startLine, tokensSum = endLine, 0
}
for i := 0; i < len(lines); i++ {
if tokensSum > splitOptions.ChunkTokensThreshold || (tokensSum > splitOptions.ChunkEarlySplitTokensThreshold && isSplittableLine(lines[i])) {
addChunk(i)
}
tokensSum += embeddings.EstimateTokens(lines[i])
}
if tokensSum > 0 {
addChunk(len(lines))
}
return chunks
}

@@ -0,0 +1,23 @@
package split
import (
"testing"
"github.com/hexops/autogold"
)
func TestSplitIntoEmbeddableChunks(t *testing.T) {
content := `Line
Line
Line
Line
Line
Line
Line
Line
`
chunks := SplitIntoEmbeddableChunks(content, "", SplitOptions{ChunkTokensThreshold: 4, ChunkEarlySplitTokensThreshold: 1})
autogold.Equal(t, chunks)
}

@@ -0,0 +1,16 @@
[]split.EmbeddableChunk{
{
EndLine: 4,
Content: "Line\nLine\nLine\nLine",
},
{
StartLine: 4,
EndLine: 7,
Content: "\nLine\nLine",
},
{
StartLine: 7,
EndLine: 10,
Content: "\nLine\nLine",
},
}

@@ -0,0 +1,6 @@
[]*embeddings.RepoEmbeddingRowMetadata{
{
FileName: "c",
},
{FileName: "b"},
}

@@ -0,0 +1,3 @@
[]*embeddings.RepoEmbeddingRowMetadata{{
FileName: "a",
}}

@@ -0,0 +1,7 @@
[]*embeddings.RepoEmbeddingRowMetadata{
{
FileName: "a",
},
{FileName: "b"},
{FileName: "c"},
}

@@ -0,0 +1,9 @@
package embeddings
import "math"
const CHARS_PER_TOKEN = 4
func EstimateTokens(text string) int {
return int(math.Ceil(float64(len(text)) / float64(CHARS_PER_TOKEN)))
}
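
This is a rough 4-characters-per-token heuristic rather than a real tokenizer; for example, a 10-character string estimates to ceil(10/4) = 3 tokens. A hypothetical test spelling that out:

```go
package embeddings

import "testing"

func TestEstimateTokens(t *testing.T) {
	if got := EstimateTokens("0123456789"); got != 3 { // ceil(10 / 4) == 3
		t.Fatalf("got %d, want 3", got)
	}
	if got := EstimateTokens(""); got != 0 {
		t.Fatalf("got %d, want 0", got)
	}
}
```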

@@ -0,0 +1,39 @@
package embeddings
import "github.com/sourcegraph/sourcegraph/internal/api"
type EmbeddingIndex[T any] struct {
Embeddings []float32 `json:"embeddings"`
ColumnDimension int `json:"columnDimension"`
RowMetadata []T `json:"rowMetadata"`
}
type RepoEmbeddingRowMetadata struct {
FileName string `json:"fileName"`
StartLine int `json:"startLine"`
EndLine int `json:"endLine"`
}
type RepoEmbeddingIndex struct {
RepoName api.RepoName `json:"repoName"`
Revision api.CommitID `json:"revision"`
CodeIndex *EmbeddingIndex[RepoEmbeddingRowMetadata] `json:"codeIndex"`
TextIndex *EmbeddingIndex[RepoEmbeddingRowMetadata] `json:"textIndex"`
}
type ContextDetectionEmbeddingIndex struct {
MessagesWithAdditionalContextMeanEmbedding []float32 `json:"messagesWithAdditionalContextMeanEmbedding"`
MessagesWithoutAdditionalContextMeanEmbedding []float32 `json:"messagesWithoutAdditionalContextMeanEmbedding"`
}
type EmbeddingSearchResults struct {
CodeResults []EmbeddingSearchResult `json:"codeResults"`
TextResults []EmbeddingSearchResult `json:"textResults"`
}
type EmbeddingSearchResult struct {
FileName string `json:"fileName"`
StartLine int `json:"startLine"`
EndLine int `json:"endLine"`
Content string `json:"content"`
}

@@ -0,0 +1,82 @@
package embeddings

import (
	"context"
	"strings"

	"github.com/sourcegraph/sourcegraph/internal/env"
	"github.com/sourcegraph/sourcegraph/internal/observation"
	"github.com/sourcegraph/sourcegraph/internal/uploadstore"
	"github.com/sourcegraph/sourcegraph/lib/errors"
)

type EmbeddingsUploadStoreConfig struct {
	env.BaseConfig

	Backend      string
	ManageBucket bool
	Bucket       string

	S3Region          string
	S3Endpoint        string
	S3UsePathStyle    bool
	S3AccessKeyID     string
	S3SecretAccessKey string
	S3SessionToken    string

	GCSProjectID               string
	GCSCredentialsFile         string
	GCSCredentialsFileContents string
}

func (c *EmbeddingsUploadStoreConfig) Load() {
	c.Backend = strings.ToLower(c.Get("EMBEDDINGS_UPLOAD_BACKEND", "blobstore", "The target file service for embeddings. S3, GCS, and Blobstore are supported."))
	c.ManageBucket = c.GetBool("EMBEDDINGS_UPLOAD_MANAGE_BUCKET", "false", "Whether or not the client should manage the target bucket configuration.")
	c.Bucket = c.Get("EMBEDDINGS_UPLOAD_BUCKET", "embeddings", "The name of the bucket to store embeddings in.")

	if c.Backend != "blobstore" && c.Backend != "s3" && c.Backend != "gcs" {
		c.AddError(errors.Errorf("invalid backend %q for EMBEDDINGS_UPLOAD_BACKEND: must be S3, GCS, or Blobstore", c.Backend))
	}

	if c.Backend == "blobstore" || c.Backend == "s3" {
		c.S3Region = c.Get("EMBEDDINGS_UPLOAD_AWS_REGION", "us-east-1", "The target AWS region.")
		c.S3Endpoint = c.Get("EMBEDDINGS_UPLOAD_AWS_ENDPOINT", "http://blobstore:9000", "The target AWS endpoint.")
		c.S3UsePathStyle = c.GetBool("EMBEDDINGS_UPLOAD_AWS_USE_PATH_STYLE", "false", "Whether to use path calling (vs subdomain calling).")
		ec2RoleCredentials := c.GetBool("EMBEDDINGS_UPLOAD_AWS_USE_EC2_ROLE_CREDENTIALS", "false", "Whether to use the EC2 metadata API, or use the provided static credentials.")

		if !ec2RoleCredentials {
			c.S3AccessKeyID = c.Get("EMBEDDINGS_UPLOAD_AWS_ACCESS_KEY_ID", "AKIAIOSFODNN7EXAMPLE", "An AWS access key associated with a user with access to S3.")
			c.S3SecretAccessKey = c.Get("EMBEDDINGS_UPLOAD_AWS_SECRET_ACCESS_KEY", "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY", "An AWS secret key associated with a user with access to S3.")
			c.S3SessionToken = c.GetOptional("EMBEDDINGS_UPLOAD_AWS_SESSION_TOKEN", "An optional AWS session token associated with a user with access to S3.")
		}
	} else if c.Backend == "gcs" {
		c.GCSProjectID = c.Get("EMBEDDINGS_UPLOAD_GCP_PROJECT_ID", "", "The project containing the GCS bucket.")
		c.GCSCredentialsFile = c.GetOptional("EMBEDDINGS_UPLOAD_GOOGLE_APPLICATION_CREDENTIALS_FILE", "The path to a service account key file with access to GCS.")
		c.GCSCredentialsFileContents = c.GetOptional("EMBEDDINGS_UPLOAD_GOOGLE_APPLICATION_CREDENTIALS_FILE_CONTENT", "The contents of a service account key file with access to GCS.")
	}
}

var EmbeddingsUploadStoreConfigInst = &EmbeddingsUploadStoreConfig{}

func NewEmbeddingsUploadStore(ctx context.Context, observationCtx *observation.Context, conf *EmbeddingsUploadStoreConfig) (uploadstore.Store, error) {
	c := uploadstore.Config{
		Backend:      conf.Backend,
		ManageBucket: conf.ManageBucket,
		Bucket:       conf.Bucket,
		S3: uploadstore.S3Config{
			Region:          conf.S3Region,
			Endpoint:        conf.S3Endpoint,
			UsePathStyle:    conf.S3UsePathStyle,
			AccessKeyID:     conf.S3AccessKeyID,
			SecretAccessKey: conf.S3SecretAccessKey,
			SessionToken:    conf.S3SessionToken,
		},
		GCS: uploadstore.GCSConfig{
			ProjectID:               conf.GCSProjectID,
			CredentialsFile:         conf.GCSCredentialsFile,
			CredentialsFileContents: conf.GCSCredentialsFileContents,
		},
	}
	return uploadstore.CreateLazy(ctx, c, uploadstore.NewOperations(observationCtx, "embeddings", "uploadstore"))
}
```
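
For context, a minimal wiring sketch (not in the diff) of how this is intended to be consumed at service startup. `Validate` is assumed to be the standard `env.BaseConfig` accumulated-error check, and the `observation.Context` comes from the caller:

```go
// Hypothetical startup wiring for the embeddings upload store.
func newEmbeddingsUploadStore(ctx context.Context, observationCtx *observation.Context) (uploadstore.Store, error) {
	conf := EmbeddingsUploadStoreConfigInst
	conf.Load() // reads the EMBEDDINGS_UPLOAD_* environment variables
	if err := conf.Validate(); err != nil {
		return nil, err
	}
	return NewEmbeddingsUploadStore(ctx, observationCtx, conf)
}
```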

### go.mod

```
@@ -257,6 +257,7 @@ require (
	github.com/golang-jwt/jwt v3.2.2+incompatible
	github.com/google/go-github/v47 v47.1.0
	github.com/grpc-ecosystem/go-grpc-middleware/providers/openmetrics/v2 v2.0.0-rc.3
	github.com/hexops/autogold v1.3.1
	github.com/hexops/autogold/v2 v2.1.0
	github.com/k3a/html2text v1.1.0
	github.com/opsgenie/opsgenie-go-sdk-v2 v1.2.13

```

### go.sum

```
@@ -1364,6 +1364,7 @@ github.com/hashicorp/yamux v0.0.0-20180604194846-3520598351bb/go.mod h1:+NfK9FKe
github.com/hashicorp/yamux v0.0.0-20181012175058-2f1d1f20f75d/go.mod h1:+NfK9FKeTrX5uv1uIXGdwYDTeHna2qgaIlx54MXqjAM=
github.com/hexops/autogold v0.8.1/go.mod h1:97HLDXyG23akzAoRYJh/2OBs3kd80eHyKPvZw0S5ZBY=
github.com/hexops/autogold v1.3.1 h1:YgxF9OHWbEIUjhDbpnLhgVsjUDsiHDTyDfy2lrfdlzo=
github.com/hexops/autogold v1.3.1/go.mod h1:sQO+mQUCVfxOKPht+ipDSkJ2SCJ7BNJVHZexsXqWMx4=
github.com/hexops/autogold/v2 v2.1.0 h1:5s9J6CROngFPkgowSkV20bIflBrImSdDqIpoXJeZSkU=
github.com/hexops/autogold/v2 v2.1.0/go.mod h1:cYVc0tJn6v9Uf9xMOHvmH6scuTxsVJSxGcKR/yOVPzY=
github.com/hexops/gotextdiff v1.0.3 h1:gitA9+qJrrTCsiCl7+kh75nPqQt1cx4ZkudSTLoUqJM=

```

### internal/binary/binary.go (new file)

```go
package binary

import (
	"net/http"
	"strings"
	"unicode/utf8"
)

// IsBinary is a helper to tell if the content of a file is binary or not.
func IsBinary(content []byte) bool {
	// We first check if the file is valid UTF8, since we always consider that
	// to be non-binary.
	//
	// Secondly, if the file is not valid UTF8, we check if the detected HTTP
	// content type is text, which covers a whole slew of other non-UTF8 text
	// encodings for us.
	return !utf8.Valid(content) && !strings.HasPrefix(http.DetectContentType(content), "text/")
}
```
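
A quick usage sketch (not in the diff) showing both paths through the check, with hypothetical inputs:

```go
package main

import (
	"fmt"

	"github.com/sourcegraph/sourcegraph/internal/binary"
)

func main() {
	// Valid UTF-8 is never considered binary.
	fmt.Println(binary.IsBinary([]byte("package main\n"))) // false

	// Invalid UTF-8 that sniffs as application/octet-stream is binary.
	fmt.Println(binary.IsBinary([]byte{0x00, 0xff, 0xfe, 0x01})) // true
}
```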

### Service connections (adds the embeddings addresses)

```go
@@ -32,6 +32,8 @@ type ServiceConnections struct {
	Searchers []string `json:"searchers"`
	// Symbols is the addresses of symbol instances that should be talked to.
	Symbols []string `json:"symbols"`
	// Embeddings is the addresses of embeddings instances that should be talked to.
	Embeddings []string `json:"embeddings"`
	// Zoekts is the addresses of Zoekt instances to talk to.
	Zoekts []string `json:"zoekts"`
	// ZoektListTTL is the TTL of the internal cache that Zoekt clients use to

```

### Site config secret redaction

```go
@@ -219,6 +219,7 @@ var siteConfigSecrets = []struct {
	{readPath: `auth\.unlockAccountLinkSigningKey`, editPaths: []string{"auth.unlockAccountLinkSigningKey"}},
	{readPath: `dotcom.srcCliVersionCache.github.token`, editPaths: []string{"dotcom", "srcCliVersionCache", "github", "token"}},
	{readPath: `dotcom.srcCliVersionCache.github.webhookSecret`, editPaths: []string{"dotcom", "srcCliVersionCache", "github", "webhookSecret"}},
	{readPath: `embeddings.accessToken`, editPaths: []string{"embeddings", "accessToken"}},
}
// UnredactSecrets unredacts unchanged secrets back to their original value for

```

### Database schema description (generated JSON)

```json
@@ -529,6 +529,15 @@
"Increment": 1,
"CycleOption": "NO"
},
{
"Name": "context_detection_embedding_jobs_id_seq",
"TypeName": "integer",
"StartValue": 1,
"MinimumValue": 1,
"MaximumValue": 2147483647,
"Increment": 1,
"CycleOption": "NO"
},
{
"Name": "critical_and_site_config_id_seq",
"TypeName": "bigint",

@@ -952,6 +961,15 @@

"Increment": 1,
"CycleOption": "NO"
},
{
"Name": "repo_embedding_jobs_id_seq",
"TypeName": "integer",
"StartValue": 1,
"MinimumValue": 1,
"MaximumValue": 2147483647,
"Increment": 1,
"CycleOption": "NO"
},
{
"Name": "repo_id_seq",
"TypeName": "bigint",

@@ -8297,6 +8315,195 @@

"Constraints": null,
"Triggers": []
},
{
"Name": "context_detection_embedding_jobs",
"Comment": "",
"Columns": [
{
"Name": "cancel",
"Index": 13,
"TypeName": "boolean",
"IsNullable": false,
"Default": "false",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "execution_logs",
"Index": 11,
"TypeName": "json[]",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "failure_message",
"Index": 3,
"TypeName": "text",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "finished_at",
"Index": 6,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "id",
"Index": 1,
"TypeName": "integer",
"IsNullable": false,
"Default": "nextval('context_detection_embedding_jobs_id_seq'::regclass)",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "last_heartbeat_at",
"Index": 10,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "num_failures",
"Index": 9,
"TypeName": "integer",
"IsNullable": false,
"Default": "0",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "num_resets",
"Index": 8,
"TypeName": "integer",
"IsNullable": false,
"Default": "0",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "process_after",
"Index": 7,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "queued_at",
"Index": 4,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "now()",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "started_at",
"Index": 5,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "state",
"Index": 2,
"TypeName": "text",
"IsNullable": true,
"Default": "'queued'::text",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "worker_hostname",
"Index": 12,
"TypeName": "text",
"IsNullable": false,
"Default": "''::text",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
}
],
"Indexes": [
{
"Name": "context_detection_embedding_jobs_pkey",
"IsPrimaryKey": true,
"IsUnique": true,
"IsExclusion": false,
"IsDeferrable": false,
"IndexDefinition": "CREATE UNIQUE INDEX context_detection_embedding_jobs_pkey ON context_detection_embedding_jobs USING btree (id)",
"ConstraintType": "p",
"ConstraintDefinition": "PRIMARY KEY (id)"
}
],
"Constraints": null,
"Triggers": []
},
{
"Name": "critical_and_site_config",
"Comment": "",

@@ -19986,6 +20193,221 @@

}
]
},
{
"Name": "repo_embedding_jobs",
"Comment": "",
"Columns": [
{
"Name": "cancel",
"Index": 13,
"TypeName": "boolean",
"IsNullable": false,
"Default": "false",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "execution_logs",
"Index": 11,
"TypeName": "json[]",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "failure_message",
"Index": 3,
"TypeName": "text",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "finished_at",
"Index": 6,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "id",
"Index": 1,
"TypeName": "integer",
"IsNullable": false,
"Default": "nextval('repo_embedding_jobs_id_seq'::regclass)",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "last_heartbeat_at",
"Index": 10,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "num_failures",
"Index": 9,
"TypeName": "integer",
"IsNullable": false,
"Default": "0",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "num_resets",
"Index": 8,
"TypeName": "integer",
"IsNullable": false,
"Default": "0",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "process_after",
"Index": 7,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "queued_at",
"Index": 4,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "now()",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "repo_id",
"Index": 14,
"TypeName": "integer",
"IsNullable": false,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "revision",
"Index": 15,
"TypeName": "text",
"IsNullable": false,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "started_at",
"Index": 5,
"TypeName": "timestamp with time zone",
"IsNullable": true,
"Default": "",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "state",
"Index": 2,
"TypeName": "text",
"IsNullable": true,
"Default": "'queued'::text",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
},
{
"Name": "worker_hostname",
"Index": 12,
"TypeName": "text",
"IsNullable": false,
"Default": "''::text",
"CharacterMaximumLength": 0,
"IsIdentity": false,
"IdentityGeneration": "",
"IsGenerated": "NEVER",
"GenerationExpression": "",
"Comment": ""
}
],
"Indexes": [
{
"Name": "repo_embedding_jobs_pkey",
"IsPrimaryKey": true,
"IsUnique": true,
"IsExclusion": false,
"IsDeferrable": false,
"IndexDefinition": "CREATE UNIQUE INDEX repo_embedding_jobs_pkey ON repo_embedding_jobs USING btree (id)",
"ConstraintType": "p",
"ConstraintDefinition": "PRIMARY KEY (id)"
}
],
"Constraints": null,
"Triggers": []
},
{
"Name": "repo_kvps",
"Comment": "",

```

### Database schema docs (generated Markdown)

@@ -1083,6 +1083,28 @@ Indexes:
**transition_columns**: Array of changes that occurred to the upload for this entry, in the form of {"column"=>"<column name>", "old"=>"<previous value>", "new"=>"<new value>"}.

# Table "public.context_detection_embedding_jobs"
```
Column | Type | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+--------------------------------------------------------------
id | integer | | not null | nextval('context_detection_embedding_jobs_id_seq'::regclass)
state | text | | | 'queued'::text
failure_message | text | | |
queued_at | timestamp with time zone | | | now()
started_at | timestamp with time zone | | |
finished_at | timestamp with time zone | | |
process_after | timestamp with time zone | | |
num_resets | integer | | not null | 0
num_failures | integer | | not null | 0
last_heartbeat_at | timestamp with time zone | | |
execution_logs | json[] | | |
worker_hostname | text | | not null | ''::text
cancel | boolean | | not null | false
Indexes:
"context_detection_embedding_jobs_pkey" PRIMARY KEY, btree (id)
```
# Table "public.critical_and_site_config"
```
Column | Type | Collation | Nullable | Default
```

@@ -3067,6 +3089,30 @@ Triggers:

# Table "public.repo_embedding_jobs"
```
Column | Type | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+-------------------------------------------------
id | integer | | not null | nextval('repo_embedding_jobs_id_seq'::regclass)
state | text | | | 'queued'::text
failure_message | text | | |
queued_at | timestamp with time zone | | | now()
started_at | timestamp with time zone | | |
finished_at | timestamp with time zone | | |
process_after | timestamp with time zone | | |
num_resets | integer | | not null | 0
num_failures | integer | | not null | 0
last_heartbeat_at | timestamp with time zone | | |
execution_logs | json[] | | |
worker_hostname | text | | not null | ''::text
cancel | boolean | | not null | false
repo_id | integer | | not null |
revision | text | | not null |
Indexes:
"repo_embedding_jobs_pkey" PRIMARY KEY, btree (id)
```
# Table "public.repo_kvps"
```
Column | Type | Collation | Nullable | Default
```

### Migration: down.sql (new file)

```sql
DROP TABLE IF EXISTS repo_embedding_jobs;
DROP TABLE IF EXISTS context_detection_embedding_jobs;

```

### Migration: metadata.yaml (new file)

```yaml
name: add repo embedding jobs
parents: [1675647612]

```

### Migration: up.sql (new file)

```sql
CREATE TABLE IF NOT EXISTS repo_embedding_jobs (
    id SERIAL PRIMARY KEY,
    state text DEFAULT 'queued',
    failure_message text,
    queued_at timestamp with time zone DEFAULT NOW(),
    started_at timestamp with time zone,
    finished_at timestamp with time zone,
    process_after timestamp with time zone,
    num_resets integer not null default 0,
    num_failures integer not null default 0,
    last_heartbeat_at timestamp with time zone,
    execution_logs json[],
    worker_hostname text not null default '',
    cancel boolean DEFAULT false NOT NULL,
    -- additional columns
    repo_id integer not null,
    revision text not null
);

CREATE TABLE IF NOT EXISTS context_detection_embedding_jobs (
    id SERIAL PRIMARY KEY,
    state text DEFAULT 'queued',
    failure_message text,
    queued_at timestamp with time zone DEFAULT NOW(),
    started_at timestamp with time zone,
    finished_at timestamp with time zone,
    process_after timestamp with time zone,
    num_resets integer not null default 0,
    num_failures integer not null default 0,
    last_heartbeat_at timestamp with time zone,
    execution_logs json[],
    worker_hostname text not null default '',
    cancel boolean DEFAULT false NOT NULL
);
```
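
Both tables follow the conventional worker-queue columns, so the queue operations fall out of the schema. A hedged sketch of both sides (the PR itself goes through a `RepoEmbeddingJobsStore` and the dbworker machinery rather than raw SQL; `db` here is a plain `*sql.DB` against these tables):

```go
package example

import (
	"context"
	"database/sql"
)

// Enqueue: only repo_id and revision are required; state defaults to
// 'queued' and queued_at to now(), per the migration above.
func enqueueRepoEmbeddingJob(ctx context.Context, db *sql.DB, repoID int, revision string) (int, error) {
	var id int
	err := db.QueryRowContext(ctx,
		`INSERT INTO repo_embedding_jobs (repo_id, revision) VALUES ($1, $2) RETURNING id`,
		repoID, revision,
	).Scan(&id)
	return id, err
}

// Dequeue: claim the oldest runnable queued job using the standard
// FOR UPDATE SKIP LOCKED pattern implied by the worker columns.
func claimRepoEmbeddingJob(ctx context.Context, db *sql.DB, hostname string) (id, repoID int, revision string, err error) {
	err = db.QueryRowContext(ctx, `
		UPDATE repo_embedding_jobs
		SET state = 'processing', started_at = now(), worker_hostname = $1
		WHERE id = (
			SELECT id FROM repo_embedding_jobs
			WHERE state = 'queued' AND (process_after IS NULL OR process_after <= now())
			ORDER BY queued_at
			LIMIT 1
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id, repo_id, revision`,
		hostname,
	).Scan(&id, &repoID, &revision)
	return id, repoID, revision, err
}
```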

### Generated SQL schema

```sql
@@ -1865,6 +1865,32 @@ CREATE SEQUENCE configuration_policies_audit_logs_seq
ALTER SEQUENCE configuration_policies_audit_logs_seq OWNED BY configuration_policies_audit_logs.sequence;

CREATE TABLE context_detection_embedding_jobs (
    id integer NOT NULL,
    state text DEFAULT 'queued'::text,
    failure_message text,
    queued_at timestamp with time zone DEFAULT now(),
    started_at timestamp with time zone,
    finished_at timestamp with time zone,
    process_after timestamp with time zone,
    num_resets integer DEFAULT 0 NOT NULL,
    num_failures integer DEFAULT 0 NOT NULL,
    last_heartbeat_at timestamp with time zone,
    execution_logs json[],
    worker_hostname text DEFAULT ''::text NOT NULL,
    cancel boolean DEFAULT false NOT NULL
);

CREATE SEQUENCE context_detection_embedding_jobs_id_seq
    AS integer
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;

ALTER SEQUENCE context_detection_embedding_jobs_id_seq OWNED BY context_detection_embedding_jobs.id;

CREATE TABLE critical_and_site_config (
    id integer NOT NULL,
    type critical_or_site NOT NULL,

@@ -3716,6 +3742,34 @@ CREATE SEQUENCE registry_extensions_id_seq
ALTER SEQUENCE registry_extensions_id_seq OWNED BY registry_extensions.id;

CREATE TABLE repo_embedding_jobs (
    id integer NOT NULL,
    state text DEFAULT 'queued'::text,
    failure_message text,
    queued_at timestamp with time zone DEFAULT now(),
    started_at timestamp with time zone,
    finished_at timestamp with time zone,
    process_after timestamp with time zone,
    num_resets integer DEFAULT 0 NOT NULL,
    num_failures integer DEFAULT 0 NOT NULL,
    last_heartbeat_at timestamp with time zone,
    execution_logs json[],
    worker_hostname text DEFAULT ''::text NOT NULL,
    cancel boolean DEFAULT false NOT NULL,
    repo_id integer NOT NULL,
    revision text NOT NULL
);

CREATE SEQUENCE repo_embedding_jobs_id_seq
    AS integer
    START WITH 1
    INCREMENT BY 1
    NO MINVALUE
    NO MAXVALUE
    CACHE 1;

ALTER SEQUENCE repo_embedding_jobs_id_seq OWNED BY repo_embedding_jobs.id;

CREATE SEQUENCE repo_id_seq
    START WITH 1
    INCREMENT BY 1

@@ -4311,6 +4365,8 @@ ALTER TABLE ONLY codeowners ALTER COLUMN id SET DEFAULT nextval('codeowners_id_s
ALTER TABLE ONLY configuration_policies_audit_logs ALTER COLUMN sequence SET DEFAULT nextval('configuration_policies_audit_logs_seq'::regclass);
ALTER TABLE ONLY context_detection_embedding_jobs ALTER COLUMN id SET DEFAULT nextval('context_detection_embedding_jobs_id_seq'::regclass);
ALTER TABLE ONLY critical_and_site_config ALTER COLUMN id SET DEFAULT nextval('critical_and_site_config_id_seq'::regclass);
ALTER TABLE ONLY discussion_comments ALTER COLUMN id SET DEFAULT nextval('discussion_comments_id_seq'::regclass);

@@ -4403,6 +4459,8 @@ ALTER TABLE ONLY registry_extensions ALTER COLUMN id SET DEFAULT nextval('regist
ALTER TABLE ONLY repo ALTER COLUMN id SET DEFAULT nextval('repo_id_seq'::regclass);
ALTER TABLE ONLY repo_embedding_jobs ALTER COLUMN id SET DEFAULT nextval('repo_embedding_jobs_id_seq'::regclass);
ALTER TABLE ONLY roles ALTER COLUMN id SET DEFAULT nextval('roles_id_seq'::regclass);
ALTER TABLE ONLY saved_searches ALTER COLUMN id SET DEFAULT nextval('saved_searches_id_seq'::regclass);

@@ -4568,6 +4626,9 @@ ALTER TABLE ONLY codeowners
ALTER TABLE ONLY codeowners
    ADD CONSTRAINT codeowners_repo_id_key UNIQUE (repo_id);

ALTER TABLE ONLY context_detection_embedding_jobs
    ADD CONSTRAINT context_detection_embedding_jobs_pkey PRIMARY KEY (id);

ALTER TABLE ONLY critical_and_site_config
    ADD CONSTRAINT critical_and_site_config_pkey PRIMARY KEY (id);

@@ -4781,6 +4842,9 @@ ALTER TABLE ONLY registry_extension_releases
ALTER TABLE ONLY registry_extensions
    ADD CONSTRAINT registry_extensions_pkey PRIMARY KEY (id);

ALTER TABLE ONLY repo_embedding_jobs
    ADD CONSTRAINT repo_embedding_jobs_pkey PRIMARY KEY (id);

ALTER TABLE ONLY repo_kvps
    ADD CONSTRAINT repo_kvps_pkey PRIMARY KEY (repo_id, key) INCLUDE (value);
```

### Mock generation config

```yaml
@@ -115,6 +115,10 @@
    interfaces:
      - UserEmailsService
      - ReposService
  - filename: enterprise/internal/embeddings/background/repo/mocks_temp.go
    path: github.com/sourcegraph/sourcegraph/enterprise/internal/embeddings/background/repo
    interfaces:
      - RepoEmbeddingJobsStore
  - filename: enterprise/internal/executor/store/mocks_temp.go
    path: github.com/sourcegraph/sourcegraph/enterprise/internal/executor/store
    interfaces:

```

### Generated site configuration types (Go)

```go
@@ -575,6 +575,20 @@ type EmailTemplates struct {
	SetPassword *EmailTemplate `json:"setPassword,omitempty"`
}

// Embeddings description: Configuration for embeddings service.
type Embeddings struct {
	// AccessToken description: The access token used to authenticate with the external embedding API service.
	AccessToken string `json:"accessToken"`
	// Dimensions description: The dimensionality of the embedding vectors.
	Dimensions int `json:"dimensions"`
	// Enabled description: Toggles whether embedding service is enabled.
	Enabled bool `json:"enabled"`
	// Model description: The model used for embedding.
	Model string `json:"model"`
	// Url description: The url to the external embedding API service.
	Url string `json:"url"`
}

// EncryptionKey description: Config for a key
type EncryptionKey struct {
	Cloudkms *CloudKMSEncryptionKey

@@ -2368,6 +2382,8 @@ type SiteConfiguration struct {
	EmailSmtp *SMTPServerConfig `json:"email.smtp,omitempty"`
	// EmailTemplates description: Configurable templates for some email types sent by Sourcegraph.
	EmailTemplates *EmailTemplates `json:"email.templates,omitempty"`
	// Embeddings description: Configuration for embeddings service.
	Embeddings *Embeddings `json:"embeddings,omitempty"`
	// EncryptionKeys description: Configuration for encryption keys used to encrypt data at rest in the database.
	EncryptionKeys *EncryptionKeys `json:"encryption.keys,omitempty"`
	// ExecutorsAccessToken description: The shared secret between Sourcegraph and executors.

@@ -2576,6 +2592,7 @@ func (v *SiteConfiguration) UnmarshalJSON(data []byte) error {
	delete(m, "email.address")
	delete(m, "email.smtp")
	delete(m, "email.templates")
	delete(m, "embeddings")
	delete(m, "encryption.keys")
	delete(m, "executors.accessToken")
	delete(m, "executors.batcheshelperImage")

```

### site.schema.json

```json
@@ -1827,6 +1827,36 @@
"type": "string"
}
}
},
"embeddings": {
"description": "Configuration for embeddings service.",
"type": "object",
"required": ["enabled", "dimensions", "model", "accessToken", "url"],
"properties": {
"enabled": {
"description": "Toggles whether embedding service is enabled.",
"type": "boolean",
"default": false
},
"dimensions": {
"description": "The dimensionality of the embedding vectors.",
"type": "integer",
"minimum": 0
},
"model": {
"description": "The model used for embedding.",
"type": "string"
},
"accessToken": {
"description": "The access token used to authenticate with the external embedding API service.",
"type": "string"
},
"url": {
"description": "The url to the external embedding API service.",
"type": "string",
"format": "uri"
}
}
}
},
"definitions": {

### sg.config.yaml

```yaml
@@ -37,6 +37,7 @@ env:
  REPO_UPDATER_URL: http://127.0.0.1:3182
  REDIS_ENDPOINT: 127.0.0.1:6379
  SYMBOLS_URL: http://localhost:3184
  EMBEDDINGS_URL: http://localhost:9991
  SRC_SYNTECT_SERVER: http://localhost:9238
  SRC_FRONTEND_INTERNAL: localhost:3090
  GRAFANA_SERVER_URL: http://localhost:3370

@@ -370,6 +371,21 @@ commands:
      - enterprise/cmd/symbols
      - enterprise/internal/rockskip
  embeddings:
    cmd: .bin/embeddings
    install: |
      if [ -n "$DELVE" ]; then
        export GCFLAGS='all=-N -l'
      fi
      go build -gcflags="$GCFLAGS" -o .bin/embeddings github.com/sourcegraph/sourcegraph/enterprise/cmd/embeddings
    checkBinary: .bin/embeddings
    watch:
      - lib
      - internal
      - enterprise/cmd/embeddings
      - enterprise/internal/embeddings
  searcher:
    cmd: .bin/searcher
    install: |

@@ -1051,6 +1067,33 @@ commandsets:
      - batches-executor
      - batcheshelper-builder
  embeddings:
    requiresDevPrivate: true
    checks:
      - docker
      - redis
      - postgres
      - git
    commands:
      - embeddings
      - frontend
      - worker
      - repo-updater
      - web
      - gitserver-0
      - gitserver-1
      - searcher
      - symbols
      - caddy
      - docsite
      - syntax-highlighter
      - github-proxy
      - zoekt-index-0
      - zoekt-index-1
      - zoekt-web-0
      - zoekt-web-1
      - blobstore
  iam:
    requiresDevPrivate: true
    checks: