Add a sitemap for Sourcegraph.com covering 400k+ Go symbols and packages (#24490)

This PR adds a sitemap generation tool and a sitemap for Sourcegraph.com with 405,164 API docs pages and sub-pages, covering a wide variety of Go symbols and packages. Two examples:

* https://sourcegraph.com/github.com/golang/go/-/docs/net/http (page)
* https://sourcegraph.com/github.com/golang/go/-/docs/net/http?CookieJar (sub-page)

The sitemap is generated by a tool which issues approx 1.6 million GraphQL requests in order to discover all the pages and generate static sitemap.xml.gz files, which are then uploaded to [a GCS bucket](https://console.cloud.google.com/storage/browser/sitemap-sourcegraph-com;tab=objects?authuser=0&project=sourcegraph-dev&prefix=&forceOnObjectsSortingFiltering=false).
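
The on-disk layout is a standard sitemap index: `sitemap.xml.gz` contains no page URLs itself, only pointers to numbered shard files of at most 50,000 URLs each. Below is a minimal stdlib-only sketch of that index document; the real generator builds it with `github.com/snabb/sitemap` rather than hand-rolled XML, so the struct names here are illustrative:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// sitemapIndexXML mirrors the <sitemapindex> element served as sitemap.xml.gz:
// it lists no pages itself, only the numbered shard files.
type sitemapIndexXML struct {
	XMLName  xml.Name       `xml:"sitemapindex"`
	Xmlns    string         `xml:"xmlns,attr"`
	Sitemaps []sitemapEntry `xml:"sitemap"`
}

type sitemapEntry struct {
	Loc string `xml:"loc"`
}

// shardURL returns the GCS object URL for shard i, following the naming
// scheme the generator uses (sitemap_000.xml.gz, sitemap_001.xml.gz, ...).
func shardURL(i int) string {
	return fmt.Sprintf("https://storage.googleapis.com/sitemap-sourcegraph-com/sitemap_%03d.xml.gz", i)
}

func main() {
	idx := sitemapIndexXML{Xmlns: "http://www.sitemaps.org/schemas/sitemap/0.9"}
	for i := 0; i < 2; i++ {
		idx.Sitemaps = append(idx.Sitemaps, sitemapEntry{Loc: shardURL(i)})
	}
	out, err := xml.MarshalIndent(idx, "", "  ")
	if err != nil {
		panic(err)
	}
	// Prints a <sitemapindex> with two <sitemap><loc>…</loc></sitemap> entries.
	fmt.Println(string(out))
}
```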

## Improving the quality of our SEO

Today, we have no sitemap at all (yes, really!), so this is a first, small step toward ensuring people can find Sourcegraph through Google. Analysis shows that many of the pages Google has indexed on Sourcegraph today are garbage pages, such as empty README files or very old commits in repositories that were discovered by accident.

As such, I will begin the process of updating our metadata to instruct Google and others to not index many of the garbage pages they've indexed today.

## Ensuring the pages we ask Google to index are high quality

The pages included in this sitemap are only about 30% of our pages. I've eliminated over 70% because they do not meet these relatively high-quality inclusion criteria:

1. Is a public Go symbol/package.
2. Has a description with >100 characters of text.
3. Has at least one usage example.

The pages that remain come from just 2,778 repositories, with 6,159,920 GitHub stars total and 24,055 Go packages combined.

Only 6,247 symbols have a usage example from an external repository, and so for now I have chosen not to filter down to just pages with an external usage example. I hope we'll index many more Go repositories very soon to remedy this and then improve our inclusion criteria further to restrict to only symbols that have >=1 external usage example.

Over half a million pages are excluded because they have zero usage examples, and a further half a million or so are excluded because they are not exported/public symbols. 114 Go repos are missing API docs; it's not yet clear why: https://github.com/sourcegraph/sourcegraph/issues/24539
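The external-usage check in the generator is simple: a usage example counts as external when the referencing repository differs from the repository being documented. A minimal sketch with a hypothetical flattened input type (the real generator reads these from the `documentationReferences` GraphQL field):

```go
package main

import "fmt"

// reference is a hypothetical flattened view of one usage example.
type reference struct {
	RepoName string // repository the usage example lives in
}

// countExternal returns how many references come from a repository other
// than the one being documented.
func countExternal(docRepo string, refs []reference) int {
	n := 0
	for _, r := range refs {
		if r.RepoName != docRepo {
			n++
		}
	}
	return n
}

func main() {
	refs := []reference{
		{RepoName: "github.com/golang/go"},   // internal: same repo
		{RepoName: "github.com/example/app"}, // external
	}
	fmt.Println(countExternal("github.com/golang/go", refs)) // prints 1
}
```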

Once we've indexed more repositories, we will be able to tighten the inclusion criteria further. This is just a small stepping stone in the much larger effort of ensuring that what we serve to Google and others is actually high quality. We don't want low-quality content fed to Google; it's harmful to developers, and we haven't invested enough resources in improving this to date. With a bit of effort, we should be able to make big improvements and ensure any Sourcegraph link a developer comes across is high quality and truly useful.

Signed-off-by: Stephen Gutekanst <stephen@sourcegraph.com>
This commit is contained in:
Stephen Gutekanst 2021-09-02 16:51:52 -07:00 committed by GitHub
parent 71593bf836
commit e3c3963ca8
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
13 changed files with 1005 additions and 0 deletions

.gitignore (vendored, 4 changes)

@@ -144,3 +144,7 @@ storybook-static/
sg.config.overwrite.yaml
# sg Google Cloud API OAuth token
.sg.token.json
# Generated sitemaps are not committed, they're hosted in a GCS bucket.
sitemap/
sitemap_query.db


@@ -7,6 +7,7 @@ import (
"net/http"
"strconv"
"github.com/sourcegraph/sourcegraph/cmd/frontend/envvar"
"github.com/sourcegraph/sourcegraph/cmd/frontend/globals"
"github.com/sourcegraph/sourcegraph/cmd/frontend/internal/app/assetsutil"
"github.com/sourcegraph/sourcegraph/internal/env"
@@ -25,6 +26,9 @@ func robotsTxtHelper(w io.Writer, allowRobots bool) {
fmt.Fprintln(&buf, "User-agent: *")
if allowRobots {
fmt.Fprintln(&buf, "Allow: /")
if envvar.SourcegraphDotComMode() {
fmt.Fprintln(&buf, "Sitemap: https://storage.googleapis.com/sitemap-sourcegraph-com/sitemap.xml.gz")
}
} else {
fmt.Fprintln(&buf, "Disallow: /")
}

cmd/sitemap/README.md (new file, 16 lines)

@@ -0,0 +1,16 @@
# Sourcegraph sitemap generator
This tool is run offline to generate the sitemap files served at https://sourcegraph.com/sitemap.xml
To run it:
```sh
export SRC_ACCESS_TOKEN=...
./run.sh
```
Once run, it outputs some stats and generates the sitemap.xml files in `sitemap/`. You should then upload them:
```sh
gsutil cp -r sitemap/ gs://sitemap-sourcegraph-com
```

cmd/sitemap/graphql.go (new file, 116 lines)

@@ -0,0 +1,116 @@
package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io/ioutil"
"net/http"
"strconv"
"time"
"github.com/sourcegraph/sourcegraph/internal/env"
"github.com/sourcegraph/sourcegraph/internal/httpcli"
"github.com/cockroachdb/errors"
)
// This file contains all the methods required to execute Sourcegraph GraphQL API requests.
var (
graphQLTimeout, _ = time.ParseDuration(env.Get("GRAPHQL_TIMEOUT", "30s", "Timeout for GraphQL HTTP requests"))
graphQLRetryDelayBase, _ = time.ParseDuration(env.Get("GRAPHQL_RETRY_DELAY_BASE", "200ms", "Base retry delay duration for GraphQL HTTP requests"))
graphQLRetryDelayMax, _ = time.ParseDuration(env.Get("GRAPHQL_RETRY_DELAY_MAX", "3s", "Max retry delay duration for GraphQL HTTP requests"))
graphQLRetryMaxAttempts, _ = strconv.Atoi(env.Get("GRAPHQL_RETRY_MAX_ATTEMPTS", "20", "Max retry attempts for GraphQL HTTP requests"))
)
// graphQLQuery describes a general GraphQL query and its variables.
type graphQLQuery struct {
Query string `json:"query"`
Variables interface{} `json:"variables"`
}
type graphQLClient struct {
URL string
Token string
factory *httpcli.Factory
}
// requestGraphQL performs a GraphQL request with the given query and variables,
// returning the raw JSON response. The queryName is appended to the request URL
// so the source of the request is identifiable in logs.
func (c *graphQLClient) requestGraphQL(ctx context.Context, queryName string, query string, variables interface{}) ([]byte, error) {
var buf bytes.Buffer
err := json.NewEncoder(&buf).Encode(graphQLQuery{
Query: query,
Variables: variables,
})
if err != nil {
return nil, errors.Wrap(err, "Encode")
}
req, err := http.NewRequest("POST", c.URL+"?"+queryName, &buf)
if err != nil {
return nil, errors.Wrap(err, "NewRequest")
}
if c.Token != "" {
req.Header.Set("Authorization", "token "+c.Token)
}
req.Header.Set("Content-Type", "application/json")
if c.factory == nil {
c.factory = httpcli.NewFactory(
httpcli.NewMiddleware(
httpcli.ContextErrorMiddleware,
),
httpcli.NewMaxIdleConnsPerHostOpt(500),
httpcli.NewTimeoutOpt(graphQLTimeout),
// ExternalTransportOpt needs to be before TracedTransportOpt and
// NewCachedTransportOpt since it wants to extract a http.Transport,
// not a generic http.RoundTripper.
httpcli.ExternalTransportOpt,
httpcli.NewErrorResilientTransportOpt(
httpcli.NewRetryPolicy(httpcli.MaxRetries(graphQLRetryMaxAttempts)),
httpcli.ExpJitterDelay(graphQLRetryDelayBase, graphQLRetryDelayMax),
),
httpcli.TracedTransportOpt,
)
}
doer, err := c.factory.Doer()
if err != nil {
return nil, errors.Wrap(err, "Doer")
}
resp, err := doer.Do(req.WithContext(ctx))
if err != nil {
return nil, errors.Wrap(err, "Do")
}
defer resp.Body.Close()
data, err := ioutil.ReadAll(resp.Body)
if err != nil {
return nil, errors.Wrap(err, "ReadAll")
}
var errs struct {
Errors []interface{}
}
if err := json.Unmarshal(data, &errs); err != nil {
return nil, errors.Wrap(err, "Unmarshal errors")
}
if len(errs.Errors) > 0 {
return nil, fmt.Errorf("graphql error: %v", errs.Errors)
}
return data, nil
}
func strPtr(v string) *string {
return &v
}
func intPtr(v int) *int {
return &v
}


@@ -0,0 +1,60 @@
package main
import "github.com/sourcegraph/sourcegraph/lib/codeintel/lsif/protocol"
const gqlDocPageQuery = `
query DocumentationPage($repoName: String!, $pathID: String!) {
repository(name: $repoName) {
commit(rev: "HEAD") {
tree(path: "/") {
lsif {
documentationPage(pathID: $pathID) {
tree
}
}
}
}
}
}
`
type gqlDocPageVars struct {
RepoName string `json:"repoName"`
PathID string `json:"pathID"`
}
type gqlDocPageResponse struct {
Data struct {
Repository struct {
Commit struct {
Tree struct {
LSIF struct {
DocumentationPage struct {
Tree string
}
}
}
}
}
}
Errors []interface{}
}
// DocumentationNodeChild represents a child of a node.
type DocumentationNodeChild struct {
// Node is non-nil if this child is another (non-new-page) node.
Node *DocumentationNode `json:"node,omitempty"`
// PathID is a non-empty string if this child is itself a new page.
PathID string `json:"pathID,omitempty"`
}
// DocumentationNode describes one node in a tree of hierarchical documentation.
type DocumentationNode struct {
// PathID is the path ID of this node itself.
PathID string `json:"pathID"`
Documentation protocol.Documentation `json:"documentation"`
Label protocol.MarkupContent `json:"label"`
Detail protocol.MarkupContent `json:"detail"`
Children []DocumentationNodeChild `json:"children"`
}


@@ -0,0 +1,48 @@
package main
const gqlDocPathInfoQuery = `
query DocumentationPathInfo($repoName: String!) {
repository(name: $repoName) {
commit(rev: "HEAD") {
tree(path: "/") {
lsif {
documentationPathInfo(pathID: "/")
}
}
}
}
}
`
type gqlDocPathInfoVars struct {
RepoName string `json:"repoName"`
}
type gqlDocPathInfoResponse struct {
Data struct {
Repository struct {
Commit struct {
Tree struct {
LSIF struct {
DocumentationPathInfo string
}
}
}
}
}
Errors []interface{}
}
// DocumentationPathInfoResult describes a single documentation page path, what is located there
// and what pages are below it.
type DocumentationPathInfoResult struct {
// The pathID for this page/entry.
PathID string `json:"pathID"`
// IsIndex tells if the page at this path is an empty index page whose only purpose is to describe
// all the pages below it.
IsIndex bool `json:"isIndex"`
// Children is a list of the children page paths immediately below this one.
Children []DocumentationPathInfoResult `json:"children"`
}


@@ -0,0 +1,105 @@
package main
const gqlDocReferencesQuery = `
query DocReferences(
$repoName: String!
$pathID: String!
$first: Int
$after: String
) {
repository(name: $repoName) {
commit(rev: "HEAD") {
tree(path: "/") {
lsif {
documentationReferences(pathID: $pathID, first: $first, after: $after) {
nodes {
resource {
repository {
name
url
}
commit {
oid
}
path
name
}
range {
start {
line
character
}
end {
line
character
}
}
url
}
pageInfo {
endCursor
hasNextPage
}
}
}
}
}
}
}
`
type gqlDocReferencesVars struct {
RepoName string `json:"repoName"`
PathID string `json:"pathID"`
First *int `json:"first,omitempty"`
After *string `json:"after,omitempty"`
}
type gqlDocReferencesResponse struct {
Data struct {
Repository struct {
Commit struct {
Tree struct {
LSIF struct {
DocumentationReferences struct {
Nodes []DocumentationReference
PageInfo struct {
EndCursor *string
HasNextPage bool
}
}
DocumentationPage struct {
Tree string
}
}
}
}
}
}
Errors []interface{}
}
type DocumentationReference struct {
Resource struct {
Repository struct {
Name string
URL string
}
Commit struct {
OID string
}
Path string
Name string
}
Range struct {
Start struct {
Line int
Character int
}
End struct {
Line int
Character int
}
}
URL string
}


@@ -0,0 +1,85 @@
package main
const gqlLSIFIndexesQuery = `
query LsifIndexes($state: LSIFIndexState, $first: Int, $after: String, $query: String) {
lsifIndexes(state: $state, first: $first, after: $after, query: $query) {
nodes {
...LsifIndexFields
}
totalCount
pageInfo {
endCursor
hasNextPage
}
}
}
fragment LsifIndexFields on LSIFIndex {
__typename
id
inputCommit
inputRoot
inputIndexer
projectRoot {
url
path
repository {
url
name
stars
}
commit {
url
oid
abbreviatedOID
}
}
state
failure
queuedAt
startedAt
finishedAt
placeInQueue
associatedUpload {
id
state
uploadedAt
startedAt
finishedAt
placeInQueue
}
}
`
type gqlLSIFIndexesVars struct {
State *string `json:"state"`
First *int `json:"first"`
After *string `json:"after"`
Query *string `json:"query"`
}
type gqlLSIFIndex struct {
InputIndexer string
ProjectRoot struct {
URL string
Repository struct {
URL string
Name string
Stars uint64
}
}
}
type gqlLSIFIndexesResponse struct {
Data struct {
LsifIndexes struct {
Nodes []gqlLSIFIndex
TotalCount uint64
PageInfo struct {
EndCursor *string
HasNextPage bool
}
}
}
Errors []interface{}
}

cmd/sitemap/main.go (new file, 434 lines)

@@ -0,0 +1,434 @@
package main
import (
"compress/gzip"
"context"
"encoding/json"
"fmt"
"os"
"path/filepath"
"sort"
"strconv"
"strings"
"sync"
"time"
"github.com/cockroachdb/errors"
"github.com/inconshreveable/log15"
"github.com/snabb/sitemap"
"github.com/sourcegraph/sourcegraph/lib/codeintel/lsif/protocol"
)
func main() {
gen := &generator{
graphQLURL: "https://sourcegraph.com/.api/graphql",
token: os.Getenv("SRC_ACCESS_TOKEN"),
outDir: "sitemap/",
queryDatabase: "sitemap_query.db",
progressUpdates: 10 * time.Second,
}
if err := gen.generate(context.Background()); err != nil {
log15.Error("failed to generate", "error", err)
os.Exit(-1)
}
log15.Info("generated sitemap", "out", gen.outDir)
}
type generator struct {
graphQLURL string
token string
outDir string
queryDatabase string
progressUpdates time.Duration
db *queryDatabase
gqlClient *graphQLClient
}
// generate generates the sitemap files to the specified directory.
func (g *generator) generate(ctx context.Context) error {
if err := os.MkdirAll(g.outDir, 0700); err != nil {
return errors.Wrap(err, "MkdirAll")
}
if err := os.MkdirAll(filepath.Dir(g.queryDatabase), 0700); err != nil {
return errors.Wrap(err, "MkdirAll")
}
// The query database caches our GraphQL queries across multiple runs, as well as allows us to
// update the sitemap to include new repositories / pages without re-querying everything which
// would be very expensive. It's a simple on-disk key-value store (bbolt).
var err error
g.db, err = openQueryDatabase(g.queryDatabase)
if err != nil {
return errors.Wrap(err, "openQueryDatabase")
}
defer g.db.close()
g.gqlClient = &graphQLClient{
URL: g.graphQLURL,
Token: g.token,
}
// Provide ability to clear specific cache keys (i.e. specific types of GraphQL requests.)
clearCacheKeys := strings.Fields(os.Getenv("CLEAR_CACHE_KEYS"))
if len(clearCacheKeys) > 0 {
for _, key := range clearCacheKeys {
log15.Info("clearing cache key", "key", key)
if err := g.db.delete(key); err != nil {
log15.Info("failed to clear cache key", "key", key, "error", err)
}
}
}
listCacheKeys, _ := strconv.ParseBool(os.Getenv("LIST_CACHE_KEYS"))
if listCacheKeys {
keys, err := g.db.keys()
if err != nil {
log15.Info("failed to list cache keys", "error", err)
}
for _, key := range keys {
log15.Info("listing cache keys", "key", key)
}
}
// Build a set of Go repos that have LSIF indexes.
indexedGoRepos := map[string][]gqlLSIFIndex{}
lastUpdate := time.Now()
queried := 0
if err := g.eachLsifIndex(ctx, func(each gqlLSIFIndex, total uint64) error {
if time.Since(lastUpdate) >= g.progressUpdates {
lastUpdate = time.Now()
log15.Info("progress: discovered LSIF indexes", "n", queried, "of", total)
}
queried++
if strings.Contains(each.InputIndexer, "lsif-go") {
repoName := each.ProjectRoot.Repository.Name
indexedGoRepos[repoName] = append(indexedGoRepos[repoName], each)
}
return nil
}); err != nil {
return err
}
// Fetch documentation path info for each chosen repo with LSIF indexes.
var (
pagesByRepo = map[string][]string{}
totalPages = 0
totalStars uint64
missingAPIDocs = 0
)
lastUpdate = time.Now()
queried = 0
for repoName, indexes := range indexedGoRepos {
if time.Since(lastUpdate) >= g.progressUpdates {
lastUpdate = time.Now()
log15.Info("progress: discovered API docs pages for repo", "n", queried, "of", len(indexedGoRepos))
}
totalStars += indexes[0].ProjectRoot.Repository.Stars
pathInfo, err := g.fetchDocPathInfo(ctx, gqlDocPathInfoVars{RepoName: repoName})
queried++
if pathInfo == nil || (err != nil && strings.Contains(err.Error(), "page not found")) {
//log15.Error("no API docs pages found", "repo", repoName, "pathInfo==nil", pathInfo == nil, "error", err)
if err != nil {
missingAPIDocs++
}
continue
}
if err != nil {
return errors.Wrap(err, "fetchDocPathInfo")
}
var walk func(node DocumentationPathInfoResult)
walk = func(node DocumentationPathInfoResult) {
pagesByRepo[repoName] = append(pagesByRepo[repoName], node.PathID)
for _, child := range node.Children {
walk(child)
}
}
walk(*pathInfo)
totalPages += len(pagesByRepo[repoName])
}
// Fetch all documentation pages.
queried = 0
unexpectedMissingPages := 0
var docsSubPagesByRepo [][2]string
for repoName, pagePathIDs := range pagesByRepo {
for _, pathID := range pagePathIDs {
page, err := g.fetchDocPage(ctx, gqlDocPageVars{RepoName: repoName, PathID: pathID})
if page == nil || (err != nil && strings.Contains(err.Error(), "page not found")) {
log15.Error("unexpected: API docs page missing after reportedly existing", "repo", repoName, "pathID", pathID, "error", err)
unexpectedMissingPages++
continue
}
if err != nil {
return err
}
queried++
if time.Since(lastUpdate) >= g.progressUpdates {
lastUpdate = time.Now()
log15.Info("progress: got API docs page", "n", queried, "of", totalPages)
}
var walk func(node *DocumentationNode)
walk = func(node *DocumentationNode) {
goodDetail := len(node.Detail.String()) > 100
goodTags := !nodeIsExcluded(node, protocol.TagPrivate)
if goodDetail && goodTags {
docsSubPagesByRepo = append(docsSubPagesByRepo, [2]string{repoName, node.PathID})
}
for _, child := range node.Children {
if child.Node != nil {
walk(child.Node)
}
}
}
walk(page)
}
}
var (
mu sync.Mutex
docsSubPages []string
workers = 300
index = 0
subPagesWithZeroReferences = 0
subPagesWithOneOrMoreExternalReference = 0
)
queried = 0
for i := 0; i < workers; i++ {
go func() {
for {
mu.Lock()
if index >= len(docsSubPagesByRepo) {
mu.Unlock()
return
}
pair := docsSubPagesByRepo[index]
repoName, pathID := pair[0], pair[1]
index++
if time.Since(lastUpdate) >= g.progressUpdates {
lastUpdate = time.Now()
log15.Info("progress: got API docs usage examples", "n", index, "of", len(docsSubPagesByRepo))
}
mu.Unlock()
references, err := g.fetchDocReferences(ctx, gqlDocReferencesVars{
RepoName: repoName,
PathID: pathID,
First: intPtr(3),
})
if err != nil {
log15.Error("unexpected: error getting references", "repo", repoName, "pathID", pathID, "error", err)
} else {
refs := references.Data.Repository.Commit.Tree.LSIF.DocumentationReferences.Nodes
if len(refs) >= 1 {
externalReferences := 0
for _, ref := range refs {
if ref.Resource.Repository.Name != repoName {
externalReferences++
}
}
// TODO(apidocs): it would be great if more repos had external usage examples. In practice though, less than 2%
// do today. This is because we haven't indexed many repos yet.
if externalReferences > 0 {
subPagesWithOneOrMoreExternalReference++
}
mu.Lock()
docsSubPages = append(docsSubPages, repoName+"/-/docs"+pathID)
mu.Unlock()
} else {
subPagesWithZeroReferences++
}
}
}
}()
}
for {
time.Sleep(1 * time.Second)
mu.Lock()
if index >= len(docsSubPagesByRepo) {
mu.Unlock()
break
}
mu.Unlock()
}
log15.Info("found Go API docs pages", "count", totalPages)
log15.Info("found Go API docs sub-pages", "count", len(docsSubPages))
log15.Info("Go API docs sub-pages with 1+ external reference", "count", subPagesWithOneOrMoreExternalReference)
log15.Info("Go API docs sub-pages with 0 references", "count", subPagesWithZeroReferences)
log15.Info("spanning", "repositories", len(indexedGoRepos), "stars", totalStars)
log15.Info("Go repos missing API docs", "count", missingAPIDocs)
sort.Strings(docsSubPages)
var (
sitemapIndex = sitemap.NewSitemapIndex()
addedURLs = 0
sm = sitemap.New()
sitemaps []*sitemap.Sitemap
)
for _, docSubPage := range docsSubPages {
if addedURLs >= 50000 {
addedURLs = 0
url := &sitemap.URL{
Loc: fmt.Sprintf("https://storage.googleapis.com/sitemap-sourcegraph-com/sitemap_%03d.xml.gz", len(sitemaps)),
ChangeFreq: sitemap.Weekly,
Priority: float32(999 - len(sitemaps)),
}
sitemapIndex.Add(url)
sitemaps = append(sitemaps, sm)
sm = sitemap.New()
}
addedURLs++
sm.Add(&sitemap.URL{
Loc: "https://sourcegraph.com" + docSubPage,
ChangeFreq: sitemap.Weekly,
})
}
sitemaps = append(sitemaps, sm)
{
outFile, err := os.Create(filepath.Join(g.outDir, "sitemap.xml.gz"))
if err != nil {
return errors.Wrap(err, "failed to create sitemap.xml.gz file")
}
defer outFile.Close()
writer := gzip.NewWriter(outFile)
defer writer.Close()
_, err = sitemapIndex.WriteTo(writer)
if err != nil {
return errors.Wrap(err, "failed to write sitemap.xml.gz")
}
}
for index, sitemap := range sitemaps {
fileName := fmt.Sprintf("sitemap_%03d.xml.gz", index)
outFile, err := os.Create(filepath.Join(g.outDir, fileName))
if err != nil {
return errors.Wrap(err, fmt.Sprintf("failed to create %s file", fileName))
}
defer outFile.Close()
writer := gzip.NewWriter(outFile)
defer writer.Close()
_, err = sitemap.WriteTo(writer)
if err != nil {
return errors.Wrap(err, fmt.Sprintf("failed to write %s", fileName))
}
}
log15.Info("To upload the sitemap, use: $ gsutil cp -r sitemap/ gs://sitemap-sourcegraph-com")
return nil
}
func (g *generator) eachLsifIndex(ctx context.Context, each func(index gqlLSIFIndex, total uint64) error) error {
var (
hasNextPage = true
cursor *string
)
for hasNextPage {
retries := 0
retry:
lsifIndexes, err := g.fetchLsifIndexes(ctx, gqlLSIFIndexesVars{
State: strPtr("COMPLETED"),
First: intPtr(5000),
After: cursor,
})
if err != nil {
retries++
if maxRetries := 10; retries < maxRetries {
log15.Error("error listing LSIF indexes", "retry", retries, "of", maxRetries, "error", err)
goto retry
}
return err
}
for _, index := range lsifIndexes.Data.LsifIndexes.Nodes {
if err := each(index, lsifIndexes.Data.LsifIndexes.TotalCount); err != nil {
return err
}
}
hasNextPage = lsifIndexes.Data.LsifIndexes.PageInfo.HasNextPage
cursor = lsifIndexes.Data.LsifIndexes.PageInfo.EndCursor
}
return nil
}
func (g *generator) fetchLsifIndexes(ctx context.Context, vars gqlLSIFIndexesVars) (*gqlLSIFIndexesResponse, error) {
data, err := g.db.request(requestKey{RequestName: "LsifIndexes", Vars: vars}, func() ([]byte, error) {
return g.gqlClient.requestGraphQL(ctx, "SitemapLsifIndexes", gqlLSIFIndexesQuery, vars)
})
if err != nil {
return nil, err
}
var resp gqlLSIFIndexesResponse
return &resp, json.Unmarshal(data, &resp)
}
func (g *generator) fetchDocPathInfo(ctx context.Context, vars gqlDocPathInfoVars) (*DocumentationPathInfoResult, error) {
data, err := g.db.request(requestKey{RequestName: "DocPathInfo", Vars: vars}, func() ([]byte, error) {
return g.gqlClient.requestGraphQL(ctx, "SitemapDocPathInfo", gqlDocPathInfoQuery, vars)
})
if err != nil {
return nil, err
}
var resp gqlDocPathInfoResponse
if err := json.Unmarshal(data, &resp); err != nil {
return nil, errors.Wrap(err, "Unmarshal GraphQL response")
}
payload := resp.Data.Repository.Commit.Tree.LSIF.DocumentationPathInfo
if payload == "" {
return nil, nil
}
var result DocumentationPathInfoResult
if err := json.Unmarshal([]byte(payload), &result); err != nil {
return nil, errors.Wrap(err, "Unmarshal DocumentationPathInfoResult")
}
return &result, nil
}
func (g *generator) fetchDocPage(ctx context.Context, vars gqlDocPageVars) (*DocumentationNode, error) {
data, err := g.db.request(requestKey{RequestName: "DocPage", Vars: vars}, func() ([]byte, error) {
return g.gqlClient.requestGraphQL(ctx, "SitemapDocPage", gqlDocPageQuery, vars)
})
if err != nil {
return nil, err
}
var resp gqlDocPageResponse
if err := json.Unmarshal(data, &resp); err != nil {
return nil, errors.Wrap(err, "Unmarshal GraphQL response")
}
payload := resp.Data.Repository.Commit.Tree.LSIF.DocumentationPage.Tree
if payload == "" {
return nil, nil
}
var result DocumentationNode
if err := json.Unmarshal([]byte(payload), &result); err != nil {
return nil, errors.Wrap(err, "Unmarshal DocumentationNode")
}
return &result, nil
}
func (g *generator) fetchDocReferences(ctx context.Context, vars gqlDocReferencesVars) (*gqlDocReferencesResponse, error) {
data, err := g.db.request(requestKey{RequestName: "DocReferences", Vars: vars}, func() ([]byte, error) {
return g.gqlClient.requestGraphQL(ctx, "SitemapDocReferences", gqlDocReferencesQuery, vars)
})
if err != nil {
return nil, err
}
var resp gqlDocReferencesResponse
return &resp, json.Unmarshal(data, &resp)
}
func nodeIsExcluded(node *DocumentationNode, excludingTags ...protocol.Tag) bool {
for _, tag := range node.Documentation.Tags {
for _, excludedTag := range excludingTags {
if tag == excludedTag {
return true
}
}
}
return false
}


@@ -0,0 +1,112 @@
package main
import (
"encoding/json"
"time"
"github.com/cockroachdb/errors"
"go.etcd.io/bbolt"
bolt "go.etcd.io/bbolt"
)
type requestKey struct {
RequestName string
Vars interface{}
}
type requestValue struct {
Time time.Time
Response []byte
}
// queryDatabase is a bolt DB key-value store which contains all of the GraphQL queries and
// responses that we need to make in order to generate the sitemap. This is basically just a
// glorified HTTP query disk cache.
type queryDatabase struct {
handle *bolt.DB
}
// request performs a request to fetch `key`. If it already exists in the cache, the cached value
// is returned. Otherwise, fetch is invoked and the result is stored and returned if not an error.
func (db *queryDatabase) request(key requestKey, fetch func() ([]byte, error)) ([]byte, error) {
// Our key (i.e. the info needed to perform the request) will be the key in our bucket, as a
// JSON string.
keyBytes, err := json.Marshal(key)
if err != nil {
return nil, errors.Wrap(err, "Marshal")
}
// Check if the bucket already has the request response or not.
var value []byte
err = db.handle.View(func(tx *bolt.Tx) error {
bucket := tx.Bucket([]byte("request-" + key.RequestName))
if bucket != nil {
value = bucket.Get(keyBytes)
}
return nil
})
if err != nil {
return nil, errors.Wrap(err, "View")
}
if value != nil {
var rv requestValue
if err := json.Unmarshal(value, &rv); err != nil {
return nil, errors.Wrap(err, "Unmarshal")
}
return value, nil
}
// Fetch and store the result.
result, err := fetch()
if err != nil {
return nil, errors.Wrap(err, "fetch")
}
err = db.handle.Update(func(tx *bolt.Tx) error {
bucket, err := tx.CreateBucketIfNotExists([]byte("request-" + key.RequestName))
if err != nil {
return errors.Wrap(err, "CreateBucketIfNotExists")
}
return bucket.Put(keyBytes, result)
})
if err != nil {
return nil, errors.Wrap(err, "Update")
}
return result, nil
}
// keys returns a list of all bucket names, e.g. distinct GraphQL query types.
func (db *queryDatabase) keys() ([]string, error) {
var keys []string
if err := db.handle.View(func(tx *bolt.Tx) error {
return tx.ForEach(func(name []byte, b *bbolt.Bucket) error {
keys = append(keys, string(name))
return nil
})
}); err != nil {
return nil, err
}
return keys, nil
}
// delete deletes the bucket with the given key, e.g. a distinct GraphQL query type.
func (db *queryDatabase) delete(key string) error {
return db.handle.Update(func(tx *bolt.Tx) error {
return tx.DeleteBucket([]byte(key))
})
}
func (db *queryDatabase) close() error {
return db.handle.Close()
}
func openQueryDatabase(path string) (*queryDatabase, error) {
db := &queryDatabase{}
var err error
db.handle, err = bolt.Open(path, 0666, nil)
if err != nil {
return nil, errors.Wrap(err, "bolt.Open")
}
return db, nil
}

cmd/sitemap/run.sh (new executable file, 12 lines)

@@ -0,0 +1,12 @@
#!/usr/bin/env bash
cd "$(dirname "${BASH_SOURCE[0]}")"/../..
set -ex
ulimit -n 10000
export CGO_ENABLED=0
mkdir -p .bin
go build -o .bin/sitemap-generator ./cmd/sitemap
LIST_CACHE_KEYS=true ./.bin/sitemap-generator

go.mod (2 changes)

@@ -144,6 +144,7 @@ require (
github.com/shurcooL/highlight_go v0.0.0-20191220051317-782971ddf21b // indirect
github.com/shurcooL/httpgzip v0.0.0-20190720172056-320755c1c1b0
github.com/shurcooL/octicon v0.0.0-20191102190552-cbb32d6a785c // indirect
github.com/snabb/sitemap v1.0.0
github.com/sourcegraph/annotate v0.0.0-20160123013949-f4cad6c6324d // indirect
github.com/sourcegraph/batch-change-utils v0.0.0-20210708162152-c9f35b905d94
github.com/sourcegraph/ctxvfs v0.0.0-20180418081416-2b65f1b1ea81
@@ -178,6 +179,7 @@ require (
github.com/xeonx/timeago v1.0.0-rc4
github.com/xhit/go-str2duration/v2 v2.0.0
github.com/zenazn/goji v1.0.1 // indirect
go.etcd.io/bbolt v1.3.6
go.uber.org/atomic v1.9.0
go.uber.org/automaxprocs v1.4.0
go.uber.org/ratelimit v0.2.0

go.sum (7 changes)

@@ -1391,6 +1391,10 @@ github.com/sirupsen/logrus v1.7.0 h1:ShrD1U9pZB12TX0cVy0DtePoCH97K8EtX+mg7ZARUtM
github.com/sirupsen/logrus v1.7.0/go.mod h1:yWOB1SBYBC5VeMP7gHvWumXLIWorT60ONWic61uBYv0=
github.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d/go.mod h1:OnSkiWE9lh6wB0YB77sQom3nweQdgAjqCqsofrRNTgc=
github.com/smartystreets/goconvey v1.6.4/go.mod h1:syvi0/a8iFYH4r/RixwvyeAJjdLS9QV7WQ/tjFTllLA=
github.com/snabb/diagio v1.0.0 h1:kovhQ1rDXoEbmpf/T5N2sUp2iOdxEg+TcqzbYVHV2V0=
github.com/snabb/diagio v1.0.0/go.mod h1:ZyGaWFhfBVqstGUw6laYetzeTwZ2xxVPqTALx1QQa1w=
github.com/snabb/sitemap v1.0.0 h1:7vJeNPAaaj7fQSRS3WYuJHzUjdnhLdSLLpvVtnhbzC0=
github.com/snabb/sitemap v1.0.0/go.mod h1:Id8uz1+WYdiNmSjEi4BIvL5UwNPYLsTHzRbjmDwNDzA=
github.com/snowflakedb/glog v0.0.0-20180824191149-f5055e6f21ce/go.mod h1:EB/w24pR5VKI60ecFnKqXzxX3dOorz1rnVicQTQrGM0=
github.com/snowflakedb/gosnowflake v1.3.5/go.mod h1:13Ky+lxzIm3VqNDZJdyvu9MCGy+WgRdYFdXp96UcLZU=
github.com/soheilhy/cmux v0.1.4/go.mod h1:IM3LyeVVIOuxMH7sFAkER9+bJ4dT7Ms6E4xg4kGIyLM=
@@ -1577,6 +1581,8 @@ github.com/zenazn/goji v1.0.1/go.mod h1:7S9M489iMyHBNxwZnk9/EHS098H4/F6TATF2mIxt
gitlab.com/nyarla/go-crypt v0.0.0-20160106005555-d9a5dc2b789b/go.mod h1:T3BPAOm2cqquPa0MKWeNkmOM5RQsRhkrwMWonFMN7fE=
go.etcd.io/bbolt v1.3.2/go.mod h1:IbVyRI1SCnLcuJnV2u8VeU0CEYM7e686BmAb1XKL+uU=
go.etcd.io/bbolt v1.3.3/go.mod h1:IbVyRI1SCnLcuJnV2u8VeU0CEYM7e686BmAb1XKL+uU=
go.etcd.io/bbolt v1.3.6 h1:/ecaJf0sk1l4l6V4awd65v2C3ILy7MSj+s/x1ADCIMU=
go.etcd.io/bbolt v1.3.6/go.mod h1:qXsaaIqmgQH0T+OPdb99Bf+PKfBBQVAdyD6TY9G8XM4=
go.etcd.io/etcd v0.0.0-20191023171146-3cf2f69b5738/go.mod h1:dnLIgRNXwCJa5e+c6mIZCrds/GIG4ncV9HhK5PX7jPg=
go.mongodb.org/mongo-driver v1.0.3/go.mod h1:u7ryQJ+DOzQmeO7zB6MHyr8jkEQvC8vH7qLUO4lqsUM=
go.mongodb.org/mongo-driver v1.1.0/go.mod h1:u7ryQJ+DOzQmeO7zB6MHyr8jkEQvC8vH7qLUO4lqsUM=
@@ -1840,6 +1846,7 @@ golang.org/x/sys v0.0.0-20200625212154-ddb9806d33ae/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.0.0-20200803210538-64077c9b5642/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200826173525-f9321e4c35a6/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200905004654-be1d3432aa8f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200923182605-d9f96fdee20d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201029080932-201ba4db2418/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=