# PostgreSQL
Sourcegraph stores most data in a
[PostgreSQL database](http://www.postgresql.org). Git repositories and
uploaded user content (e.g., image attachments in issues) are stored
on the filesystem.
## Version requirements
Our minimum supported version is `12`, which means you must use that version for development to ensure we don't rely on Postgres features introduced after it.
For Ubuntu 18.04, you will need to add a repository source. Use the
[PostgreSQL.org official repo and instructions.](https://www.postgresql.org/download/linux/ubuntu/)
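
To confirm which version your local server is running, you can ask it directly; a minimal check:

```sql
-- Show the version of the server you are connected to.
SHOW server_version;
```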
## Local setup
Sourcegraph assumes it has a dedicated PostgreSQL server, or at least that you
can make global configuration changes, such as changing the timezone. If you
need to use other settings for other databases, use a separate PostgreSQL
instance.
See the [Quickstart guide](../setup/quickstart.md).
## Migrations
Migrations are applied automatically at application startup; you
shouldn't need to run anything by hand. For full documentation, see
[migrations/README.md](https://github.com/sourcegraph/sourcegraph/blob/main/migrations/README.md).
## Style guide
Here is the preferred style going forward. Existing tables may be inconsistent with this style.
### Avoiding nullable columns
Use a `NOT NULL` constraint whenever possible to enforce having a value in every column. `NULL` values can easily introduce errors when not handled correctly, and for many fields it makes sense to always have a value anyway.
For example, a `"revision" text` column can use `NULL` to represent "no revision", or instead `""` (the empty string). On the other hand, it makes sense to represent a `"deleted_at" timestamp` field as `NULL`, meaning "this row has not been deleted".
When `NULL` fields are necessary, remember to use the [`Null*` types](https://golang.org/pkg/database/sql/#NullString) in Go when querying this data. Otherwise `row.Scan` will return an error when it encounters a `NULL` value.
```Go
var s sql.NullString
// The name column may be NULL, so scan into a sql.NullString.
err := db.QueryRow("SELECT name FROM foo WHERE id = $1", id).Scan(&s)
if err != nil {
	// handle the query error
}
if s.Valid {
	// use s.String
} else {
	// name was NULL
}
```
### Recommended columns for all tables
- `id`: auto increment primary key.
- `created_at`: not null, default `now()`; set when a row is first inserted and never updated after that.
- `updated_at`: not null, default `now()`; set when a row is first inserted and updated on every update.
- `deleted_at`: set to a non-null timestamp to indicate the row is deleted (called soft deleting). This is preferred over hard deleting data from our db (see the discussion below).
  - When querying the db, exclude rows with a non-null `deleted_at`.

The timestamps are useful for forensics if something goes wrong; they do not necessarily need to be used or exposed by our GraphQL APIs, though there is no harm in exposing them.
Example:
```sql
CREATE TABLE "widgets" (
"id" bigserial NOT NULL PRIMARY KEY,
"created_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
"updated_at" TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT now(),
"deleted_at" TIMESTAMP WITH TIME ZONE,
);
```
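
When querying such a table, remember to exclude soft-deleted rows. For example:

```sql
-- Exclude soft-deleted rows when reading.
SELECT id, created_at, updated_at
FROM widgets
WHERE deleted_at IS NULL;
```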
### Hard vs soft deletes
Definitions:
- A "hard" delete is when rows are deleted using `DELETE FROM table WHERE ...`
- A "soft" delete is when rows are deleted using `UPDATE table SET deleted_at = now() WHERE ...`

Hard deletes are hard to recover from if something goes wrong (application bug, bad migration, manual query, etc.). Recovery usually involves restoring from a backup, and it is hard to target only the data affected by the bad delete.

Soft deletes are easier to recover from once you determine what happened: find the affected rows and run `UPDATE table SET deleted_at = null WHERE ...`.
### Dealing with unique constraints
Soft deleting data has implications for unique constraints.
Consider a hypothetical schema:
```sql
CREATE TABLE "orgs" (
"id" serial NOT NULL PRIMARY KEY
);
CREATE TABLE "users" (
"id" serial NOT NULL PRIMARY KEY
);
CREATE TABLE "users_orgs" (
"id" serial NOT NULL PRIMARY KEY,
"user_id" integer NOT NULL,
"org_id" integer NOT NULL,
CONSTRAINT user_orgs_references_orgs
FOREIGN KEY (org_id)
REFERENCES orgs (id) ON DELETE RESTRICT,
CONSTRAINT users_references_users
FOREIGN KEY (user_id)
REFERENCES users (id) ON DELETE RESTRICT,
UNIQUE (user_id, org_id)
);
```
#### Hard delete case
Removing a user from an org deletes the corresponding row from `users_orgs`.
Adding a user inserts a row into `users_orgs`. If the user is already a member of the org, the insert fails the unique constraint.
If we wanted to keep a record of membership, it would need to be in a separate audit log table.
#### Soft delete case
Removing a user from an org sets a non-null timestamp on the `deleted_at` column for the row.
Adding a user to an org sets `deleted_at = null` if a record already exists for that combination of `user_id` and `org_id`; otherwise a new record is inserted.
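
With the unique constraint from the schema above (and assuming `users_orgs` also has the recommended `deleted_at` column), this can be done in a single statement; a minimal sketch:

```sql
-- Insert a membership, or "undelete" it if a row for this (user_id, org_id) already exists.
INSERT INTO users_orgs (user_id, org_id)
VALUES (1, 2)
ON CONFLICT (user_id, org_id)
DO UPDATE SET deleted_at = NULL;
```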
Alternatively, we could remove the unique constraint on `(user_id, org_id)` and always insert a new row when adding a user (after checking whether the user is already in the org). The table would then also function as an audit log.
The decision here can be made on a table by table basis.
### Use foreign keys
If you have a column that references another column in the database, add a foreign key constraint.
There are reasons not to use foreign keys at scale, but we are not at scale, and we can drop these constraints in the future if they become a problem.
### Don't cascade deletes
Foreign key constraints should not cascade deletes for a few reasons:
1. We don't want to accidentally delete a lot of data (either from our application, or from a manual query in prod).
2. If we ever add new tables that depend on other tables via foreign key, it is not necessarily the case that cascading the delete is correct for the new table. Explicit application code is better here.
3. If we ever get to the point of sharding the db, we will probably need to drop all foreign key constraints, so it would be great if we did not make our code depend on cascading delete behavior.

Instead of cascading deletes, applications should explicitly delete the rows that would otherwise have been deleted by the cascade.
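
For example, with the hypothetical schema above, deleting an org and its memberships (in the hard delete case) would be done explicitly; a sketch of the equivalent SQL:

```sql
BEGIN;
-- Explicitly delete dependent rows first, then the parent row.
DELETE FROM users_orgs WHERE org_id = 1;
DELETE FROM orgs WHERE id = 1;
COMMIT;
```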
### Table names
Tables are plural (e.g. repositories, users, comments, etc.).
Join tables should be named based on the two tables being joined (e.g. `foo_bar` joins `foo` and `bar`).
### Validation
To the extent that certain fields require validation (e.g. username), we should perform that validation in the client **and either** in the database when possible **or** in the GraphQL API. This results in the best experience for the client and protects us from corrupt data.
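
For example, a database-level check on a hypothetical `users.username` column might look like this minimal sketch:

```sql
-- Hypothetical example: enforce a basic username format at the database level.
ALTER TABLE users
  ADD CONSTRAINT users_username_valid
  CHECK (username ~ '^[A-Za-z0-9][A-Za-z0-9_-]*$' AND char_length(username) <= 64);
```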
### Triggers
Because a trigger resides in the database and anyone who has the required privilege can use it, a trigger lets you write a set of SQL statements that multiple applications can use. It lets you avoid redundant code when multiple programs need to perform the same database operation.
Triggers are usually a good tool for:
- Computing derived column values automatically.
- Enforcing complex integrity constraints (e.g. when a faster CHECK constraint isn't powerful enough).
- Maintaining derived tables (e.g. creating an audit trail of activity in the database).

Triggers are often not a good tool for:
- Implementing complex application logic.
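
As a concrete example of the first good use above, here is a minimal sketch (with hypothetical names) of a trigger that keeps the recommended `updated_at` column current on every update:

```sql
-- Keep updated_at current on every UPDATE (hypothetical function and trigger names).
CREATE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
  NEW.updated_at = now();
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER widgets_set_updated_at
  BEFORE UPDATE ON widgets
  FOR EACH ROW
  EXECUTE FUNCTION set_updated_at();
```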
Whatever you end up doing, ensure that the affected code paths are tested appropriately. For instance, if you created a trigger that populates a column, you should test that a record that is written can be read back with the affected column set to what you expect.
A test like this ought to exercise your data access module, regardless of the design pattern you use to implement it (`Store`, `Repo`, `DAO`, `DB`, etc.). If you refactor your code to move some functionality from the trigger to the application code, or vice versa, your tests should not need to change.
Here's an example of how you could [structure such tests](https://github.com/sourcegraph/sourcegraph/blob/da3743ece358fbe6709f07e95d5fd97bd554e047/cmd/repo-updater/repos/integration_test.go#L17).
If you're uncertain about using triggers as part of your work, do some research before committing to a solution and don't hesitate to discuss it with your peers.