diff --git a/docs/about.md b/docs/about.md index d258c34..3ed651f 100644 --- a/docs/about.md +++ b/docs/about.md @@ -16,4 +16,4 @@ PgDog source code, so the whole community can benefit from your knowledge and ex ## Project name -This project is dedicated to the bestest dog in the world who's been patiently sitting at my feet the entire time PgDog has been developed. +This project is dedicated to the best dog in the world who's been patiently sitting at my feet the entire time PgDog has been developed. diff --git a/docs/architecture/benchmarks.md b/docs/architecture/benchmarks.md index fbad5ee..302ba1e 100644 --- a/docs/architecture/benchmarks.md +++ b/docs/architecture/benchmarks.md @@ -1,6 +1,6 @@ # Benchmarks -PgDog does its best to minimize its impact on database performance. Great care is taken to make sure as few operations are possible are performed +PgDog does its best to minimize its impact on database performance. Great care is taken to make sure as few operations as possible are performed when passing data between clients and servers. All benchmarks listed below were done on my local system (AMD Ryzen 7 5800X). Real world performance is impacted by factors like network speed, query complexity and especially by hardware used for running PgDog and PostgreSQL servers. diff --git a/docs/features/index.md b/docs/features/index.md index c8b72ed..9a7c052 100644 --- a/docs/features/index.md +++ b/docs/features/index.md @@ -3,27 +3,31 @@ PgDog contains multiple foundational and unique features which make it a great choice for modern PostgreSQL deployments. -Most features are configurable and can be toggled and tuned. Experimental features are marked -as such, and users are advised to test them before deploying to production. Most foundational features like -load balancing, healthchecks, and query routing have been battle-tested and work well in production. +Most features are configurable and can be toggled on/off and tuned to fit your environment. 
Experimental features are marked +as such, and users are advised to test them before deploying to production. + +Foundational features like load balancing, health checks, and query routing have been battle-tested and work well in production. ## Summary -Short summary of currently implemented features. +Short summary of currently implemented features below. | Feature | Description | |---------|-------------| -| [Load balancer](load-balancer/index.md) | Distribute `SELECT` queries evenly between replicas. Separate reads from writes with a single endpoint. | -| [Health checks](load-balancer/healthchecks.md) | Check databases are up and running, and can serve queries. | -| [Transaction mode](transaction-mode.md) | Share PostgreSQL connections between thousands of clients, a necessary feature for production deployments. | -| [Hot reload](../configuration/index.md) | Update configuration at runtime without restarting PgDog. | -| [Sharding](sharding/index.md) | Automatic query routing and logical replication between data nodes to scale PostgreSQL horizontally. | -| [Prepared statements](prepared-statements.md) | Support for Postgres named prepared statements. | -| [Plugins](plugins/index.md) | Pluggable libraries to add functionality to PgDog at runtime. | -| [Authentication](authentication.md) | Support for various PostgreSQL authentication mechanisms, like SCRAM. | +| [Load balancer](load-balancer/index.md) | Distribute `SELECT` queries evenly between replicas. Separate reads from writes, allowing applications to connect to a single endpoint. | +| [Health checks](load-balancer/healthchecks.md) | Check databases are up and running. Broken databases are blocked from serving queries. | +| [Transaction mode](transaction-mode.md) | Multiplex PostgreSQL server connections between thousands of clients. | +| [Hot reload](../configuration/index.md) | Update configuration at runtime without restarting the proxy. 
| +| [Sharding](sharding/index.md) | Automatic query routing and data migration between nodes to scale PostgreSQL horizontally. Schema management, distributed transactions. | +| [Prepared statements](prepared-statements.md) | Support for Postgres named prepared statements in transaction mode. | +| [Plugins](plugins/index.md) | Pluggable libraries to add functionality to PgDog at runtime, without recompiling code. | +| [Authentication](authentication.md) | Support for various PostgreSQL user authentication mechanisms, like SCRAM. | | [Session mode](session-mode.md) | Compatibility mode with direct PostgreSQL connections. | -| [Metrics](metrics.md) | Real time reporting, including Prometheus/OpenMetrics and admin database. | +| [Metrics](metrics.md) | Real time reporting, including Prometheus/OpenMetrics and an admin database. | +| [Mirroring](mirroring.md) | Copy queries from one database to another in the background. | +| [Pub/Sub](pub_sub.md) | Support for `LISTEN`/`NOTIFY` in transaction mode. | +| [Encryption](tls.md) | TLS encryption for client and server connections. | -## OS support +### OS support PgDog doesn't use any OS-specific features and should run on all systems supported by the Rust compiler, e.g. Linux (x86 and ARM64), Mac OS, and Windows. diff --git a/docs/features/load-balancer/healthchecks.md b/docs/features/load-balancer/healthchecks.md index b0e1f59..d1bd308 100644 --- a/docs/features/load-balancer/healthchecks.md +++ b/docs/features/load-balancer/healthchecks.md @@ -1,58 +1,74 @@ # Health checks -Databases proxied by PgDog are regularly checked with health checks. A health check is a simple query, e.g., -`SELECT 1`, that ensures the database is reachable and able to process queries. - -## How it works - -If a database fails a health check, it's placed in a list of banned hosts. Banned databases are removed -from the load balancer and will not serve queries. 
This allows PgDog to reduce errors clients see -when a database fails, for example due to hardware issues. +Databases proxied by PgDog's load balancer are regularly checked with health checks. A health check is a simple query, e.g., +`SELECT 1`, that ensures the database is reachable and able to handle requests. If a replica database fails a health check, +it's removed from the load balancer and prevented from serving additional queries for a configurable period of time.
- Healtchecks + Healthchecks
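The fail → ban → failsafe → expiry lifecycle described on this page can be sketched in a few lines of Python. This is an illustration only: PgDog is implemented in Rust, and the names below (`Replica`, `run_health_checks`, `available`) are invented for the sketch.

```python
BAN_TIMEOUT = 300.0  # seconds; mirrors the ban_timeout = 300_000 (ms) default

class Replica:
    """Hypothetical stand-in for a pooled database host."""
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.banned_at = None  # None means the replica is not banned

def run_health_checks(replicas, now):
    # A single failed health check bans the whole database.
    for r in replicas:
        if not r.healthy:
            r.banned_at = now
    # Failsafe: if every replica is banned, remove all bans so the
    # cluster keeps serving traffic through intermittent failures.
    if all(r.banned_at is not None for r in replicas):
        for r in replicas:
            r.banned_at = None

def available(replicas, now):
    # Bans expire automatically after BAN_TIMEOUT.
    for r in replicas:
        if r.banned_at is not None and now - r.banned_at >= BAN_TIMEOUT:
            r.banned_at = None
    return [r for r in replicas if r.banned_at is None]
```

A banned replica stops receiving traffic immediately, but reappears either when its ban expires or when the failsafe clears all bans at once.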
-### Checking connections +### Primary checks + +While all databases receive health checks, only replicas can be removed from the load balancer. If the primary fails a health check, it will continue to serve writes. This is because the cluster doesn't have an alternative place to route these requests and attempting the primary again has a higher chance of success than blocking queries outright. + +### Individual connections -In addition to checking databases, PgDog ensures that every connection in the pool is healthy on a regular basis. Before giving a connection to a client, PgDog will occasionally send the same simple query to the server, and if the query fails, ban the entire database from serving any more queries. +In addition to checking entire databases, the load balancer checks that every connection in the pool is healthy on a regular basis. Before giving a connection to a client, it will, from time to time, send a short query to the server, and if it fails, ban the entire database from serving any more requests. -To reduce the overhead of health checks, connection-specific checks are done infrequently, configurable via the `healtcheck_interval` setting: +To reduce the overhead of health checks, these connection-specific checks are done infrequently. This is configurable via the `healthcheck_interval` setting: ```toml [general] healthcheck_interval = 30_000 # Run a health check every 30 seconds ``` -Health checks are **enabled** by default. The default setting value is `30_000` (30 seconds). +The default value for this setting is `30_000` (30 seconds). + +### Configuring health checks + +Health checks are **enabled** by default. 
+ +If you want, you can effectively disable them by setting both `healthcheck_interval` and `idle_healthcheck_interval` settings to a very high value, for example: + +```toml +[general] +healthcheck_interval = 31557600000 # 1 year +idle_healthcheck_interval = 31557600000 +``` + +### Database bans -### Triggering bans +A single health check failure will prevent the entire database from serving traffic. In our documentation (and the code), we refer to this as a "ban". -A single health check failure will prevent the entire database from serving traffic. This may seem aggressive at first, but it reduces the error rate dramatically in heavily used production deployments. PostgreSQL is very reliable, so even a single query failure may indicate an issue with hardware or network connectivity. +This may seem aggressive at first, but it reduces the error rate dramatically in heavily used production deployments. PostgreSQL is very reliable, so even a single failure often indicates an issue with the hardware or network connectivity. #### Failsafe -To avoid health checks taking a database cluster offline, the load balancer has a built-in safety mechanism. If all replicas fail health checks, bans from all databases are removed and all databases are allowed to serve traffic again. This ensures that intermittent network failures don't impact database operations. Once the bans are removed, load balancing returns to its normal state. +To avoid health checks taking the whole database cluster offline, the load balancer has a built-in safety mechanism. If all replicas fail a health check, the bans from all databases in the cluster are removed. + +This makes sure that intermittent network failures don't impact database operations. Once the bans are removed, load balancing returns to its normal state. #### Ban expiration -Database bans have an expiration. Once the ban expires, the replica is unbanned and allowed to serve traffic again. 
This is done to maintain a healthy level of traffic across all databases and to allow for intermittent -issues, like network connectivity, to resolve themselves without manual intervention. +Database bans eventually expire and are removed automatically. Once this happens, the banned databases are allowed to serve traffic again. This is done to maintain a healthy level of traffic across all databases and to allow for intermittent issues, like network connectivity, to resolve themselves without manual intervention. -This behavior is controlled with the `ban_timeout` setting: +This behavior can be controlled with the `ban_timeout` setting: ```toml [general] ban_timeout = 300_000 # Expire bans automatically after 5 minutes ``` -The default value is `300_000` (5 minutes). +The default value for this setting is `300_000` (5 minutes). + +#### Health check timeout -### Health check timeout +By default, the load balancer gives the database **5 seconds** to answer a health check. If it doesn't receive a reply within that time frame, the database will be banned and removed from the load balancer. -By default, PgDog gives the database **5 seconds** to answer a health check. This is configurable with `healthcheck_timeout`. If PgDog doesn't receive a reply within that time frame, the database will be banned from serving traffic. 
+This is configurable with the `healthcheck_timeout` setting:

```toml
[global]
diff --git a/docs/features/load-balancer/index.md b/docs/features/load-balancer/index.md
index 4505444..c5aaf59 100644
--- a/docs/features/load-balancer/index.md
+++ b/docs/features/load-balancer/index.md
@@ -1,43 +1,42 @@
---
next_steps:
- - ["Health checks", "/features/healthchecks/", "Learn how PgDog ensures only healthy databases serve read queries."]
+ - ["Health checks", "/features/healthchecks/", "Learn how PgDog ensures only healthy databases are allowed to serve read queries."]
---

# Load balancer overview

PgDog operates at the application layer (OSI Level 7) and is capable of load balancing queries across
-multiple PostgreSQL replicas.
+multiple PostgreSQL replicas. This allows applications to connect to a single endpoint and spread traffic evenly between multiple databases.

## How it works

-When a query is sent to PgDog, it inspects it using a SQL parser. If the query is a read and PgDog configuration contains multiple databases, it will send that query to one of the replicas, spreading the query load evenly between all instances in the cluster.
+When a query is sent to PgDog, it inspects it using a SQL parser. If the query is a read and the configuration contains multiple databases, it will send that query to one of the replicas. This spreads the query load evenly between all database instances in the cluster.

-If the configuration contains the primary as well, PgDog will separate writes from reads and send writes to the primary, without requiring any application changes.
+If the config contains a primary, PgDog will split write queries from read queries and send writes to the primary, without requiring any application changes.
Load balancer
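The routing decision described above can be approximated in a short sketch. This is not PgDog's actual implementation (PgDog is written in Rust and uses a full SQL parser); the naive keyword matching below only illustrates the read/write split and the default **random** strategy.

```python
import random

# Hypothetical write detection: a real implementation parses the SQL.
WRITE_KEYWORDS = ("insert", "update", "delete", "begin", "copy", "alter", "create")

def route(query, primary, replicas):
    q = query.strip().lower()
    is_write = q.startswith(WRITE_KEYWORDS) or "for update" in q
    if is_write or not replicas:
        return primary  # writes always go to the primary
    # "random" strategy: a random number modulus the number of replicas
    return replicas[random.getrandbits(32) % len(replicas)]
```

For example, `route("SELECT * FROM users", "primary", ["replica_1", "replica_2"])` returns one of the replicas, while any `INSERT` or `SELECT ... FOR UPDATE` returns the primary.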
-### Strategies
+### Algorithms

-The PgDog load balancer is configurable and can route queries
-using one of several strategies:
+The load balancer is configurable and can route queries using one of the following strategies:

* Random (default)
* Least active connections
* Round robin

-Choosing a strategy depends on your query workload and the size of replica databases. Each strategy has its pros and cons. If you're not sure, using the **random** strategy is usually good enough
+Choosing the right strategy depends on your query workload and the size of replica databases. Each strategy has its pros and cons. If you're not sure, using the **random** strategy is usually good enough
for most deployments.

#### Random

-Queries are sent to a database based using a random number generator modulus the number of replicas in the pool.
+Queries are routed to a database based on a random number generator modulus the number of replicas in the pool.

This strategy is the simplest to understand and often effective at splitting traffic evenly across the cluster. It's unbiased
-and assumes nothing about available resources or query performance.
+and assumes nothing about available resources or individual query performance.

-This strategy is used by **default**.
+This algorithm is used by **default**.

##### Configuration

@@ -48,11 +47,10 @@ load_balancer_strategy = "random"

#### Least active connections

-PgDog keeps track of how many active connections each database has and can route queries to databases
-which are least busy executing requests. This allows to "bin pack" the cluster based on how seemingly active
-(or inactive) the databases are.
+PgDog keeps track of how many connections are active in each database and can route queries to databases
+which are less busy. This allows it to "bin pack" the cluster with workload.
-This strategy is useful when all databases have identical resources and all queries have roughly the same
+This algorithm is useful when all databases have identical resources and all queries have roughly the same
cost and runtime.

##### Configuration

@@ -64,11 +62,11 @@ load_balancer_strategy = "least_active_connections"

#### Round robin

-This strategy is often used in HTTP load balancers, like nginx, to route requests to hosts in the
-same order they appear in the configuration. Each database receives exactly one query before the next
+This strategy is often used in HTTP load balancers (e.g., nginx) to route requests to hosts in the
+same order as they appear in the configuration. Each database receives exactly one query before the next
one is used.

-This strategy makes the same assumptions as [least active connections](#least-active-connections), except it makes no attempt to bin pack the cluster with workload, and distributes queries evenly.
+This algorithm makes the same assumptions as [least active connections](#least-active-connections), except it makes no attempt to bin pack the cluster and distributes queries evenly.

##### Configuration

@@ -79,15 +77,17 @@ load_balancer_strategy = "round_robin"

## Reads and writes

-The load balancer can split reads (`SELECT` queries) from write queries. If PgDog detects that a query is _not_ a `SELECT`, like an `INSERT` or and `UPDATE`, that query will be sent to primary. This allows a PgDog deployment to proxy an entire PostgreSQL cluster without creating separate read and write endpoints.
+The load balancer can split reads (`SELECT` queries) from write queries. If it detects that a query is _not_ a `SELECT`, like an `INSERT` or an `UPDATE`, that query will be sent to the primary. This allows a deployment to proxy an entire PostgreSQL cluster without creating separate read and write endpoints.

-This strategy is effective most of the time, and PgDog also handles several edge cases. 
+This strategy is effective most of the time, and PgDog also handles several edge cases.

-#### Select for update

+### `SELECT FOR UPDATE`

-The most common one is `SELECT [...] FOR UDPATE` which locks rows for exclusive access. Much like the name suggests, the most common use case for this is to update the row, which is a write operation. PgDog will detect this and send the query to the primary instead.
+The most common edge case is `SELECT FOR UPDATE`, which locks rows for exclusive access. Much like the name suggests, it's often used to update the selected rows, which is a write operation.

-#### CTEs that write

+The load balancer detects this and will send the query to a primary instead of a replica.
+
+### CTEs

Some `SELECT` queries can trigger a write to the database from a CTE, for example:

@@ -98,11 +98,11 @@ WITH t AS (
SELECT * FROM users INNER JOIN t ON t.id = users.id
```
-PgDog will check all CTEs and if any of them contain queries that could write, it will send the entire query to the primary.
+The load balancer will check all CTEs and, if any of them contain queries that could write, it will route the entire query to a primary.

### Transactions

-All explicit transactions are routed to the primary. An explicit transaction is started by using the `BEGIN` statement, e.g.:
+All multi-statement transactions are routed to the primary. They are started by using the `BEGIN` command, e.g.:

```postgresql
BEGIN;
@@ -110,16 +110,21 @@ INSERT INTO users (email, created_at) VALUES ($1, NOW()) RETURNING *;
COMMIT;
```

-While more often used to atomically perform writes to multiple tables, transactions can also manually route read queries to the primary as to avoid having to handle replication lag for time-sensitive queries.
+While often used to atomically perform multiple changes, transactions can also be used to explicitly route read queries to a primary, so as to avoid having to handle replication lag.
+ +This is useful for time-sensitive workloads, like background jobs that have been triggered by a database change which hasn't propagated to all the replicas yet. !!! note - This is common with background jobs that get triggered after a row has been inserted by an HTTP controller. - The job queue is often configured to read data from a replica, which is a few milliseconds behind the primary and, unless specifically handled, could run into "record not found" errors. + This behavior often manifests with "record not found"-style errors, e.g.: + + ``` + ActiveRecord::RecordNotFound (Couldn't find User with 'id'=9999): + ``` ## Configuration -The load balancer is enabled automatically when cluster contains more than +The load balancer is **enabled** automatically when a database cluster contains more than one database, for example: ```toml @@ -134,9 +139,9 @@ role = "replica" host = "10.0.0.2" ``` -### Reads on the primary +### Allowing reads on the primary -By default, the primary is used for serving reads and writes. If you want to isolate your workloads and have your replicas serve all read queries, you can configure it like so: +By default, the primary is used for serving both reads and writes. If you want to isolate these workloads and have your replicas serve all read queries instead, you can configure it, like so: ```toml [general] diff --git a/docs/features/multi-tenancy.md b/docs/features/multi-tenancy.md index 653519e..4b73c2f 100644 --- a/docs/features/multi-tenancy.md +++ b/docs/features/multi-tenancy.md @@ -7,7 +7,7 @@ PgDog is a natural fit for multitenant databases. It allows to separate data usi There are two ways to enforce multitenancy rules with PgDog: 1. Physical multitenancy -2. Virtual multinancy +2. Virtual multitenancy **Physical** multitenancy separates data into multiple Postgres databases. 
In that scenario, it becomes very difficult for data from one tenant to make its way to another, providing a good layer of security and workload isolation between your customers. @@ -101,7 +101,7 @@ Much like Postgres partitions, the start of the range is included in the range w regularly with its status. The documentation is written in a way as to reflect the desired state of this feature, not how it currently works. -Virtual multinancy is a great option if your customers are small and can share the same compute. To make this work you have several options: +Virtual multitenancy is a great option if your customers are small and can share the same compute. To make this work you have several options: 1. Place each of your tenants data into their own Postgres schema 2. Add a column in every table identifying your tenants and make sure your app includes it in every query diff --git a/docs/features/sharding/copy.md b/docs/features/sharding/copy.md index 2424200..c5de6c3 100644 --- a/docs/features/sharding/copy.md +++ b/docs/features/sharding/copy.md @@ -1,6 +1,6 @@ # COPY -`COPY` is a special PostgreSQL command that ingests a file directly into a database table. This allows to ingest data faster than by using individual `INSERT` queries. +`COPY` is a special PostgreSQL command that ingests a file directly into a database table. This allows ingesting data faster than by using individual `INSERT` queries. PgDog supports parsing this command, sharding the file automatically, and splitting the data between shards, invisible to the application.
@@ -18,7 +18,7 @@ PgDog supports data sent via `COPY` formatted using any one of the 3 possible fo ### Expected syntax -`COPY` commands sent through PgDog should specify table columns explicitly. This allows itg to parse the data stream correctly, knowing which column is the sharding key. +`COPY` commands sent through PgDog should specify table columns explicitly. This allows it to parse the data stream correctly, knowing which column is the sharding key. Take the following example: diff --git a/docs/features/sharding/cross-shard.md b/docs/features/sharding/cross-shard.md index 3ed199f..f258a9f 100644 --- a/docs/features/sharding/cross-shard.md +++ b/docs/features/sharding/cross-shard.md @@ -43,7 +43,7 @@ to values in `DataRow`[^1] messages based on their position in the `RowDescripti For example, if the query specifies `ORDER BY id ASC, email DESC`, both `id` and `email` columns will be present in the `RowDescription` message along with their data types and locations in `DataRow`[^1] messages. -The rows are received asynchronously as the query is executing on the shards. Once the messages are buffered, PgDog will sort them using the extracted column values and return the sorted result to the client. +The rows are received asynchronously as the query is executing on the shards. Once the messages are buffered, PgDog will sort them using the extracted column values and return the sorted result to the client. #### Example @@ -117,7 +117,7 @@ GROUP BY email; ## Supported data types -Processing results in PgDog requires it to parse Postgres data types from the wire protocol. Postgres supports many data types and PgDog, currently, can only handle a some of them. Clients can request results to be encoded in either `text` or `binary` encoding and supporting both requires special handling as well. +Processing results in PgDog requires it to parse Postgres data types from the wire protocol. Postgres supports many data types and PgDog, currently, can only handle some of them. 
Clients can request results to be encoded in either `text` or `binary` encoding and supporting both requires special handling as well.

| Data type | Sorting | Aggregation | Text format | Binary format |
|-|-|-|-|-|
diff --git a/docs/features/sharding/index.md b/docs/features/sharding/index.md
index 6d03e51..7e019fe 100644
--- a/docs/features/sharding/index.md
+++ b/docs/features/sharding/index.md
@@ -22,7 +22,7 @@ When sharding hints are not present in a query, either accidentally or on purpos

### Manual routing

-If direct-to-shard queries are desired but the query doesn't have enough information to extract this information automatically, clients can specify which to which shard PgDog should route the query.
+If direct-to-shard queries are desired but the query doesn't contain enough information to determine the shard automatically, clients can specify to which shard PgDog should route the query.

[**→ Manual routing**](manual-routing.md)

diff --git a/docs/features/sharding/internals/logical-replication/index.md b/docs/features/sharding/internals/logical-replication/index.md
index 43ef679..2f0ace5 100644
--- a/docs/features/sharding/internals/logical-replication/index.md
+++ b/docs/features/sharding/internals/logical-replication/index.md
@@ -1,6 +1,6 @@
# Logical replication overview

-One of PgDog's most interested features is its ability to interpret the logical replication protocol used by Postgres to synchronize replicas. This allows PgDog to reroute data depending on which shard it should go to in a sharded cluster. Since logical replication is streaming data in real time, PgDog can move data between shards invisibly to the client and without database downtime.
Since logical replication is streaming data in real time, PgDog can move data between shards invisibly to the client and without database downtime. ## Logical replication internals diff --git a/docs/features/sharding/migrations.md b/docs/features/sharding/migrations.md index 848222d..41f3ce3 100644 --- a/docs/features/sharding/migrations.md +++ b/docs/features/sharding/migrations.md @@ -3,7 +3,7 @@ PgDog expects that all shards have, roughly, the same tables. A notable exception to this rule is partitioned tables, which can have different data tables on different shards. The parent tables should be present on all shards, however. -If a shard has different tables than another, [automatic](query-routing.md) query routing may no work as expected. +If a shard has different tables than another, [automatic](query-routing.md) query routing may not work as expected. ## How it works diff --git a/docs/features/sharding/primary-keys.md b/docs/features/sharding/primary-keys.md index dd60d01..6ffb2fe 100644 --- a/docs/features/sharding/primary-keys.md +++ b/docs/features/sharding/primary-keys.md @@ -5,7 +5,7 @@ If you're coming from unsharded Postgres, you're probably used to doing somethin ```postgresql CREATE TABLE users ( id BIGSERIAL PRIMARY KEY, - email VARCHAR, + email VARCHAR ); ``` diff --git a/docs/features/sharding/resharding/hash.md b/docs/features/sharding/resharding/hash.md index 4944c18..575a244 100644 --- a/docs/features/sharding/resharding/hash.md +++ b/docs/features/sharding/resharding/hash.md @@ -17,7 +17,7 @@ You can visualize this phenomenon with a bit of Python: [1, 2, 0, 1, 2] ``` -Since most rows will have to moved, resharding a cluster in-place would put a lot of load on an already overextended system. +Since most rows will have to be moved, resharding a cluster in-place would put a lot of load on an already overextended system. 
PgDog's strategy for resharding is to **move data** from an existing cluster to a completely new one, while rehashing the rows in-flight. This process is parallelizable and fast, and since most of the work is done by the new cluster, production databases are not affected. @@ -43,7 +43,7 @@ pgdog data-sync --help | `--from-database` | Name of the **source** database cluster. | `prod` | | `--from-user` | Name of the user configured in `users.toml` for the **source** database cluster. | `postgres` | | `--to-database` | Name of the **destination** database cluster. | `prod-sharded` | -| `--to-user` | Name of the user configured in `users.tom` for the **destination** database cluster. | `postgres` | +| `--to-user` | Name of the user configured in `users.toml` for the **destination** database cluster. | `postgres` | | `--publication` | Name of the Postgres [publication](https://www.postgresql.org/docs/current/sql-createpublication.html) for tables to be copied and sharded. It should exist on the **source** database. | `all_tables` | All databases and users must be configured in `pgdog.toml` and `users.toml`. diff --git a/docs/features/sharding/sharding-functions.md b/docs/features/sharding/sharding-functions.md index 0220279..ada20ac 100644 --- a/docs/features/sharding/sharding-functions.md +++ b/docs/features/sharding/sharding-functions.md @@ -32,7 +32,7 @@ column = "user_id" data_type = "bigint" ``` -All queries referencing the `user_id` column will be automatically sent the matching shard(s) and data in those tables will be split between all shards evenly. +All queries referencing the `user_id` column will be automatically sent to the matching shard(s) and data in those tables will be split between all shards evenly. 
diff --git a/docs/features/transaction-mode.md b/docs/features/transaction-mode.md
index 0126e0b..18cd8d9 100644
--- a/docs/features/transaction-mode.md
+++ b/docs/features/transaction-mode.md
@@ -1,30 +1,35 @@
# Transaction mode

-Transaction mode allows PgDog to share just a few PostgreSQL server connections with thousands of clients. This is required for at-scale production deployments where the number of clients is much higher than the number of available Postgres connections.
+Transaction mode allows PgDog to share just a few PostgreSQL server connections with thousands of clients. This is required for at-scale production deployments where the number of clients is much higher than the number of available connections to the database.

## How it works

-All queries served by PostgreSQL run inside transactions. Transactions can be started manually by executing a `BEGIN` statement. If a transaction is not started manually, each query sent to PostgreSQL is executed inside its own, automatic, transaction.
+All queries served by PostgreSQL run inside transactions. Transactions can be started manually by executing a `BEGIN` command, or automatically by running individual statements.

-PgDog takes advantage of this behavior and can separate client transactions inside client connections and send them, individually, to the first available PostgreSQL server in the connection pool.
+PgDog takes advantage of this behavior and can split up transactions inside client connections and send them, individually and in order, to the first available PostgreSQL server in the connection pool.
Load balancer
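The multiplexing described above can be modeled with a toy pool. This sketch is illustrative only (the class and method names are invented): a server connection is borrowed for exactly one transaction and returned as soon as it completes, rather than being held for the life of the client connection.

```python
from collections import deque

class TransactionPool:
    """Toy model of transaction-mode pooling."""

    def __init__(self, server_conns):
        self.idle = deque(server_conns)
        self.executed = []  # records (client, server_conn, statement)

    def run_transaction(self, client, statements):
        conn = self.idle.popleft()   # borrow the first available server connection
        for stmt in statements:      # statements run individually, in order
            self.executed.append((client, conn, stmt))
        self.idle.append(conn)       # released at transaction end, not at disconnect
```

With a single server connection, any number of clients can take turns: each transaction borrows and returns the same connection, which is how thousands of clients can share a handful of Postgres connections.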
-In practice, this allows thousands of client connections to use just one PostgreSQL server connection to execute queries. Most connection pools will have multiple server connections, so hundreds of thousands of clients can connect to PgDog and execute queries over just a handful of PostgreSQL server connections. +In practice, this allows thousands of client connections to re-use just one PostgreSQL server connection. Most pools will have several server connections, so 100,000s of clients can use the pooler to execute queries without exceeding the database connection limit. +## Enabling transaction mode -### Enabling transaction mode +Transaction mode is **enabled** by default. This is controllable via configuration, at the global, user and database levels: -Transaction mode is **enabled** by default. This is controllable via configuration, at the global -and user level: - -=== "pgdog.toml" +=== "pgdog.toml (global)" ```toml [general] pooler_mode = "transaction" ``` +=== "pgdog.toml (database)" + ```toml + [[databases]] + name = "prod" + host = "127.0.0.1" + pooler_mode = "transaction" + ``` === "users.toml" ```toml [[users]] @@ -33,13 +38,15 @@ and user level: pooler_mode = "transaction" ``` -### Session-level state +## Session-level state -Clients can set session-level variables, e.g., by passing them in connection parameters or using the `SET` command. This works fine when connecting to Postgres directly, but PgDog shares server -connections between multiple clients. To avoid session-level state leaking between clients, PgDog tracks connection parameters for each client and updates connection settings before -giving a connection to a client. +Clients can set session-level variables, e.g., by passing them in connection parameters or using the `SET` command. This works fine when connecting to Postgres directly, but transaction poolers share server connections between multiple clients. 
-#### Specifying connection parameters +To avoid session-level state leaking between clients, PgDog tracks connection parameters for each client and updates connection settings before giving a connection to each client. + +This is performed efficiently, and server parameters are updated only if they differ from the ones set on the client. + +### Specifying connection parameters Most Postgres connection drivers support passing parameters in the connection URL. Using the special `options` setting, each parameter is set using the `-c` flag, for example: @@ -48,20 +55,19 @@ postgres://user@host:6432/db?options=-c%20statement_timeout%3D3s ``` This sets the `statement_timeout` setting to `3s` (3 seconds). Each time this client -executes a transaction, PgDog will check the value for `statement_timeout` on the server connection, -and if it differs, issue a command to Postgres to update it, e.g.: +executes a transaction, the pooler will check the value for `statement_timeout` on the server connection, +and if it differs, issue a command to Postgres to update it: ```postgresql SET statement_timeout TO '3s'; ``` - -#### Tracking `SET` commands +### Tracking `SET` commands If the client manually changes server settings, i.e., by issuing `SET` commands, the server will send the updated setting -in a `ParameterStatus` message. PgDog will see this message and update client connection parameters accordingly, as to avoid +in a `ParameterStatus` message. The pooler will see this message and update client connection parameters accordingly, so as to avoid issuing unnecessary `SET` statements on subsequent transactions. -#### Impact on latency +### Impact on latency -PgDog keeps a real-time mapping for servers and their parameters, so checking the current value for any parameter doesn't require PgDog to talk to the database. Additionally, it's typically expected that applications have similar connection parameters, so PgDog won't have to synchronize parameters frequently. 
+PgDog keeps a real-time mapping of servers and their parameters, so checking the current value for any parameter doesn't require the pooler to talk to the database. Additionally, it's typically expected that applications have similar connection parameters, so the pooler won't have to synchronize parameters frequently. diff --git a/docs/images/healtchecks.png b/docs/images/healthchecks.png similarity index 100% rename from docs/images/healtchecks.png rename to docs/images/healthchecks.png diff --git a/docs/index.md b/docs/index.md index d452328..1948613 100644 --- a/docs/index.md +++ b/docs/index.md @@ -32,7 +32,7 @@ This documentation provides a detailed overview of all PgDog features, along wit

Administration

-

Learn how to operate PgDog in production, like fetching real time statistics from the admin database or updating configuration.

+

Learn how to operate PgDog in production, like fetching real-time statistics from the admin database or updating configuration.
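A sketch of what fetching real-time statistics looks like, assuming PgDog's pgbouncer-style admin interface (the `admin` database name and the exact `SHOW` command set are assumptions here; check the administration docs for the commands your version supports):

```postgresql
-- Connect with psql, e.g.: psql "postgres://admin@127.0.0.1:6432/admin"
SHOW POOLS;    -- connection pool state for each user/database pair
SHOW CLIENTS;  -- currently connected client sessions
SHOW SERVERS;  -- PostgreSQL server connections and their state
```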

Installation

diff --git a/docs/installation.md b/docs/installation.md index 56940d5..803192b 100644 --- a/docs/installation.md +++ b/docs/installation.md @@ -3,17 +3,16 @@ ## Kubernetes -PgDog comes with its own [Helm chart](https://github.com/pgdogdev/helm). You can install it from git: +PgDog comes with its own [Helm chart](https://github.com/pgdogdev/helm). You can install it directly from our chart repository: ```bash -git clone https://github.com/pgdogdev/helm pgdog-helm && \ -cd pgdog-helm && -helm install -f values.yaml pgdog ./ +helm repo add pgdogdev https://helm.pgdog.dev +helm install pgdog pgdogdev/pgdog ``` ## Docker -Docker images are built automatically for each commit in GitHub. You can fetch them directly from the [repository](https://github.com/pgdogdev/pgdog/pkgs/container/pgdog): +Docker images are built automatically for each commit to the `main` branch in [GitHub](https://github.com/pgdogdev/pgdog/pkgs/container/pgdog): ```bash docker run ghcr.io/pgdogdev/pgdog:main @@ -21,56 +20,55 @@ docker run ghcr.io/pgdogdev/pgdog:main ## From source -PgDog is easily compiled from source. For production deployments, a `Dockerfile` is provided in the [repository](https://github.com/pgdogdev/pgdog/tree/main/Dockerfile). If you prefer to deploy on bare metal or you're looking to run PgDog locally, you'll need to install a few dependencies. +PgDog can be easily compiled from source. For production deployments, a `Dockerfile` is provided in [GitHub](https://github.com/pgdogdev/pgdog/tree/main/Dockerfile). If you prefer to deploy on bare metal or you're looking to run PgDog locally, you'll need to install a few dependencies. ### Dependencies Parts of PgDog depend on C/C++ libraries, which are compiled from source. Make sure to have a working version of a C/C++ compiler installed. 
-#### Mac OS +=== "Mac OS" + Install [XCode](https://developer.apple.com/xcode/) from the App Store and CMake & Clang from brew: -Install [XCode](https://developer.apple.com/xcode/) from the App Store and CMake from brew: + ```bash + brew install cmake llvm + ``` -```bash -brew install cmake -``` +=== "Ubuntu" -#### Ubuntu + Install Clang and CMake: -Install Clang and CMake: + ```bash + sudo apt update && \ + sudo apt install -y cmake clang curl pkg-config libssl-dev git build-essential + ``` -```bash -sudo apt update && \ -apt install -y cmake clang curl pkg-config libssl-dev git build-essential -``` +=== "Arch" -#### Arch Linux + Install Clang and CMake: -Install Clang and CMake: + ```bash + sudo pacman -Syu base-devel clang cmake git + ``` -```bash -sudo pacman -Syu base-devel clang cmake git -``` - -#### Windows +=== "Windows" -Install [Visual Studio Community Edition](https://visualstudio.microsoft.com/vs/community/). -Make sure to include CMake in the installation. + Install [Visual Studio Community Edition](https://visualstudio.microsoft.com/vs/community/). + Make sure to include CMake in the installation. -### Rust compiler +#### Rust compiler -Since PgDog is written in Rust, make sure to install the latest version of the [compiler](https://rust-lang.org). +PgDog is written in Rust and uses the latest stable features of the language. Make sure to install the newest version of the toolchain from [rust-lang.org](https://rust-lang.org). ### Compile PgDog -PgDog source code can be downloaded from [GitHub](https://github.com/pgdogdev/pgdog): +Clone the code from our GitHub repository: ```bash git clone https://github.com/pgdogdev/pgdog && \ cd pgdog ``` -PgDog should be compiled in release mode to make sure you get all performance benefits.
You can do this with Cargo: +To make sure you get all performance benefits, PgDog should be compiled in release mode: ```bash cargo build --release @@ -78,7 +76,7 @@ cargo build --release ### Launch PgDog -Starting PgDog can be done by running the binary in `target/release` folder or with Cargo: +You can start PgDog by running the binary directly, located in `target/release/pgdog`, or with Cargo: ```bash cargo run --release @@ -86,30 +84,29 @@ cargo run --release ## Configuration -PgDog is [configured](configuration/index.md) via two files: +PgDog is configured via two files: -* [`pgdog.toml`](configuration/index.md) which contains general pooler settings and PostgreSQL server information -* [`users.toml`](configuration/users.toml/users.md) which contains passwords for users allowed to connect to the pooler +| Configuration file | Description | +|-|-| +| [pgdog.toml](configuration/index.md) | Contains general settings and info about PostgreSQL servers. | +| [users.toml](configuration/users.toml/users.md) | Contains users and passwords that are allowed to connect to PgDog. | -The passwords are stored in a separate file to simplify deployments in environments where -secrets can be safely encrypted, like Kubernetes or AWS EC2. +Users are configured separately to allow them to be encrypted at rest in environments that support it, like Kubernetes or AWS Secrets Manager. -Both files can to be placed in the current working directory (`$PWD`) for PgDog to detect them. Alternatively, -you can specify their location when starting PgDog, using the `--config` and `--users` arguments: +Both config files should be placed in the current working directory (`$PWD`) for PgDog to detect them.
Alternatively, +you can pass their paths at startup as arguments: ```bash -pgdog --config /path/to/pgdog.toml --users path/to/users.toml +pgdog \ + --config /path/to/pgdog.toml \ + --users /path/to/users.toml ``` -#### `pgdog.toml` +#### `pgdog.toml` -Most PgDog configuration options have sensible defaults. This allows a basic, single database configuration, to be pretty short: +Most configuration options have sensible defaults. This makes a single-database configuration pretty short: ```toml -[general] -host = "0.0.0.0" -port = 6432 - [[databases]] name = "postgres" host = "127.0.0.1" @@ -117,7 +114,7 @@ host = "127.0.0.1" #### `users.toml` -This configuration file contains a mapping between databases, users and passwords. Unless you configured [passthrough authentication](features/authentication.md#passthrough-authentication), users not specified in this file won't be able to connect to PgDog: +This config file contains a mapping between databases, users and passwords. Unless you configured [passthrough authentication](features/authentication.md#passthrough-authentication), users not specified in this file won't be able to connect: ```toml [[users]] @@ -127,8 +124,9 @@ password = "hunter2" ``` !!! note + PgDog creates connection pools for each user/database pair. If no users are specified in `users.toml`, - connection pools will not be created at pooler startup. + all connection pools will be disabled and connections to Postgres will not be created. ## Next steps diff --git a/docs/roadmap.md b/docs/roadmap.md index 2dcc32b..07df42d 100644 --- a/docs/roadmap.md +++ b/docs/roadmap.md @@ -27,7 +27,7 @@ Features around query execution in a direct-to-shard or multi-shard context. | Cross-shard queries | Queries spanning multiple shards are supported, for most simple use cases. See below for details. | | Cross-shard sorting | `SELECT ... ORDER BY ...`-style queries work automatically. Data types supported in the `ORDER BY` clause are: `BIGINT`, `INTEGER`, `TEXT`/`VARCHAR`.
Missing: dates/timestamps, other Postgres types. | | Cross-shard aggregates | Basic aggregates like `count`, `max`, `min`, `sum` are supported with/without `GROUP BY` clause. Missing aggregates include: `avg`, `percentile_cont` (and `disc`), `JSON`, and others. Some require query rewriting. | -| Query rewriting | Rewriting queries is only supported for renaming prepared statements. Query rewriting to support aggregates or cross-shard joins is not yet. | +| Query rewriting | Rewriting queries is only supported for renaming prepared statements. Query rewriting to support aggregates or cross-shard joins is not yet implemented. | | Cross-shard joins | Not supported yet. Requires query rewriting and implementing inner/outer hash joins inside PgDog. |
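The cross-shard rows in the table above can be illustrated with a few queries. This is a sketch only: `users` and its columns are hypothetical sharded-table names, and support follows the matrix above.

```postgresql
-- Cross-shard sort: each shard returns its rows and PgDog merges and
-- re-sorts them (works because user_id is a supported ORDER BY type, BIGINT).
SELECT user_id, email FROM users ORDER BY user_id LIMIT 10;

-- Cross-shard aggregate: per-shard counts are combined by PgDog.
SELECT count(*) FROM users;

-- Not yet supported: avg() requires rewriting the query into sum + count.
-- SELECT avg(age) FROM users;
```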