17.5. Write Ahead Log

Note: The following description applies both to Postgres-XC and PostgreSQL if not described explicitly.

See also Section 28.4 for details on WAL and checkpoint tuning.

17.5.1. Settings

wal_level (enum)

wal_level determines how much information is written to the WAL. The default value is minimal, which writes only the information needed to recover from a crash or immediate shutdown. archive adds logging required for WAL archiving, and hot_standby further adds information required to run read-only queries on a standby server. This parameter can only be set at server start.

In minimal level, WAL-logging of some bulk operations, like CREATE INDEX, CLUSTER and COPY on a table that was created or truncated in the same transaction can be safely skipped, which can make those operations much faster (see Section 14.4.7). But minimal WAL does not contain enough information to reconstruct the data from a base backup and the WAL logs, so either archive or hot_standby level must be used to enable WAL archiving (archive_mode) and streaming replication.

fsync (boolean)

If this parameter is on, the Postgres-XC server will try to make sure that updates are physically written to disk, by issuing fsync() system calls or various equivalent methods (see wal_sync_method). This ensures that the database cluster can recover to a consistent state after an operating system or hardware crash.

While turning off fsync is often a performance benefit, this can result in unrecoverable data corruption in the event of a power failure or system crash. Thus it is only advisable to turn off fsync if you can easily recreate your entire database from external data.

Examples of safe circumstances for turning off fsync include the initial loading of a new database cluster from a backup file, using a database cluster for processing a batch of data after which the database will be thrown away and recreated, or for a read-only database clone which gets recreated frequently and is not used for failover. High quality hardware alone is not a sufficient justification for turning off fsync.

In many situations, turning off synchronous_commit for noncritical transactions can provide much of the potential performance benefit of turning off fsync, without the attendant risks of data corruption.

fsync can only be set in the postgresql.conf file or on the server command line. If you turn this parameter off, also consider turning off full_page_writes.

synchronous_commit (boolean)

Specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a "success" indication to the client. Valid values are on, local, and off. The default, and safe, value is on. When off, there can be a delay between when success is reported to the client and when the transaction is really guaranteed to be safe against a server crash. (The maximum delay is three times wal_writer_delay.) Unlike fsync, setting this parameter to off does not create any risk of database inconsistency: an operating system or database crash might result in some recent allegedly-committed transactions being lost, but the database state will be just the same as if those transactions had been aborted cleanly. So, turning synchronous_commit off can be a useful alternative when performance is more important than exact certainty about the durability of a transaction. For more discussion see Section 28.3.

If synchronous_standby_names is set, this parameter also controls whether or not transaction commit will wait for the transaction's WAL records to be flushed to disk and replicated to the standby server. The commit wait will last until a reply from the current synchronous standby indicates it has written the commit record of the transaction to durable storage. If synchronous replication is in use, it will normally be sensible either to wait both for WAL records to reach both the local and remote disks, or to allow the transaction to commit asynchronously. However, the special value local is available for transactions that wish to wait for local flush to disk, but not synchronous replication.

This parameter can be changed at any time; the behavior for any one transaction is determined by the setting in effect when it commits. It is therefore possible, and useful, to have some transactions commit synchronously and others asynchronously. For example, to make a single multistatement transaction commit asynchronously when the default is the opposite, issue SET LOCAL synchronous_commit TO OFF within the transaction.

wal_sync_method (enum)

Method used for forcing WAL updates out to disk. If fsync is off then this setting is irrelevant, since WAL file updates will not be forced out at all. Possible values are:

  • open_datasync (write WAL files with open() option O_DSYNC)

  • fdatasync (call fdatasync() at each commit)

  • fsync (call fsync() at each commit)

  • fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)

  • open_sync (write WAL files with open() option O_SYNC)

The open_* options also use O_DIRECT if available. Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform, except that fdatasync is the default on Linux. The default is not necessarily ideal; it might be necessary to change this setting or other aspects of your system configuration in order to create a crash-safe configuration or achieve optimal performance. These aspects are discussed in Section 28.1. This parameter can only be set in the postgresql.conf file or on the server command line.

full_page_writes (boolean)

When this parameter is on, the Postgres-XC server writes the entire content of each disk page to WAL during the first modification of that page after a checkpoint. This is needed because a page write that is in process during an operating system crash might be only partially completed, leading to an on-disk page that contains a mix of old and new data. The row-level change data normally stored in WAL will not be enough to completely restore such a page during post-crash recovery. Storing the full page image guarantees that the page can be correctly restored, but at the price of increasing the amount of data that must be written to WAL. (Because WAL replay always starts from a checkpoint, it is sufficient to do this during the first change of each page after a checkpoint. Therefore, one way to reduce the cost of full-page writes is to increase the checkpoint interval parameters.)

Turning this parameter off speeds normal operation, but might lead to either unrecoverable data corruption, or silent data corruption, after a system failure. The risks are similar to turning off fsync, though smaller, and it should be turned off only based on the same circumstances recommended for that parameter.

Turning off this parameter does not affect use of WAL archiving for point-in-time recovery (PITR) (see Section 23.3).

This parameter can only be set in the postgresql.conf file or on the server command line. The default is on.

wal_buffers (integer)

The amount of shared memory used for WAL data that has not yet been written to disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of shared_buffers, but not less than 64kB nor more than the size of one WAL segment, typically 16MB. This value can be set manually if the automatic choice is too large or too small, but any positive value less than 32kB will be treated as 32kB. This parameter can only be set at server start.

The contents of the WAL buffers are written out to disk at every transaction commit, so extremely large values are unlikely to provide a significant benefit. However, setting this value to at least a few megabytes can improve write performance on a busy server where many clients are committing at once. The auto-tuning selected by the default setting of -1 should give reasonable results in most cases.

Increasing this parameter might cause Postgres-XC to request more System V shared memory than your operating system's default configuration allows. See Section 16.4.1 for information on how to adjust those parameters, if necessary.

wal_writer_delay (integer)

Specifies the delay between activity rounds for the WAL writer. In each round the writer will flush WAL to disk. It then sleeps for wal_writer_delay milliseconds, and repeats. The default value is 200 milliseconds (200ms). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

commit_delay (integer)

When the commit data for a transaction is flushed to disk, any additional commits ready at that time are also flushed out. commit_delay adds a time delay, set in microseconds, before a transaction attempts to flush the WAL buffer out to disk. A nonzero delay can allow more transactions to be committed with only one flush operation, if system load is high enough that additional transactions become ready to commit within the given interval. But the delay is just wasted if no other transactions become ready to commit. Therefore, the delay is only performed if at least commit_siblings other transactions are active at the instant that a server process has written its commit record. The default commit_delay is zero (no delay). Since all pending commit data will be written at every flush regardless of this setting, it is rare that adding delay by increasing this parameter will actually improve performance.

commit_siblings (integer)

Minimum number of concurrent open transactions to require before performing the commit_delay delay. A larger value makes it more probable that at least one other transaction will become ready to commit during the delay interval. The default is five transactions.

17.5.2. Checkpoints

Note: The following description applies both to Postgres-XC and PostgreSQL if not described explicitly.

checkpoint_segments (integer)

Maximum number of log file segments between automatic WAL checkpoints (each segment is normally 16 megabytes). The default is three segments. Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.

checkpoint_timeout (integer)

Maximum time between automatic WAL checkpoints, in seconds. The default is five minutes (5min). Increasing this parameter can increase the amount of time needed for crash recovery. This parameter can only be set in the postgresql.conf file or on the server command line.

checkpoint_completion_target (floating point)

Specifies the target of checkpoint completion, as a fraction of total time between checkpoints. The default is 0.5. This parameter can only be set in the postgresql.conf file or on the server command line.

checkpoint_warning (integer)

Write a message to the server log if checkpoints caused by the filling of checkpoint segment files happen closer together than this many seconds (which suggests that checkpoint_segments ought to be raised). The default is 30 seconds (30s). Zero disables the warning. This parameter can only be set in the postgresql.conf file or on the server command line.

17.5.3. Archiving

Note: The following description applies both to Postgres-XC and PostgreSQL if not described explicitly.

archive_mode (boolean)

When archive_mode is enabled, completed WAL segments are sent to archive storage by setting archive_command. archive_mode and archive_command are separate variables so that archive_command can be changed without leaving archiving mode. This parameter can only be set at server start. wal_level must be set to archive or hot_standby to enable archive_mode.

archive_command (string)

The shell command to execute to archive a completed WAL file segment. Any %p in the string is replaced by the path name of the file to archive, and any %f is replaced by only the file name. (The path name is relative to the working directory of the server, i.e., the cluster's data directory.) Use %% to embed an actual % character in the command. It is important for the command to return a zero exit status only if it succeeds. For more information see Section 23.3.1.

This parameter can only be set in the postgresql.conf file or on the server command line. It is ignored unless archive_mode was enabled at server start. If archive_command is an empty string (the default) while archive_mode is enabled, WAL archiving is temporarily disabled, but the server continues to accumulate WAL segment files in the expectation that a command will soon be provided. Setting archive_command to a command that does nothing but return true, e.g. /bin/true (REM on Windows), effectively disables archiving, but also breaks the chain of WAL files needed for archive recovery, so it should only be used in unusual circumstances.

archive_timeout (integer)

The archive_command is only invoked for completed WAL segments. Hence, if your server generates little WAL traffic (or has slack periods where it does so), there could be a long delay between the completion of a transaction and its safe recording in archive storage. To limit how old unarchived data can be, you can set archive_timeout to force the server to switch to a new WAL segment file periodically. When this parameter is greater than zero, the server will switch to a new segment file whenever this many seconds have elapsed since the last segment file switch, and there has been any database activity, including a single checkpoint. (Increasing checkpoint_timeout will reduce unnecessary checkpoints on an idle system.) Note that archived files that are closed early due to a forced switch are still the same length as completely full files. Therefore, it is unwise to use a very short archive_timeout — it will bloat your archive storage. archive_timeout settings of a minute or so are usually reasonable. This parameter can only be set in the postgresql.conf file or on the server command line.

17.5.4. Streaming Replication

Note: The following description applies only to PostgreSQL

Streaming replication has not been tested with Postgres-XC yet. Because this version of streaming replication is based upon asynchronous log shipping, there could be a risk to have the status of Coordinators and Datanodes inconsistent. The development team leaves the test and the use of this entirely to users.

These settings control the behavior of the built-in streaming replication feature. These parameters would be set on the primary server that is to send replication data to one or more standby servers.

max_wal_senders (integer)

Specifies the maximum number of concurrent connections from standby servers or streaming base backup clients (i.e., the maximum number of simultaneously running WAL sender processes). The default is zero. This parameter can only be set at server start. wal_level must be set to archive or hot_standby to allow connections from standby servers.

wal_sender_delay (integer)

Specifies the delay between activity rounds for WAL sender processes. In each round the WAL sender sends any WAL accumulated since the last round to the standby server. It then sleeps for wal_sender_delay milliseconds, and repeats. The sleep is interrupted by transaction commit, so the effects of a committed transaction are sent to standby servers as soon as the commit happens, regardless of this setting. The default value is one second (1s). Note that on many systems, the effective resolution of sleep delays is 10 milliseconds; setting wal_sender_delay to a value that is not a multiple of 10 might have the same results as setting it to the next higher multiple of 10. This parameter can only be set in the postgresql.conf file or on the server command line.

wal_keep_segments (integer)

Specifies the minimum number of past log file segments kept in the pg_xlog directory, in case a standby server needs to fetch them for streaming replication. Each segment is normally 16 megabytes. If a standby server connected to the primary falls behind by more than wal_keep_segments segments, the primary might remove a WAL segment still needed by the standby, in which case the replication connection will be terminated. (However, the standby server can recover by fetching the segment from archive, if WAL archiving is in use.)

This sets only the minimum number of segments retained in pg_xlog; the system might need to retain more segments for WAL archival or to recover from a checkpoint. If wal_keep_segments is zero (the default), the system doesn't keep any extra segments for standby purposes, and the number of old WAL segments available to standby servers is a function of the location of the previous checkpoint and status of WAL archiving. This parameter has no effect on restartpoints. This parameter can only be set in the postgresql.conf file or on the server command line.

vacuum_defer_cleanup_age (integer)

Specifies the number of transactions by which VACUUM and HOT updates will defer cleanup of dead row versions. The default is zero transactions, meaning that dead row versions can be removed as soon as possible, that is, as soon as they are no longer visible to any open transaction. You may wish to set this to a non-zero value on a primary server that is supporting hot standby servers, as described in Section 24.5. This allows more time for queries on the standby to complete without incurring conflicts due to early cleanup of rows. However, since the value is measured in terms of number of write transactions occurring on the primary server, it is difficult to predict just how much additional grace time will be made available to standby queries. This parameter can only be set in the postgresql.conf file or on the server command line.

You should also consider setting hot_standby_feedback as an alternative to using this parameter.

replication_timeout (integer)

Terminate replication connections that are inactive longer than the specified number of milliseconds. This is useful for the primary server to detect a standby crash or network outage. A value of zero disables the timeout mechanism. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is 60 seconds.

To prevent connections from being terminated prematurely, wal_receiver_status_interval must be enabled on the standby, and its value must be less than the value of replication_timeout.

17.5.5. Synchronous Replication

These settings control the behavior of the built-in synchronous replication feature. These parameters would be set on the primary server that is to send replication data to one or more standby servers.

synchronous_standby_names (string)

Specifies a priority ordered list of standby names that can offer synchronous replication. At any one time there will be at most one synchronous standby that will wake sleeping users following commit. The synchronous standby will be the first named standby that is both currently connected and streaming in real-time to the standby (as shown by a state of "STREAMING"). Other standby servers with listed later will become potential synchronous standbys. If the current synchronous standby disconnects for whatever reason it will be replaced immediately with the next highest priority standby. Specifying more than one standby name can allow very high availability.

The standby name is currently taken as the application_name of the standby, as set in the primary_conninfo on the standby. Names are not enforced for uniqueness. In case of duplicates one of the standbys will be chosen to be the synchronous standby, though exactly which one is indeterminate. The special entry * matches any application_name, including the default application name of walreceiver.

If no synchronous standby names are specified, then synchronous replication is not enabled and transaction commit will never wait for replication. This is the default configuration. Even when synchronous replication is enabled, individual transactions can be configured not to wait for replication by setting the synchronous_commit parameter to local or off.

17.5.6. Standby Servers

These settings control the behavior of a standby server that is to receive replication data.

max_standby_archive_delay (integer)

max_standby_archive_delay applies when WAL data is being read from WAL archive (and is therefore not current). The default is 30 seconds. Units are milliseconds if not specified. A value of -1 allows the standby to wait forever for conflicting queries to complete. This parameter can only be set in the postgresql.conf file or on the server command line.

Note that max_standby_archive_delay is not the same as the maximum length of time a query can run before cancellation; rather it is the maximum total time allowed to apply any one WAL segment's data. Thus, if one query has resulted in significant delay earlier in the WAL segment, subsequent conflicting queries will have much less grace time.

max_standby_streaming_delay (integer)

max_standby_streaming_delay applies when WAL data is being received via streaming replication. The default is 30 seconds. Units are milliseconds if not specified. A value of -1 allows the standby to wait forever for conflicting queries to complete. This parameter can only be set in the postgresql.conf file or on the server command line.

Note that max_standby_streaming_delay is not the same as the maximum length of time a query can run before cancellation; rather it is the maximum total time allowed to apply WAL data once it has been received from the primary server. Thus, if one query has resulted in significant delay, subsequent conflicting queries will have much less grace time until the standby server has caught up again.

wal_receiver_status_interval (integer)

Specifies the minimum frequency, in seconds, for the WAL receiver process on the standby to send information about replication progress to the primary, where they can be seen using the pg_stat_replication view. The standby will report the last transaction log position it has written, the last position it has flushed to disk, and the last position it has applied. Updates are sent each time the write or flush positions changed, or at least as often as specified by this parameter. Thus, the apply position may lag slightly behind the true position. Setting this parameter to zero disables status updates completely. This parameter can only be set in the postgresql.conf file or on the server command line. The default value is 10 seconds.

When replication_timeout is enabled on the primary, wal_receiver_status_interval must be enabled, and its value must be less than the value of replication_timeout.