QHB Metrics

QHB metrics are currently accumulated at the database cluster level and apply to all QHB processes, regardless of whether they are background processes or processes serving user connections. Metric data is sent to metricsd, the metric collector and aggregator; a short description of metricsd can be found in the chapter Metric Server. The metrics are then aggregated and stored in Graphite, with Grafana serving as the interface to them.



Metric Types

  • Gauge — a non-negative integer value. It can be set, increased, or decreased by a given number.
  • Counter — a non-negative integer value. It can only be increased by a given number.
  • Timer — a non-negative integer value representing a duration in nanoseconds. Individual values can only be stored; the following statistical characteristics are computed during aggregation:
    • sum — sum of the values
    • count — number of stored values
    • max — maximum of the stored values
    • min — minimum of the stored values
    • mean — arithmetic mean
    • median — median value
    • std — standard deviation
    • a list of percentiles, if needed


Metric Groups

sys

The sys group includes various system metrics.

There are two configuration parameters that control the collection of system metrics:

  • qhb_os_monitoring — a Boolean parameter; the default is off. When set to on, it enables the periodic collection of system metrics.
  • qhb_os_stat_period — the interval in seconds between collections of system statistics; the default is 30 seconds.

Table 48. Load and CPU Metrics

| Name | Description | Type |
|------|-------------|------|
| sys.load_average.1min | Load average over the last minute | gauge |
| sys.load_average.5min | The same, but over the last 5 minutes | gauge |
| sys.load_average.15min | The same, but over the last 15 minutes | gauge |
| sys.cpu.1min | CPU load percentage over the last minute | gauge |
| sys.cpu.5min | The same, but over the last 5 minutes | gauge |
| sys.cpu.15min | The same, but over the last 15 minutes | gauge |

Note
Load average values have two decimal places of precision; however, due to a technical limitation in data transmission, the raw values are multiplied by 100 and sent as integers. When displaying the data, you should take this into account and divide the resulting values by 100.
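As an illustration, a minimal sketch of the conversion on the consumer side (the metric values here are hypothetical):

```python
# Hypothetical raw gauge values as sent to the collector:
# load average multiplied by 100 and encoded as integers.
raw = {"sys.load_average.1min": 274, "sys.load_average.5min": 181}

# Divide by 100 on the display side to recover the two-decimal values.
display = {name: value / 100 for name, value in raw.items()}
print(display["sys.load_average.1min"])  # 2.74
```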

Load average is the average number of running plus waiting threads over the specified period of time (1, 5, and 15 minutes). These values are normally printed by the uptime and top commands or by cat /proc/loadavg. See the article Load Average in Linux for details about the properties of this metric.

If the 1-minute average is higher than the 5- and 15-minute averages, the load is increasing; if the 1-minute average is lower than the 5- and 15-minute averages, the load is decreasing. However, this metric by itself has little meaning without the total CPU count. The additional sys.cpu.<N>min metrics for CPU load (derived from load average) show the approximate CPU load percentage, taking the number of CPUs into account, and are calculated using the following simplified formula:

sys.cpu.<N>min = sys.load_average.<N>min / cpu_count * 100

where cpu_count is the number of CPUs in the system and N takes the values 1, 5, or 15.

The CPU count is calculated as the product of the number of physical sockets, the number of cores per socket, and the number of threads per core. The lscpu command shows all the necessary data. For example:

Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1

In this case cpu_count = 2 * 4 * 1 = 8.

An alternative and simpler method would be to obtain this value via the nproc command.

Thus, in this case, 100% load is reached when the load average approaches 8. However, these calculations and values are rather approximate, and even notional, as shown in the article mentioned above.
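The calculation above can be sketched as follows (the load average figures are hypothetical examples):

```python
def cpu_percent(load_average: float, cpu_count: int) -> float:
    """Approximate CPU load percentage, as in the sys.cpu.<N>min metrics."""
    return load_average / cpu_count * 100

# The lscpu example from above: 1 socket * 4 cores/socket * 2 threads/core.
cpu_count = 1 * 4 * 2
print(cpu_percent(8.0, cpu_count))  # 100.0: a load average of 8 saturates 8 CPUs
print(cpu_percent(2.0, cpu_count))  # 25.0
```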

Table 49. Memory Usage Metrics

| Name | Description | Type |
|------|-------------|------|
| sys.mem.total | Total size of installed RAM | gauge |
| sys.mem.used | Used memory | gauge |
| sys.mem.free | Free memory | gauge |
| sys.mem.available | Memory available for running applications (not including swap, but accounting for potentially freeable memory occupied by the page cache) | gauge |
| sys.swap.total | Total size of the swap file | gauge |
| sys.swap.free | Unused swap file memory | gauge |

The metric values correspond to the output fields of the free utility (the values are displayed in kilobytes, i.e. they match the output of free -k).

Table 50. Mapping of Metric Values to Output Fields of the free Utility

| Metric | free utility field |
|--------|--------------------|
| sys.mem.total | Mem:total |
| sys.mem.used | Mem:used |
| sys.mem.free | Mem:free |
| sys.mem.available | Mem:available |
| sys.swap.total | Swap:total |
| sys.swap.free | Swap:free |

You can calculate the value corresponding to the Mem:buff/cache field of the free utility's output using the formula:

Mem:buff/cache = Mem:total - Mem:used - Mem:free

Thus, in Grafana you can calculate and display this value from the other available data using the diffSeries function.

The Mem:shared value (tmpfs virtual file system data) is not displayed via metrics.

The Swap:used value can be calculated using the formula:

Swap:used = Swap:total - Swap:free

This value can also be displayed in Grafana as a calculated value via the diffSeries function.
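Both derived values can be checked with simple arithmetic (the memory figures below are hypothetical, in kilobytes as with free -k):

```python
# Hypothetical sys.mem.* / sys.swap.* readings, in kilobytes.
mem_total, mem_used, mem_free = 16_384_000, 6_144_000, 2_048_000
swap_total, swap_free = 4_096_000, 3_072_000

buff_cache = mem_total - mem_used - mem_free  # corresponds to Mem:buff/cache
swap_used = swap_total - swap_free            # corresponds to Swap:used
print(buff_cache, swap_used)  # 8192000 1024000
```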

A more detailed description of these indicators is available in the OS manual page for the free utility (run man free).

Table 51. Disk Usage Metrics

| Name | Description | Type |
|------|-------------|------|
| sys.disk_space.total | Total space on the disk system where the data directory is located, in bytes | gauge |
| sys.disk_space.free | Free space on the disk system where the data directory is located, in bytes | gauge |

Metrics are applied to the disk system where the directory with the database files is located. This directory is determined by the -D command line parameter when starting the database or by the $PGDATA environment variable. The data_directory parameter in the qhb.conf configuration file can override the location of the data directory.

Table 52. Other System Metrics

| Name | Description | Type |
|------|-------------|------|
| sys.processes | Total number of processes running in the system | gauge |
| sys.uptime | Number of seconds that have passed since the system started | gauge |

db_stat

The db_stat group contains metrics for block reads and writes at the QHB instance level.

Table 53. Block Read and Write Metrics

| Name | Description | Type |
|------|-------------|------|
| qhb.db_stat.numbackends | Number of active server processes | gauge |
| qhb.db_stat.blocks_fetched | Number of blocks fetched while reading | counter |
| qhb.db_stat.blocks_hit | Number of blocks found in the cache while reading | counter |
| qhb.db_stat.blocks_read_time | Block read time, in milliseconds | counter |
| qhb.db_stat.blocks_write_time | Block write time, in milliseconds | counter |
| qhb.db_stat.conflicts.tablespace | Number of queries cancelled due to dropped tablespaces | counter |
| qhb.db_stat.conflicts.lock | Number of queries cancelled due to lock timeouts | counter |
| qhb.db_stat.checksum_failures | Number of data page checksum failures | counter |
| qhb.db_stat.conflicts.snapshot | Number of queries cancelled due to old snapshots | counter |
| qhb.db_stat.conflicts.bufferpin | Number of queries cancelled due to pinned buffers | counter |
| qhb.db_stat.conflicts.startup_deadlock | Number of queries cancelled due to deadlocks | counter |
| qhb.db_stat.tuples.returned | Number of tuples fetched via sequential scans | counter |
| qhb.db_stat.tuples.fetched | Number of tuples fetched via index scans | counter |
| qhb.db_stat.tuples.inserted | Number of tuples inserted into the database | counter |
| qhb.db_stat.tuples.updated | Number of tuples updated in the database | counter |
| qhb.db_stat.tuples.deleted | Number of tuples deleted from the database | counter |

Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.db_stat.db_16384.numbackends.

The cache hit rate is calculated based on the qhb.db_stat.blocks_fetched and qhb.db_stat.blocks_hit metrics:

k = blocks_hit / blocks_fetched * 100%

A value above 90% is usually considered a good level. If the ratio is significantly lower than this mark, it is recommended to consider increasing the buffer cache size.
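The ratio can be computed as follows (the counter values are hypothetical; a zero-fetch guard is added, which the formula above leaves implicit):

```python
def cache_hit_rate(blocks_hit: int, blocks_fetched: int) -> float:
    """Buffer cache hit rate in percent; 0.0 when nothing has been fetched yet."""
    if blocks_fetched == 0:
        return 0.0
    return blocks_hit / blocks_fetched * 100

print(cache_hit_rate(950, 1_000))  # 95.0, above the 90% guideline
```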


bgwr

The bgwr group contains background writing process metrics.

Table 54. Background Write Process Metrics

| Name | Description | Type |
|------|-------------|------|
| qhb.bgwr.checkpoints_timed | Number of planned checkpoints that were executed | counter |
| qhb.bgwr.checkpoints_req | Number of requested checkpoints performed outside the schedule | counter |
| qhb.bgwr.checkpoint_write_time | Time spent in the checkpoint processing stage when files are written to disk, in milliseconds | counter |
| qhb.bgwr.checkpoint_sync_time | Time spent in the checkpoint processing stage when files are synchronized to disk, in milliseconds | counter |
| qhb.bgwr.buffers_checkpoint | Number of buffers written during checkpoints | counter |
| qhb.bgwr.buffers_clean | Number of buffers written by the background writer | counter |
| qhb.bgwr.maxwritten_clean | Number of times the background writer stopped flushing dirty pages because it had written too many buffers | counter |
| qhb.bgwr.buffers_backend | Number of buffers written directly by the backend | counter |
| qhb.bgwr.buffers_backend_fsync | Number of times the backend called fsync itself (usually this is handled by the background writer even when the backend does the write) | counter |
| qhb.bgwr.buffers_alloc | Number of allocated buffers | counter |

Data on the listed metrics is displayed in the "Checkpoints and Buffer Operations" section of the QHB dashboard. Typically, when scheduled checkpoints are performed, information about the start of the checkpoint is first recorded, then blocks are flushed to disk for some time, and at the end of the checkpoint, information about the recording duration and data synchronization is recorded. In the case of scheduled checkpoint processing, block writes are evenly distributed over time according to the settings to reduce the impact of this process on overall I/O. In the case of checkpoints requested via a command, block writes occur immediately, without artificial delay.

Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.bgwr.db_16384.buffers_backend.


mem_stat

The mem_stat group contains metrics for tracking the size of the work area memory.

The work area memory can be roughly described as the memory used by internal sorting operations and hash tables. The size of this memory is controlled by the global configuration parameter work_mem: when the limit is exceeded, temporary files on the disk are used instead of RAM. The work_mem parameter limits the memory used by a single operation.

The mem_stat group includes three types of metrics with the common name work_area:

  • qhb.mem_stat.work_area shows the total size of the work area for the entire cluster, in bytes
  • qhb.mem_stat.db_<dbid>.work_area shows the size of the work area for one database, in bytes
  • qhb.mem_stat.<pid>.work_area shows the size of the work area for one worker process, in bytes

Metrics of the second and third types are created dynamically.

All metrics are of type gauge, meaning their values may decrease as previously occupied memory is freed.


temp_stat

The temp_stat group includes metrics for QHB temporary files and tables.

Table 55. Metrics for temporary files and tables

| Name | Description | Type |
|------|-------------|------|
| qhb.temp_stat.temp_files | Number of temporary files created during the aggregation period | counter |
| qhb.temp_stat.temp_bytes | Total size of temporary files created during the aggregation period, in bytes | counter |
| qhb.temp_stat.temp_tables | Number of temporary tables created during the aggregation period | counter |
| qhb.temp_stat.temp_table_bytes | Total size of temporary tables created during the aggregation period, in bytes | counter |

Data is collected over the aggregation period of the metric collector. Data is updated after the corresponding operation completes; for example, when a temporary table is created with the ON COMMIT DROP clause, statistics are updated after the COMMIT command deletes the temporary table. In some cases, the space occupied by temporary files and tables can be quite significant.

Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.temp_stat.db_16384.temp_bytes.


wal

The wal group contains metrics for the write-ahead log archiving process.

Table 56. WAL Archiving Process Metrics

| Name | Description | Type |
|------|-------------|------|
| qhb.wal.archived | Number of successful write-ahead log file archiving operations | counter |
| qhb.wal.failed | Number of failed archiving attempts | counter |
| qhb.wal.archive_time | Time spent copying log files, in nanoseconds | counter |

These metrics work if write-ahead log archiving is configured by setting archive_mode to on and defining the archiving command in archive_command.


transaction

The transaction group includes transaction metrics.

Table 57. Transaction Metrics

| Name | Description | Type |
|------|-------------|------|
| qhb.transaction.commit | Number of transaction commits | counter |
| qhb.transaction.rollback | Number of transaction rollbacks | counter |
| qhb.transaction.deadlocks | Number of deadlocks | counter |

Metric data is collected directly when committing and rolling back transactions at the database cluster level.

The transaction commit ratio is calculated as the percentage of commits among all completed transactions (commits plus rollbacks):

k = commit / (commit + rollback) * 100%

Typically, the value tends to 100%, since transactions usually complete successfully. A significant share of rolled-back transactions may indicate problems in the system.
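A sketch of the calculation (the counter values are hypothetical; a zero-division guard is added for the case when no transactions have completed yet):

```python
def commit_ratio(commit: int, rollback: int) -> float:
    """Percentage of committed transactions among all completed ones."""
    total = commit + rollback
    return commit / total * 100 if total else 100.0

print(commit_ratio(980, 20))  # 98.0
```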

Deadlocks occur when different sessions mutually wait for locked data to be released. In this case, after the deadlock is automatically detected, one of the transactions is rolled back.


wait

The wait group contains wait event metrics.

This set of metrics corresponds exactly to the set of standard wait events. The metrics have the prefix qhb.wait, followed by the wait event class and the wait event name, separated by dots. In the current release, metric names are limited to a maximum of 31 characters. All wait event metrics have the counter type, but the value reflects the total amount of time, in microseconds, that these waits took across all sessions during the collection period.

Note
If many user connections were in a waiting state during the observation period, the total waiting time can be many times greater than the observation period itself. For example, if 1,000 sessions each spent one second waiting during a 10-second period, the total waiting time would be 1,000 seconds.
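One useful derived figure, illustrated here as a hypothetical consumer-side calculation, is the average number of sessions simultaneously in a given wait state during the period:

```python
def avg_concurrent_waiters(total_wait_us: int, period_s: float) -> float:
    """Average number of sessions simultaneously in this wait state."""
    return total_wait_us / (period_s * 1_000_000)

# 1,000 sessions each waiting 1 second within a 10-second period:
print(avg_concurrent_waiters(1_000_000_000, 10))  # 100.0
```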

Metrics with the wait event types Lock, LWLock, IO, and IPC are displayed in the "Wait Events" section of the QHB dashboard. The metric values are displayed in microseconds (and automatically converted to larger units as the values grow). The existing graphs do not display all wait events, only the five most significant by value on each graph. Different wait events may have very different durations. High variability in the values may indicate emerging problems.

Table 58. The Most Significant Wait Events

| Name | Description |
|------|-------------|
| Lock.extend | Waiting for a relation to be extended. Becomes noticeable when tables are actively growing: the need to allocate new blocks leads to some delays, which are reflected in this metric. |
| Lock.transactionid | Waiting for a transaction to complete. Occurs when a transaction is forced to wait for previous transactions to finish their processing and confirm their completion. |
| Lock.tuple | Waiting to acquire a lock on a tuple. Occurs when multiple transactions work with the same data at the same time. |
| LWLock.WALWriteLock | Waiting for WAL buffers to be written to disk. Often the largest wait event of this group, as disk operations are the slowest in it. |
| LWLock.wal_insert | Waiting for WAL data to be inserted into the memory buffer. |
| LWLock.buffer_content | Waiting for a data page in memory to be read or written. Occurs and becomes significant during intensive I/O. |
| LWLock.buffer_mapping | Waiting for a data block to be mapped to a buffer in the buffer pool. |
| LWLock.lock_manager | Waiting for locks for backend processes to be added or checked. Becomes significant when transactions are frequent. |

bufmgr

The bufmgr group contains buffer manager metrics related to memory management mechanisms. Metrics are collected during data read operations.

Table 59. Buffer Manager Metrics

| Name | Description | Type | Aggregate |
|------|-------------|------|-----------|
| qhb.bufmgr.BufferAlloc | Number of times the buffer was searched for | timer | count |
| qhb.bufmgr.BufferAlloc | Total time spent searching for the buffer | timer | sum |
| qhb.bufmgr.happy_path | Number of times the buffer was found immediately | timer | count |
| qhb.bufmgr.happy_path | Total search time when the buffer was found immediately | timer | sum |
| qhb.bufmgr.cache_miss | Number of buffer cache misses | timer | count |
| qhb.bufmgr.cache_miss | Total time spent processing buffer cache misses | timer | sum |
| qhb.bufmgr.disk_read | Number of tuple reads from disk (asynchronous) | timer | count |
| qhb.bufmgr.flush_dirty | Number of times a tuple was flushed to disk (asynchronous) | timer | count |
| qhb.bufmgr.retry_counter | Number of miss re-processings | counter | |
| qhb.bufmgr.strategy_pop_cnt | Number of times a specific buffer popping or eviction strategy was triggered | counter | |
| qhb.bufmgr.strategy_reject_cnt | Number of rejected buffers proposed by the special strategy | counter | |
| tarq_cache.allocate | Number of times the TARQ search was performed | timer | count |
| tarq_cache.allocate | Total time spent on the TARQ search | timer | sum |
| tarq_cache.allocate_new | Number of times the block to be evicted was selected in TARQ | timer | count |
| tarq_cache.rollback | Number of eviction rollbacks in TARQ | timer | count |
| tarq_cache.rollback | Total time spent on eviction rollbacks in TARQ | timer | sum |
| tarq_cache.touch | Total time spent counting popular tuples in TARQ | timer | sum |

lmgr

The lmgr group contains lock manager metrics. Locks are either primary or predicate. Predicate locks are called "locks" for historical reasons; they do not currently lock anything, but are used to track data dependencies between transactions.

The lock manager metrics track the number of busy locks and serializable transactions (transactions with the highest isolation level, Serializable), as well as the number of free slots in the hash tables for each parameter. These metrics are necessary because the sizes of the hash tables storing locks and serializable transactions are limited, and exceeding the limit may cause an out-of-memory message.

The metrics are collected during data read operations.

Table 60. Lock Manager Metrics

| Name | Description | Type |
|------|-------------|------|
| qhb.lmgr.locks | Number of busy locks | counter |
| qhb.lmgr.locks_available | Number of locks that can still be created | counter |
| qhb.lmgr.proc_locks | Number of busy PGPROC locks | counter |
| qhb.lmgr.proc_locks_available | Number of PGPROC locks that can still be created | counter |
| qhb.lmgr.serializable_xids | Number of active serializable transactions | counter |
| qhb.lmgr.serializable_xids_available | Number of serializable transactions that can still be created | counter |
| qhb.lmgr.pred_locks | Number of busy predicate "locks" | counter |
| qhb.lmgr.pred_locks_available | Number of predicate "locks" that can still be created | counter |
| qhb.lmgr.pred_lock_targets | Number of busy predicate target "locks" | counter |
| qhb.lmgr.pred_lock_targets_available | Number of predicate target "locks" that can still be created | counter |

queryid

The queryid group contains metrics that provide the ability to track the planning and execution statistics of (almost) all SQL statements on the server.

This group is distinctive in that its metrics are created dynamically when the engine starts processing a new statement. A unique name, the query ID, is added to the names of the new metrics. For example, the metric qhb.queryid.c92e145f160e7b9e.exec_calls reflects the number of executions of a particular SQL statement. The text of the statement itself can be obtained from the qhb_queryid system table using a query like SELECT * FROM qhb_queryid WHERE qid = 'c92e145f160e7b9e'.

More details on the queryid group metric configuration are described in the Section queryid Metric Configuration.

Table 61. queryid Metrics

| Name | Description | Type |
|------|-------------|------|
| qhb.queryid.<id>.plan_calls | Number of times the operator was planned | counter |
| qhb.queryid.<id>.total_plan_time | Total time spent planning the operator, in nanoseconds | counter |
| qhb.queryid.<id>.min_plan_time | Minimum time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.<id>.max_plan_time | Maximum time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.<id>.mean_plan_time | Mean time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.<id>.stddev_plan_time | Standard deviation of operator planning time, in nanoseconds | gauge |
| qhb.queryid.<id>.exec_calls | Number of times the operator was executed | counter |
| qhb.queryid.<id>.total_exec_time | Total time spent executing the operator, in nanoseconds | counter |
| qhb.queryid.<id>.min_exec_time | Minimum time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.<id>.max_exec_time | Maximum time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.<id>.mean_exec_time | Mean time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.<id>.stddev_exec_time | Standard deviation of operator execution time, in nanoseconds | gauge |
| qhb.queryid.<id>.shared_blks_hit | Total number of shared block cache hits for this operator | counter |
| qhb.queryid.<id>.shared_blks_read | Total number of shared block reads for this operator | counter |
| qhb.queryid.<id>.shared_blks_dirtied | Total number of shared blocks dirtied by this operator | counter |
| qhb.queryid.<id>.shared_blks_written | Total number of shared blocks written by this operator | counter |
| qhb.queryid.<id>.local_blks_hit | Total number of local block cache hits for this operator | counter |
| qhb.queryid.<id>.local_blks_read | Total number of local block reads for this operator | counter |
| qhb.queryid.<id>.local_blks_dirtied | Total number of local blocks dirtied by this operator | counter |
| qhb.queryid.<id>.local_blks_written | Total number of local blocks written by this operator | counter |
| qhb.queryid.<id>.temp_blks_read | Total number of temporary block reads for this operator | counter |
| qhb.queryid.<id>.temp_blks_written | Total number of temporary blocks written by this operator | counter |
| qhb.queryid.<id>.exec_rows | Total number of rows retrieved or affected by this operator | counter |
| qhb.queryid.<id>.blk_read_time | Total time spent by the operator reading blocks, in milliseconds (if track_io_timing is enabled; otherwise 0) | counter |
| qhb.queryid.<id>.blk_write_time | Total time spent by the operator writing blocks, in milliseconds (if track_io_timing is enabled; otherwise 0) | counter |
| qhb.queryid.<id>.wal_records | Total number of WAL records generated by this operator | counter |
| qhb.queryid.<id>.wal_fpi | Total number of full WAL page images generated by this operator | counter |
| qhb.queryid.<id>.wal_bytes | Total amount of WAL generated by this operator, in bytes | counter |
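As a cross-check of the counters above, the average execution time over an interval can be derived from the deltas of two of them (the figures here are hypothetical):

```python
# Hypothetical deltas of two queryid counters over a collection interval.
total_exec_time_ns = 48_000_000
exec_calls = 12

# Average execution time per call over the interval; the mean_exec_time
# gauge reports a similar figure computed by the server itself.
avg_exec_time_ns = total_exec_time_ns / exec_calls
print(avg_exec_time_ns)  # 4000000.0, i.e. 4 ms per execution
```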

qss

The qss group contains metrics related to encryption/decryption processes when working with tables processed by QSS.

Table 62. Metrics for qss tables

| Name | Description | Type |
|------|-------------|------|
| qhb.qss2.wals.encrypt.count | Number of encrypted WAL messages | counter |
| qhb.qss2.wals.encrypt.bytes | Number of encrypted WAL message bytes | counter |
| qhb.qss2.wals.encrypt.time | Time spent encrypting a WAL message | timer |
| qhb.qss2.wals.decrypt.count | Number of decrypted WAL messages | counter |
| qhb.qss2.wals.decrypt.bytes | Number of decrypted WAL message bytes | counter |
| qhb.qss2.wals.decrypt.time | Time spent decrypting a WAL message | timer |
| qhb.qss2.pages.encrypt.count | Number of encrypted data pages | counter |
| qhb.qss2.pages.encrypt.bytes | Number of encrypted data page bytes | counter |
| qhb.qss2.pages.encrypt.time | Time spent encrypting a data page | timer |
| qhb.qss2.pages.decrypt.count | Number of decrypted data pages | counter |
| qhb.qss2.pages.decrypt.bytes | Number of decrypted data page bytes | counter |
| qhb.qss2.pages.decrypt.time | Time spent decrypting a data page | timer |
| qhb.qss2.blobs.encrypt.count | Number of encrypted non-fixed-size objects (Rbytea) | counter |
| qhb.qss2.blobs.encrypt.bytes | Number of encrypted non-fixed-size object bytes (Rbytea) | counter |
| qhb.qss2.blobs.encrypt.time | Time spent encrypting non-fixed-size objects (Rbytea) | timer |
| qhb.qss2.blobs.decrypt.count | Number of decrypted non-fixed-size objects (Rbytea) | counter |
| qhb.qss2.blobs.decrypt.bytes | Number of decrypted non-fixed-size object bytes (Rbytea) | counter |
| qhb.qss2.blobs.decrypt.time | Time spent decrypting non-fixed-size objects (Rbytea) | timer |

These metrics are generated when QSS is running on the server and an encrypted table is created (CREATE TABLE <table name> USING qss).


QCP Connection Pool Metrics

Connection pool metrics that characterize the pool's operation are described below. The metrics are collected during query execution operations.

Table 63. QCP Connection Pool Metrics

| Name | Description | Type |
|------|-------------|------|
| qcp.queue | Number of queries currently in the queue | gauge |
| qcp.obtain_backend | Time taken to assign a backend to fulfill a client query | timer |
| qcp.obtain_backend_failed | Cases where the maximum time to assign a backend to fulfill a client query was exceeded | timer |
| <dbms address>.in_use | Number of used connections (backends) to the DBMS | gauge |


Enabling and Disabling Value Recording for Metric Groups and Categories

QHB has a mechanism that allows you to enable and disable metric sending separately for groups and categories. The data directory contains the configuration file metrics_config.toml, in which each line corresponds to a metric group:

[is_enabled]
bgwr = { basic = true, dbid = false, pid = false }
bufmgr = { basic = true, dbid = false, pid = false }
db_stat = { basic = true, dbid = false, pid = false }
default = { basic = true, dbid = false, pid = false }
lmgr = { basic = true, dbid = false, pid = false }
mem_stat = { basic = false, dbid = false, pid = false }
queryid = { basic = false, dbid = false, pid = false }
sys = { basic = true, dbid = false, pid = false }
temp_stat = { basic = true, dbid = false, pid = false }
transaction = { basic = true, dbid = false, pid = false }
wait = { basic = true, dbid = false, pid = false }
wal = { basic = true, dbid = false, pid = false }

The variables take the values true (sending of the group's metrics is enabled) or false (sending is disabled).

Note
The qss group is included in the bgwr group.

Table 64. Setting up Value Recording for Metric Groups

| Metric Group | Description | Default |
|--------------|-------------|---------|
| default | Default group (various metrics not assigned to separate groups) | true |
| sys | Operating system metrics | true |
| db_stat | Block read and write metrics | true |
| bgwr | Background writer metrics | true |
| mem_stat | "Work area" memory size metrics | false |
| temp_stat | Metrics for temporary files and tables | true |
| wal | WAL file archiving metrics | true |
| transaction | Transaction metrics | true |
| wait | Wait event metrics | true |
| bufmgr | Metrics of memory management mechanisms | true |
| lmgr | Lock (primary and predicate) mechanism metrics | true |
| queryid | SQL statement planning and execution metrics | false |

Note
To collect operating system metrics, you must also set the qhb_os_monitoring parameter to on.

To view the list of metric groups and their current state, you can use the SQL function metrics_config_show. An example of calling this function:

SELECT * FROM metrics_config_show();
group_name  | glob  | dbid | pid
-------------+-------+------+-----
bgwr        | t     | f    | f
bufmgr      | t     | f    | f
db_stat     | t     | f    | f
default     | t     | f    | f
lmgr        | t     | f    | f
mem_stat    | f     | f    | f
queryid     | f     | f    | f
sys         | t     | f    | f
temp_stat   | t     | f    | f
transaction | t     | f    | f
wait        | t     | f    | f
wal         | t     | f    | f
(12 rows)

To view the current state of metrics for a specific process, you can use the SQL function metrics_transient_config_show. An integer representing the PID is passed as an argument. An example of calling this function:

SELECT * FROM metrics_transient_config_show(1843);

If the metrics configuration for this process has not been changed, the output is the same as the output of the metrics_config_show function.

In addition to editing the configuration file, these values can be changed via functions. The functions can be called with different numbers of arguments. If the kind argument is omitted, the function applies to all metric categories. If the respect_transient_settings argument is omitted, its default value is false.

Changes are immediately reflected in the configuration file.


Functions for Enabling and Disabling Sending Metrics

metrics_config_enable_group ( group_name: cstring ) → void
metrics_config_enable_group ( group_name: cstring, respect_transient_settings: boolean ) → void
metrics_config_enable_group ( group_name: cstring, kind: cstring, respect_transient_settings: boolean ) → void

Enables sending of values for metrics of the given group/category.

metrics_config_disable_group ( group_name: cstring ) → void
metrics_config_disable_group ( group_name: cstring, respect_transient_settings: boolean ) → void
metrics_config_disable_group ( group_name: cstring, kind: cstring, respect_transient_settings: boolean ) → void

Disables sending of values for metrics of the given group/category.

metrics_config_transient_set ( backend_pid: int4, group_name: cstring, option_value: boolean ) → void
metrics_config_transient_set ( backend_pid: int4, group_name: cstring, kind: cstring, option_value: boolean ) → void

Sets a temporary (transient) metric-sending setting for the specified backend.

Metric categories are expressed by the following values of the kind parameter: 'basic', 'dbid', 'pid'.

The respect_transient_settings parameter (second or third, depending on the function variant) of the metrics_config_enable_group and metrics_config_disable_group functions specifies whether to give priority to the metric-sending settings set at the backend level.

Using the metrics_config_transient_set function, you can enable or disable sending the values of a given metric group and category for a specific backend. The setting lasts for the entire lifetime of the backend, or until you call metrics_config_enable_group or metrics_config_disable_group with respect_transient_settings = false, which in this case also extends the effect of the command to backends where their own parameter values were set.

When a backend starts with an ID that was previously used, all transient settings for this ID are reset.

Note
Everything said about disabling the recording of metric groups does not apply to metrics that are passed through SQL functions. When using these functions, metrics are always recorded.



QHB Metric Dashboards for Grafana

The QHB dashboards for Grafana are located in the repository at the following link.

QHB is supplied with a metric server that records metric data to Graphite. Grafana serves as the interface to these metrics. The current set of dashboards for Grafana are provided as self-documenting samples, from which users can build dashboards that better suit their needs. Of course, the provided dashboards can also be used as is.


Importing Dashboards

The JSON descriptions of the dashboards were exported from Grafana 6.7.2.

Before importing the JSON description, you need to decide whether the metric names will contain the host name as a prefix. This is how the metric names are structured inside the panels, and it is recommended to keep this scheme. The $server_name variable is prepended to the metric names; by default, its value is your_host_name. Before importing, you can replace this value in the JSON files with the name of one of your hosts. Later, via the Grafana interface, you can add all the host names (comma-separated) from which metrics will be collected to this variable. This allows you to quickly switch from one host to another when viewing metrics.

If this scheme will not be used (that is, metrics from only a single host are viewed), you can remove the $server_name prefix from all metric names in the JSON files before importing. However, this is a more laborious option and is not recommended.

To import dashboard descriptions, you need to follow these steps:

  1. In the Dashboards menu of your Grafana site, select Manage.
  2. In the opened folder and panel list, select an existing folder or create a new one.
  3. In the selected folder, select Import in the upper right part of the page.
  4. On the page that opens, either click the Upload json file button in the upper right corner and upload the file, or paste the contents of the JSON file into the field under the Or paste JSON heading and click the Load button.
  5. Then fill in the required parameters and load the JSON description.

Operating System Dashboard

The dashboard presents the main system indicators:

  • QHB instance running time;
  • Average load;
  • RAM usage;
  • Memory usage;
  • Usage of the disk system where the database directory is located.

QHB Dashboard

The dashboard contains several sections:

  • Transactions;
  • Reading and writing blocks;
  • Wait events;
  • Checkpoints and buffer operations;
  • WAL archiving.

Each section contains a set of thematic panels that display key indicators.

Note
By default, tags for various events are not set, so you need to manually configure the output of additional annotations:

  • add the necessary comments to the metric data using the qhb_annotation function (see Section Annotations);
  • configure the output of available annotations in the Grafana dashboard.


QHB Monitoring for Zabbix

This section gives an example of using a Zabbix 6.2 server to monitor a host running QHB and the Zabbix agent.

Required files:

  • template_db_qhb.yaml is the "QHB by Zabbix agent" template file, which needs to be imported to the Zabbix 6.2 server;
  • template_db_qhb.conf is the file with user parameters of Zabbix agent for polling QHB;
  • the qhb/ directory containing the SQL files accessed by the user parameters.

The archive for Zabbix 6.2 or higher is located in the repository at the following link.


Installation

Note
For more information, see the Zabbix documentation on working with agent templates.

  1. Install the Zabbix agent on the host with QHB.

  2. Copy the qhb/ directory to the Zabbix agent home directory /var/lib/zabbix. If the zabbix/ directory is absent in /var/lib/, it must be created. The qhb/ directory contains the SQL files required for getting metrics from QHB:

# mkdir -p /var/lib/zabbix/qhb
# cd /var/lib/zabbix/qhb
# wget <server>/zabbix/qhb/qhb.tar
# tar -xvf qhb.tar
# chmod -R 707 /var/lib/zabbix/qhb
# rm -rf qhb.tar
  3. Copy the template_db_qhb.conf file to the Zabbix agent configuration directory /etc/zabbix/zabbix_agentd.d/:
# wget <server>/zabbix/template_db_qhb.conf
  4. Create a user zbx_monitor with read-only privileges and access to the QHB cluster:
CREATE USER zbx_monitor WITH PASSWORD '<PASSWORD>' INHERIT;
GRANT pg_monitor TO zbx_monitor;
  5. Edit the qhb_hba.conf file to allow connections from Zabbix. To do this, open it in the nano text editor and add the following line:
host all zbx_monitor 127.0.0.1/32 trust
  6. Restart QHB and the Zabbix agent:
# systemctl restart qhb
# systemctl restart zabbix_agentd   
  7. Import the template_db_qhb.yaml file of the "QHB by Zabbix agent" template on the Zabbix server. For more information, see the Zabbix documentation on importing templates.

  8. Set the macro parameters {$PG.HOST}, {$PG.PORT}, {$PG.USER}, {$PG.PASSWORD}, {$PG.DB} for the host with QHB.

  9. Attach the "QHB by Zabbix agent" template to the host with QHB.


Collected Parameters

Table 65. Collected Parameters

| Group | Name | Description |
|-------|------|-------------|
| QHB | Bgwriter: Buffers allocated per second | Number of buffers allocated per second |
| QHB | Bgwriter: Buffers written directly by a backend per second | Number of buffers written directly by a backend per second |
| QHB | Bgwriter: Buffers backend fsync per second | Number of times per second the backend had to call fsync itself (usually it is handled by the background writer even when the backend is writing itself) |
| QHB | Bgwriter: Buffers written during checkpoints per second | Number of buffers written during checkpoints per second |
| QHB | Bgwriter: Buffers written by the background writer per second | Number of buffers written by the background writer per second |
| QHB | Bgwriter: Requested checkpoints per second | Number of requested checkpoints that were executed per second |
| QHB | Bgwriter: Scheduled checkpoints per second | Number of scheduled checkpoints that were executed per second |
| QHB | Bgwriter: Checkpoint sync time | Total time spent synchronizing files to disk while processing checkpoints |
| QHB | Bgwriter: Checkpoint write time | Total time spent writing files to disk during checkpoint processing, in milliseconds |
| QHB | Bgwriter: Max written per second | How many times per second the background writer stopped the flushing scan because it had written too many buffers |
| QHB | Status: Cache hit ratio % | Cache hit rate |
| QHB | Status: Config hash | QHB configuration hash |
| QHB | Connections sum: Active | Total number of connections executing queries |
| QHB | Connections sum: Idle | Total number of connections waiting for a new command from the client |
| QHB | Connections sum: Idle in transaction | Total number of connections in a transaction state but not executing a query |
| QHB | Connections sum: Prepared | Total number of prepared transactions |
| QHB | Connections sum: Total | Total number of connections |
| QHB | Connections sum: Total % | Total number of connections as a percentage |
| QHB | Connections sum: Waiting | Total number of waiting transactions |
| QHB | Status: Ping time | Ping time |
| QHB | Status: Ping | Ping |
| QHB | Replication: standby count | Number of standby servers |
| QHB | Replication: lag in seconds | Replication lag from the primary server in seconds |
| QHB | Replication: recovery role | Replication role: 1 - recovery still ongoing (standby server mode), 0 - primary server mode |
| QHB | Replication: status | Replication status: 0 - streaming is down, 1 - streaming is up, 2 - primary server mode |
| QHB | Transactions: Max active transaction time | Current maximum duration of an active transaction |
| QHB | Transactions: Max idle transaction time | Current maximum duration of an idle transaction |
| QHB | Transactions: Max prepared transaction time | Current maximum duration of a prepared transaction |
| QHB | Transactions: Max waiting transaction time | Current maximum duration of a waiting transaction |
| QHB | Status: Uptime | Total system uptime |
| QHB | Status: Version | QHB version |
| QHB | WAL: Segments count | Number of WAL segments |
| QHB | WAL: Bytes written | Size of WAL records in bytes |
| QHB | DB {#DBNAME}: Database size | Size of this database |
| QHB | DB {#DBNAME}: Blocks hit per second | Number of times disk blocks were found already in the buffer cache, so there was no need to read them |
| QHB | DB {#DBNAME}: Disk blocks read per second | Total number of disk blocks read in this database |
| QHB | DB {#DBNAME}: Detected conflicts per second | Total number of queries cancelled due to recovery conflicts in this database |
| QHB | DB {#DBNAME}: Detected deadlocks per second | Total number of deadlocks detected in this database |
| QHB | DB {#DBNAME}: Temp_bytes written per second | Total amount of data written to temporary files by queries in this database |
| QHB | DB {#DBNAME}: Temp_files created per second | Total number of temporary files created by queries in this database |
| QHB | DB {#DBNAME}: Tuples deleted per second | Total number of tuples deleted by queries in this database |
| QHB | DB {#DBNAME}: Tuples fetched per second | Total number of tuples fetched by queries in this database |
| QHB | DB {#DBNAME}: Tuples inserted per second | Total number of tuples inserted by queries in this database |
| QHB | DB {#DBNAME}: Tuples returned per second | Total number of tuples returned by queries in this database |
| QHB | DB {#DBNAME}: Tuples updated per second | Total number of tuples updated by queries in this database |
| QHB | DB {#DBNAME}: Commits per second | Number of committed transactions in this database |
| QHB | DB {#DBNAME}: Rollbacks per second | Total number of rolled back transactions in this database |
| QHB | DB {#DBNAME}: Frozen XID before autovacuum % | Percentage of frozen XIDs before autovacuum |
| QHB | DB {#DBNAME}: Frozen XID before stop % | Percentage of frozen XIDs before stop |
| QHB | DB {#DBNAME}: Locks total | Total number of locks in this database |
| QHB | DB {#DBNAME}: Queries slow maintenance count | Slow maintenance query counter |
| QHB | DB {#DBNAME}: Queries max maintenance time | Maximum duration of a maintenance query |
| QHB | DB {#DBNAME}: Queries sum maintenance time | Total duration of maintenance queries |
| QHB | DB {#DBNAME}: Queries slow query count | Slow query counter |
| QHB | DB {#DBNAME}: Queries max query time | Maximum duration of a query |
| QHB | DB {#DBNAME}: Queries sum query time | Total duration of queries |
| QHB | DB {#DBNAME}: Queries slow transaction count | Slow transaction counter |
| QHB | DB {#DBNAME}: Queries max transaction time | Maximum duration of a transaction |
| QHB | DB {#DBNAME}: Queries sum transaction time | Total duration of transactions |
| QHB | DB {#DBNAME}: Index scans per second | Number of index scans in this database |
| QHB | DB {#DBNAME}: Sequential scans per second | Number of sequential scans in this database |
| Zabbix source items | QHB: Get bgwriter | Background writer activity statistics |
| Zabbix source items | QHB: Get connections sum | Collect all metrics from pg_stat_activity |
| Zabbix source items | QHB: Get dbstat | Collect all metrics from pg_stat_database for each database |
| Zabbix source items | QHB: Get locks | Collect all metrics from pg_locks for each database |
| Zabbix source items | QHB: Get queries | Collect all metrics on query execution time |
| Zabbix source items | QHB: Get transactions | Collect transaction execution time metrics |
| Zabbix source items | QHB: Get WAL | The main item for collecting WAL metrics |
| Zabbix source items | DB {#DBNAME}: Get frozen XID | Number of frozen XIDs |
| Zabbix source items | DB {#DBNAME}: Get scans | Number of scans performed on tables/indexes in this database |

Triggers

Table 66. Triggers

| Name | Description |
|------|-------------|
| QHB: Required checkpoints occur too frequently | Required checkpoints occur too frequently |
| QHB: Cache hit ratio too low | Cache hit ratio is too low |
| QHB: Configuration has changed | Configuration has changed |
| QHB: Total number of connections is too high | Total number of connections is too high |
| QHB: Response too long | Response time is too long |
| QHB: Service is down | QHB is down |
| QHB: Streaming lag with {#MASTER} is too high | Replication lag from the primary server is too high |
| QHB: Replication is down | Replication is down |
| QHB: Service has been restarted | QHB uptime is less than 10 minutes |
| QHB: Version has changed | QHB version has changed |
| DB {#DBNAME}: Too many recovery conflicts | Too many conflicts between primary and standby servers during recovery |
| DB {#DBNAME}: Deadlock occurred | Deadlock occurred |
| DB {#DBNAME}: VACUUM FREEZE is required to prevent wraparound | VACUUM FREEZE is required to prevent XID wraparound |
| DB {#DBNAME}: Number of locks is too high | Number of locks is too high |
| DB {#DBNAME}: Too many slow queries | Too many slow queries |
| QHB: Failed to get items | Zabbix has not received data for items in the last 30 minutes |


Setting up Metric Collection

In order for the dashboards to display metric data, you need to configure a few settings.


Metric Server Configuration

Metric server configuration is described in Chapter Metric Server.

It is recommended to specify the host name on which the metric server is running in the prefix parameter of the metric server's configuration file /etc/metricsd/config.yaml. If you do this for each server, all metrics will be organized hierarchically, and the first level of the hierarchy will be the server level. That is why the metric names in the proposed dashboards contain the $server_name variable. It is assumed that only one database cluster is running on the host.
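As a sketch, the corresponding fragment of /etc/metricsd/config.yaml could look like the following. Here db01 is a placeholder host name, and apart from the prefix parameter named above, the exact key layout is an assumption about your metricsd version:

```yaml
# /etc/metricsd/config.yaml (fragment)
# Use this host's name as the first level of the metric hierarchy,
# so metrics arrive in Graphite as db01.<metric_name>.
prefix: db01
```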


Configuring Database Settings

Before QHB 1.3.0, to configure the sending of metrics you need to set the metrics_collector_id parameter in qhb.conf to the ID with which the metric collector is launched, for example 1001. Starting with QHB 1.3.0, the metrics_collector_path parameter is used instead of metrics_collector_id; its default value @metrics-collector is the path to a Unix domain socket, and the metric server listens at this address by default.

To configure sending annotations, you need to specify the following parameters in qhb.conf:

  • grafana.address — the Grafana address, for example http://localhost:3000;
  • grafana.token — the API token obtained in Grafana at http://localhost:3000/org/apikeys.

Example of settings in qhb.conf for sending metrics and annotations:

# Before QHB 1.3.0
# metrics_collector_id = 1001  

# Starting with QHB 1.3.0:
metrics_collector_path = '@metrics-collector'

grafana.address = 'http://localhost:3000'
grafana.token = 'eyJrIjoiNGxTaloxMUNTQkFUMTN0blZqUTN6REN6OWI5YjM1MzMiLCJuIjoidGVzdCIsImlkIjoxfQ=='

Note
If you need to write metric data to CSV files, the path to the Unix domain socket file must be specified as the value of the metrics_collector_path parameter, for example, /tmp/metrics-collector.sock. The same value must be specified in the bind_addr parameter of the collection section in the metric server configuration (/etc/metricsd/config.yaml).
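A sketch of the two settings kept in sync, following the note above (the socket path /tmp/metrics-collector.sock is just an example):

```
# qhb.conf
metrics_collector_path = '/tmp/metrics-collector.sock'
```

```yaml
# /etc/metricsd/config.yaml (fragment)
collection:
  bind_addr: /tmp/metrics-collector.sock
```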

To collect system metrics (the "Operating System" dashboard), you need to set qhb_os_monitoring to on (enabled). You can also set the system statistics collection period qhb_os_stat_period (the default is 30 seconds). It is not recommended to set this parameter to a value that is too low, since collecting system statistics requires some overhead.

You can specify the following in the parameters file:

qhb_os_monitoring = on
qhb_os_stat_period = 60 # if the default period of 30 seconds does not work for you

Or execute the commands:

ALTER SYSTEM SET qhb_os_monitoring = ON;
ALTER SYSTEM SET qhb_os_stat_period = 60;
SELECT pg_reload_conf();


Examples of Using Metrics in SQL Functions

In addition to the built-in metrics, users can use their own metrics via the following SQL functions.


Timer Metric Type

This type is used to record a period of time; the unit of measurement is nanoseconds.

SELECT qhb_timer_report('qhb.timer.nano',10000000000 /* 10 seconds in nanoseconds */);

Counter Metric Type

This type is used when you need to record the number of events that occurred over a period of time.

SELECT qhb_counter_increase_by('qhb.example.counter',10);

Gauge Metric Type

This type is used when you need to set an indicator to a certain value or change it by a given amount.

SELECT qhb_gauge_update('qhb.gauge_example.value', 10); /* Set the value */
SELECT qhb_gauge_add('qhb.gauge_example.value',1); /* Increase the value */
SELECT qhb_gauge_sub('qhb.gauge_example.value',1); /* Decrease the value */

Annotations

Annotations are used when you need to attach a comment to the metric data. The first parameter of the function is the comment text; the remaining parameters are tags.

SELECT qhb_annotation('Test run started', 'test', 'billing'); /* Annotation text and two tags */