QHB Metrics
QHB metrics are currently accumulated at the database cluster level and apply to all QHB processes, regardless of whether they are background processes or processes serving user connections. Metric data is sent to metricsd, the metric collector and aggregator; a short description of metricsd can be found in the chapter Metric Server. The metrics are then aggregated and stored in Graphite, with Grafana serving as the interface to them.
Metric Types
- Gauge — a non-negative integer value. It can be set, increased, or decreased by a given amount.
- Counter — a non-negative integer value. It can only be increased by a given amount.
- Timer — a non-negative integer value representing a duration in nanoseconds. Values can only be stored; the following statistical characteristics are computed during aggregation:
  - sum — sum of values
  - count — number of stored values
  - max — maximum of stored values
  - min — minimum of stored values
  - mean — arithmetic mean
  - median — median value
  - std — standard deviation
  - a list of percentiles, if configured
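As a sketch, the aggregate characteristics listed above can be computed from a set of raw timer samples like this (the sample values are hypothetical):

```python
import statistics

# Hypothetical raw timer samples (durations in nanoseconds)
# collected during one aggregation interval.
samples = [120, 340, 150, 980, 450, 230, 310]

# The aggregate characteristics described above:
aggregates = {
    "sum": sum(samples),
    "count": len(samples),
    "max": max(samples),
    "min": min(samples),
    "mean": statistics.mean(samples),
    "median": statistics.median(samples),
    "std": statistics.pstdev(samples),  # population standard deviation
}
print(aggregates)
```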
Metric Groups
sys
The sys group includes various system metrics.
Two global configuration parameters control the collection of system metrics:
- qhb_os_monitoring — Boolean parameter, off by default. When set to on, it enables periodic collection of system metrics.
- qhb_os_stat_period — the interval in seconds between collections of system statistics; the default is 30 seconds.
Table 48. Load and CPU Metrics
| Name | Description | Type |
|---|---|---|
| sys.load_average.1min | Load average over the last minute | gauge |
| sys.load_average.5min | The same, but over the last 5 minutes | gauge |
| sys.load_average.15min | The same, but over the last 15 minutes | gauge |
| sys.cpu.1min | CPU load percentage over the last minute | gauge |
| sys.cpu.5min | The same, but over the last 5 minutes | gauge |
| sys.cpu.15min | The same, but over the last 15 minutes | gauge |
Note
Load average values have two decimal places of precision; however, due to technical constraints of data transmission, the raw values are multiplied by 100 and sent as integers. When displaying the data, take this into account and divide the received values by 100.
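For example, converting a raw transmitted value back to a load average (the raw value here is hypothetical):

```python
# The collector sends load average values multiplied by 100 as integers.
# To display the real value, divide by 100:
raw_value = 245               # hypothetical integer received from the metric store
load_average = raw_value / 100
print(load_average)           # 2.45
```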
Load average is the average number of running plus waiting threads over the specified interval (1, 5, and 15 minutes). These values are normally printed by the uptime, top, or cat /proc/loadavg commands. See the article Load Average in Linux for details about the features of this metric.
If the 1-minute average is higher than the 5- or 15-minute averages, load is increasing; if it is lower, load is decreasing. However, this metric has little meaning on its own without the total CPU count. The additional sys.cpu.<N>min metrics (derivatives of the load average) show an approximate CPU load percentage that takes the number of CPUs into account; they are calculated using the following simplified formula:
sys.cpu.<N>min = sys.load_average.<N>min / cpu_count * 100
where cpu_count is the number of CPUs in the system and N takes the values 1, 5, or 15.
The CPU count is calculated as the product of the number of physical sockets, cores per socket, and threads per core. The lscpu command shows all the necessary data. For example:
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
In this case, cpu_count = 2 * 4 * 1 = 8.
An alternative and simpler method is to obtain this value with the nproc command.
Thus, in this example, 100% load is reached when the load average approaches 8. However, these calculations are rather approximate, as explained in the article linked above.
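The calculation above can be sketched as follows, using the lscpu example values; the load average input is hypothetical:

```python
# cpu_count, as described above, is the product of sockets, cores per
# socket, and threads per core (values from the lscpu example):
threads_per_core = 2
cores_per_socket = 4
sockets = 1
cpu_count = sockets * cores_per_socket * threads_per_core  # 8

# sys.cpu.1min derived from a hypothetical sys.load_average.1min value:
load_average_1min = 4.0
cpu_1min = load_average_1min / cpu_count * 100
print(cpu_1min)  # 50.0 -> roughly half of the CPUs are busy
```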
Table 49. Memory Usage Metrics
| Name | Description | Type |
|---|---|---|
| sys.mem.total | Total amount of installed RAM | gauge |
| sys.mem.used | Used memory | gauge |
| sys.mem.free | Free memory | gauge |
| sys.mem.available | Memory available for running applications (not including swap, but taking into account potentially freed memory occupied by the page cache) | gauge |
| sys.swap.total | Total size of the swap file | gauge |
| sys.swap.free | Unused swap file memory | gauge |
The metric values correspond to the output fields of the free utility (the values are displayed in kilobytes, i.e. they match the output of free -k).
Table 50. Matching of metric values to output fields of the free utility
| Metric | Utility free field |
|---|---|
| sys.mem.total | Mem:total |
| sys.mem.used | Mem:used |
| sys.mem.free | Mem:free |
| sys.mem.available | Mem:available |
| sys.swap.total | Swap:total |
| sys.swap.free | Swap:free |
You can calculate the value corresponding to the Mem:buff/cache field of the free utility output using the formula:
Mem:buff/cache = Mem:total - Mem:used - Mem:free
Thus, in Grafana you can calculate and display this value based on other available data, using the diffSeries function.
The Mem:shared value (tmpfs virtual file system data) is not displayed via metrics.
The Swap:used value can be calculated using the formula:
Swap:used = Swap:total - Swap:free
This value can also be displayed in Grafana as a calculated value via diffSeries function.
A more detailed description of these indicators is available in the OS manual page for the free utility (run man free).
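The derived values described above (Mem:buff/cache and Swap:used) can be sketched as follows, with hypothetical gauge values in kilobytes:

```python
# Hypothetical sys.mem.* and sys.swap.* gauge values, in kilobytes:
mem_total, mem_used, mem_free = 16_384_000, 6_144_000, 4_096_000
swap_total, swap_free = 8_192_000, 7_168_000

# The two formulas from the text above:
mem_buff_cache = mem_total - mem_used - mem_free   # Mem:buff/cache
swap_used = swap_total - swap_free                 # Swap:used
print(mem_buff_cache, swap_used)
```

In Grafana, the same subtraction is what the diffSeries function performs over the corresponding metric series.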
Table 51. Disk Usage Metrics
| Name | Description | Type |
|---|---|---|
| sys.disk_space.total | Total space of the disk system where the data directory is located, in bytes | gauge |
| sys.disk_space.free | Free space on the disk system where the data directory is located, in bytes | gauge |
Metrics are applied to the disk system where the directory with the database files is located. This directory is determined by the -D command line parameter when starting the database or by the $PGDATA environment variable. The data_directory parameter in the qhb.conf configuration file can override the location of the data directory.
Table 52. Other System Metrics
| Name | Description | Type |
|---|---|---|
| sys.processes | The total number of processes running in the system | gauge |
| sys.uptime | The number of seconds that have passed since the system started | gauge |
db_stat
The db_stat group contains metrics for block reads and writes at the QHB instance level.
Table 53. Block Read and Write Metrics
| Name | Description | Type |
|---|---|---|
| qhb.db_stat.numbackends | Number of active server processes | gauge |
| qhb.db_stat.blocks_fetched | Number of blocks received during reading | counter |
| qhb.db_stat.blocks_hit | Number of blocks found in cache during reading | counter |
| qhb.db_stat.blocks_read_time | Blocks read time, in milliseconds | counter |
| qhb.db_stat.blocks_write_time | Blocks write time, in milliseconds | counter |
| qhb.db_stat.conflicts.tablespace | Number of queries cancelled due to dropped tablespaces | counter |
| qhb.db_stat.conflicts.lock | Number of queries cancelled due to lock timeouts | counter |
| qhb.db_stat.checksum_failures | Number of data page checksum failures | counter |
| qhb.db_stat.conflicts.snapshot | Number of queries cancelled due to old snapshots | counter |
| qhb.db_stat.conflicts.bufferpin | Number of queries cancelled due to pinned buffers | counter |
| qhb.db_stat.conflicts.startup_deadlock | Number of queries cancelled due to deadlocks | counter |
| qhb.db_stat.tuples.returned | Number of tuples fetched via sequential scan | counter |
| qhb.db_stat.tuples.fetched | Number of tuples fetched via index scan | counter |
| qhb.db_stat.tuples.inserted | Number of tuples inserted to the database | counter |
| qhb.db_stat.tuples.updated | Number of tuples updated in the database | counter |
| qhb.db_stat.tuples.deleted | Number of tuples deleted from the database | counter |
Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.db_stat.db_16384.numbackends.
The cache hit rate is calculated based on the qhb.db_stat.blocks_fetched and qhb.db_stat.blocks_hit metrics:
k = blocks_hit / blocks_fetched * 100%
A value above 90% is usually considered good. If the ratio is significantly below this mark, consider increasing the buffer cache size.
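A sketch of the cache hit rate calculation, with hypothetical counter values:

```python
# Hypothetical counter values:
blocks_fetched = 125_000   # qhb.db_stat.blocks_fetched
blocks_hit = 118_750       # qhb.db_stat.blocks_hit

# k = blocks_hit / blocks_fetched * 100%
hit_rate = blocks_hit / blocks_fetched * 100
print(f"{hit_rate:.1f}%")  # 95.0% -- above the 90% guideline
```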
bgwr
The bgwr group contains background writing process metrics.
Table 54. Background Write Process Metrics
| Name | Description | Type |
|---|---|---|
| qhb.bgwr.checkpoints_timed | Number of planned checkpoints that were executed | counter |
| qhb.bgwr.checkpoints_req | Number of requested checkpoints performed out of the scheduled order | counter |
| qhb.bgwr.checkpoint_write_time | Time spent in the checkpoint processing stage when files are written to disk, in milliseconds | counter |
| qhb.bgwr.checkpoint_sync_time | Time spent in the checkpoint processing stage when files are synchronized to disk, in milliseconds | counter |
| qhb.bgwr.buffers_checkpoint | Number of buffers written during checkpoints | counter |
| qhb.bgwr.buffers_clean | Number of buffers written by the background writer | counter |
| qhb.bgwr.maxwritten_clean | Number of times the background writer stopped flushing dirty pages because it was writing too many buffers | counter |
| qhb.bgwr.buffers_backend | Number of buffers written directly by the backend | counter |
| qhb.bgwr.buffers_backend_fsync | Number of times the backend called fsync itself (usually it is handled by the background writer even when the backend is writing itself) | counter |
| qhb.bgwr.buffers_alloc | Number of allocated buffers | counter |
Data for the listed metrics is displayed in the "Checkpoints and Buffer Operations" section of the QHB dashboard. Typically, for a scheduled checkpoint, the start of the checkpoint is recorded first, blocks are then flushed to disk for some time, and at the end of the checkpoint the write duration and data synchronization information are recorded. For scheduled checkpoints, block writes are spread evenly over time, according to the settings, to reduce the impact of this process on overall I/O. For checkpoints requested via a command, block writes happen immediately, without artificial delay.
Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.bgwr.db_16384.buffers_backend.
mem_stat
The mem_stat group contains metrics for tracking the size of the work area memory.
The work area memory can be roughly described as the memory used by internal sorting operations and hash tables. The size of this memory is controlled by the global configuration parameter work_mem: when the limit is exceeded, temporary files on the disk are used instead of RAM. The work_mem parameter limits the memory used by a single operation.
The mem_stat group includes three types of metrics with the common name work_area:
- qhb.mem_stat.work_area shows the total size of the work area for the entire cluster, in bytes
- qhb.mem_stat.db_<database ID>.work_area shows the size of the work area for one database, in bytes
- qhb.mem_stat.<process ID>.work_area shows the size of the work area for one worker process, in bytes
Metrics of the second and third types are created dynamically.
All metrics are of type gauge, meaning their values may decrease as previously occupied memory is freed.
temp_stat
The temp_stat group includes metrics for QHB temporary files and tables.
Table 55. Metrics for temporary files and tables
| Name | Description | Type |
|---|---|---|
| qhb.temp_stat.temp_files | Number of temporary files created during the aggregation period | counter |
| qhb.temp_stat.temp_bytes | Total size of temporary files created during the aggregation period, in bytes | counter |
| qhb.temp_stat.temp_tables | Number of temporary tables created during the aggregation period | counter |
| qhb.temp_stat.temp_table_bytes | Total size of temporary tables created during the aggregation period, in bytes | counter |
Data is collected over the aggregation period of the metric collector. The data is updated after the corresponding operation completes; for example, for a temporary table created with the ON COMMIT DROP clause, statistics are updated after the COMMIT command that deletes the temporary table. In some cases, the space occupied by temporary files and tables can be quite significant.
Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.temp_stat.db_16384.temp_bytes.
wal
The wal group contains metrics for the write-ahead log archiving process.
Table 56. WAL Archiving Process Metrics
| Name | Description | Type |
|---|---|---|
| qhb.wal.archived | Number of successful write-ahead log file archiving operations. | counter |
| qhb.wal.failed | Number of failed archiving attempts | counter |
| qhb.wal.archive_time | Time spent copying log files, in nanoseconds | counter |
These metrics work if write-ahead log archiving is configured by setting archive_mode to on and defining the archiving command in archive_command.
transaction
The transaction group includes transaction metrics.
Table 57. Transaction Metrics
| Name | Description | Type |
|---|---|---|
| qhb.transaction.commit | Number of transaction commits | counter |
| qhb.transaction.rollback | Number of transaction rollbacks | counter |
| qhb.transaction.deadlocks | Number of deadlocks | counter |
Metric data is collected directly when committing and rolling back transactions at the database cluster level.
The transaction commit ratio is calculated as a percentage of transaction commits to the sum of transaction commits and rollbacks:
k = commit/(commit + rollback)*100%
Typically, the value tends to 100%, since transactions are usually completed successfully. A significant portion of rolled-back transactions may indicate that there are problems in the system.
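A sketch of the commit ratio calculation, with hypothetical counter values:

```python
# Hypothetical counter values:
commit = 9_900      # qhb.transaction.commit
rollback = 100      # qhb.transaction.rollback

# k = commit / (commit + rollback) * 100%
k = commit / (commit + rollback) * 100
print(f"{k:.1f}%")
```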
Deadlocks occur when different sessions mutually wait for locked data to be released. In this case, after the deadlock is automatically detected, one of the transactions is rolled back.
wait
The wait group contains wait event metrics.
This set of metrics fully corresponds to the set of standard wait events. The metrics have the prefix qhb.wait, followed by the wait event class and the wait event name, separated by dots. In the current release, metric names are limited to a maximum of 31 characters. All wait event metrics have the counter type, but the value reflects the total time, in microseconds, that these waits took across all sessions during the collection period.
Note
If there were many user connections in the waiting state during the observation period, the total waiting time could be many times greater than the observation period. For example, if 1,000 sessions spent one second waiting over 10 seconds, the total waiting time would be 1,000 seconds.
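The arithmetic from the note above can be sketched as:

```python
# Total wait time can exceed the observation period when many
# sessions wait concurrently (values from the example in the note):
sessions = 1_000
wait_per_session_s = 1       # each session waited 1 second
observation_period_s = 10

total_wait_s = sessions * wait_per_session_s
print(total_wait_s, total_wait_s > observation_period_s)  # 1000 True
```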
Metrics with the wait event types Lock, LWLock, IO, and IPC are displayed in the "Wait Events" section of the QHB dashboard. The metric values are displayed in microseconds (automatically converted to other units as the values grow). The graphs do not display all wait events, only the five most significant by value on each graph. Different wait events may have very different durations; high variability in the values may indicate emerging problems.
Table 58. The Most Significant Wait Events
| Name | Description |
|---|---|
| Lock.extend | Waiting for relation to be extended. Becomes noticeable when tables are actively growing. The need to allocate new blocks leads to some delays, which are reflected in this metric. |
| Lock.transactionid | Waiting for transaction to be completed. It occurs when a transaction is forced to wait for previous transactions to complete their processing, which also receive confirmation of their completion. |
| Lock.tuple | Waiting for acquiring a lock on a tuple. It occurs when multiple transactions are working on the same data at the same time. |
| LWLock.WALWriteLock | Waiting for WAL buffers to be written to disk. Often it is the leader of this type of wait event, as disk operations are the slowest of this wait event group. |
| LWLock.wal_insert | Waiting for WAL to be inserted into memory buffer. |
| LWLock.buffer_content | Waiting for a data page to be read or written to memory. It occurs and becomes significant during intensive I/O. |
| LWLock.buffer_mapping | Waiting for a data block to be associated with a buffer in the buffer pool. |
| LWLock.lock_manager | Waiting for locks for backend processes to be added or checked. Event becomes significant when transactions are frequent. |
bufmgr
The bufmgr group contains buffer manager metrics related to memory management mechanisms. Metrics are collected during data read operations.
Table 59. Buffer Manager Metrics
| Name | Description | Type | Aggregate |
|---|---|---|---|
| qhb.bufmgr.BufferAlloc | Number of times the buffer was searched | timer | count |
| qhb.bufmgr.BufferAlloc | Total time spent searching for the buffer | timer | sum |
| qhb.bufmgr.happy_path | Number of times the buffer was found immediately | timer | count |
| qhb.bufmgr.happy_path | Total time spent searching when the buffer was found immediately | timer | sum |
| qhb.bufmgr.cache_miss | Number of buffer cache misses | timer | count |
| qhb.bufmgr.cache_miss | Total time spent processing buffer cache misses | timer | sum |
| qhb.bufmgr.disk_read | Number of tuple reads from disk (asynchronous) | timer | count |
| qhb.bufmgr.flush_dirty | Number of times the tuple was flushed to disk (asynchronous) | timer | count |
| qhb.bufmgr.retry_counter | Number of miss re-processings | counter | |
| qhb.bufmgr.strategy_pop_cnt | Number of times a specific buffer popping or eviction strategy was triggered | counter | |
| qhb.bufmgr.strategy_reject_cnt | Number of rejected buffers proposed by the special strategy | counter | |
| tarq_cache.allocate | Number of times the TARQ search was performed | timer | count |
| tarq_cache.allocate | Total time spent on the TARQ search | timer | sum |
| tarq_cache.allocate_new | Number of times the block to be excluded was selected in TARQ | timer | count |
| tarq_cache.rollback | Number of eviction rollbacks in TARQ | timer | count |
| tarq_cache.rollback | Total time spent on eviction rollbacks in TARQ | timer | sum |
| tarq_cache.touch | Total time spent counting popular tuples in TARQ | timer | sum |
lmgr
The lmgr group contains lock manager metrics. Locks are either primary or predicate. Predicate locks are called "locks" for historical reasons; they do not currently lock anything, but are used to track data dependencies between transactions.
The lock manager metrics track the number of busy locks and serializable transactions (transactions with the highest isolation level, Serializable), as well as the number of free slots in the hash tables for each parameter. These metrics are necessary because the sizes of the hash tables storing locks and serializable transactions are limited, and exceeding the limit may cause an out-of-memory message.
The metrics are collected during data read operations.
Table 60. Lock Manager Metrics
| Name | Description | Type |
|---|---|---|
| qhb.lmgr.locks | Number of busy locks | counter |
| qhb.lmgr.locks_available | Number of locks that can still be created | counter |
| qhb.lmgr.proc_locks | Number of busy PGPROC locks | counter |
| qhb.lmgr.proc_locks_available | Number of PGPROC locks that can still be created | counter |
| qhb.lmgr.serializable_xids | Number of active serializable transactions | counter |
| qhb.lmgr.serializable_xids_available | The number of serializable transactions that can still be created | counter |
| qhb.lmgr.pred_locks | Number of busy predicate "locks" | counter |
| qhb.lmgr.pred_locks_available | The number of predicate "locks" that can still be created | counter |
| qhb.lmgr.pred_lock_targets | Number of busy predicate target "locks" | counter |
| qhb.lmgr.pred_lock_targets_available | The number of predicate target "locks" that can still be created | counter |
queryid
The queryid group contains metrics that provide the ability to track the planning and execution statistics of (almost) all SQL statements on the server.
This group is distinctive in that its metrics are created dynamically when the
engine starts processing a new operator. A unique name or query ID is added to
the names of new metrics. For example, the metric qhb.queryid.c92e145f160e7b9e.exec_calls
reflects the execution number for a certain SQL statement. The text of the
statement itself can be obtained from the new system table qhb_queryid
using a query like SELECT * FROM qhb_queryid WHERE qid = 'c92e145f160e7b9e'.
More details on the queryid group metric configuration are described in the Section queryid Metric Configuration.
Table 61. queryid Metrics
| Name | Description | Type |
|---|---|---|
| qhb.queryid.id.plan_calls | Number of times the operator was planned | counter |
| qhb.queryid.id.total_plan_time | Total time spent planning the operator, in nanoseconds | counter |
| qhb.queryid.id.min_plan_time | Minimum time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.id.max_plan_time | Maximum time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.id.mean_plan_time | Mean time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.id.stddev_plan_time | Standard deviation of operator planning time, in nanoseconds | gauge |
| qhb.queryid.id.exec_calls | Number of times the operator was executed | counter |
| qhb.queryid.id.total_exec_time | Total time spent executing the operator, in nanoseconds | counter |
| qhb.queryid.id.min_exec_time | Minimum time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.id.max_exec_time | Maximum time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.id.mean_exec_time | Mean time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.id.stddev_exec_time | Standard deviation of operator executing time, in nanoseconds | gauge |
| qhb.queryid.id.shared_blks_hit | Total number of shared block cache hits for this operator | counter |
| qhb.queryid.id.shared_blks_read | Total number of shared block reads for this operator | counter |
| qhb.queryid.id.shared_blks_dirtied | Total number of shared blocks dirtied by this operator | counter |
| qhb.queryid.id.shared_blks_written | Total number of shared blocks written by this operator | counter |
| qhb.queryid.id.local_blks_hit | Total number of local block cache hits for this operator | counter |
| qhb.queryid.id.local_blks_read | Total number of local block reads for this operator | counter |
| qhb.queryid.id.local_blks_dirtied | Total number of local blocks dirtied by this operator | counter |
| qhb.queryid.id.local_blks_written | Total number of local blocks written by this operator | counter |
| qhb.queryid.id.temp_blks_read | Total number of temporary block reads for this operator | counter |
| qhb.queryid.id.temp_blks_written | Total number of temporary blocks written by this operator | counter |
| qhb.queryid.id.exec_rows | The total number of rows retrieved or affected by this operator | counter |
| qhb.queryid.id.blk_read_time | Total time spent by the operator reading blocks, in milliseconds (if track_io_timing is enabled; otherwise 0) | counter |
| qhb.queryid.id.blk_write_time | Total time spent by the operator writing blocks, in milliseconds (if track_io_timing is enabled; otherwise 0) | counter |
| qhb.queryid.id.wal_records | Total number of WAL records generated by this operator | counter |
| qhb.queryid.id.wal_fpi | Total number of full WAL page images generated by this operator | counter |
| qhb.queryid.id.wal_bytes | Total amount of WAL generated by this operator, in bytes | counter |
qss
The qss group contains metrics related to encryption/decryption processes when working with tables processed by QSS.
Table 62. Metrics for qss tables
| Name | Description | Type |
|---|---|---|
| qhb.qss2.wals.encrypt.count | Number of encrypted WAL messages | counter |
| qhb.qss2.wals.encrypt.bytes | Number of encrypted WAL message bytes | counter |
| qhb.qss2.wals.encrypt.time | Time spent encrypting a WAL message | timer |
| qhb.qss2.wals.decrypt.count | Number of decrypted WAL messages | counter |
| qhb.qss2.wals.decrypt.bytes | Number of decrypted WAL message bytes | counter |
| qhb.qss2.wals.decrypt.time | Time spent decrypting a WAL message | timer |
| qhb.qss2.pages.encrypt.count | Number of encrypted data pages | counter |
| qhb.qss2.pages.encrypt.bytes | The number of encrypted data page bytes | counter |
| qhb.qss2.pages.encrypt.time | Time spent encrypting a data page | timer |
| qhb.qss2.pages.decrypt.count | Number of decrypted data pages | counter |
| qhb.qss2.pages.decrypt.bytes | Number of decrypted data page bytes | counter |
| qhb.qss2.pages.decrypt.time | Time spent decrypting a data page | timer |
| qhb.qss2.blobs.encrypt.count | Number of encrypted non-fixed size objects (Rbytea) | counter |
| qhb.qss2.blobs.encrypt.bytes | The number of encrypted non-fixed size object bytes (Rbytea) | counter |
| qhb.qss2.blobs.encrypt.time | Time spent encrypting non-fixed size objects (Rbytea) | timer |
| qhb.qss2.blobs.decrypt.count | Number of decrypted non-fixed size objects (Rbytea) | counter |
| qhb.qss2.blobs.decrypt.bytes | Number of decrypted non-fixed size object bytes (Rbytea) | counter |
| qhb.qss2.blobs.decrypt.time | Time spent decrypting non-fixed size objects (Rbytea) | timer |
These metrics are generated when QSS is running on the server and an
encrypted table is created (CREATE TABLE <table name> USING qss).
QCP Connection Pool Metrics
Connection pool metrics that characterize the pool's operation are described below. The metrics are collected during query execution operations.
Table 63. QCP Connection Pool Metrics
| Name | Description | Type |
|---|---|---|
| qcp.queue | The number of queries in the queue at the moment | gauge |
| qcp.obtain_backend | Time to assign a backend to serve a client query | timer |
| qcp.obtain_backend_failed | Time spent when the maximum backend assignment time for a client query was exceeded | timer |
| <dbms address>.in_use | Number of used connections (backends) to the DBMS | gauge |
Enabling and Disabling Value Recording for Metric Groups and Categories
QHB has a mechanism that allows you to enable and disable metric sending separately for groups and categories. The data directory contains a configuration file, metrics_config.toml, in which each line corresponds to a metric group:
[is_enabled]
bgwr = { basic = true, dbid = false, pid = false }
bufmgr = { basic = true, dbid = false, pid = false }
db_stat = { basic = true, dbid = false, pid = false }
default = { basic = true, dbid = false, pid = false }
lmgr = { basic = true, dbid = false, pid = false }
mem_stat = { basic = false, dbid = false, pid = false }
queryid = { basic = false, dbid = false, pid = false }
sys = { basic = true, dbid = false, pid = false }
temp_stat = { basic = true, dbid = false, pid = false }
transaction = { basic = true, dbid = false, pid = false }
wait = { basic = true, dbid = false, pid = false }
wal = { basic = true, dbid = false, pid = false }
The variables take the values true (sending group metrics is allowed) or false (sending is disallowed).
Note
The qss group is included in the bgwr group.
Table 64. Setting up Value Recording for Metric Groups
| Metric Group | Description | Default |
|---|---|---|
| default | Default group (various metrics not assigned to separate groups) | true |
| sys | Operating system metrics | true |
| db_stat | Block read and write metrics | true |
| bgwr | Background writer metrics | true |
| mem_stat | "Work area" memory size metrics | false |
| temp_stat | Metrics for temporary files and tables | true |
| wal | WAL file archiving metrics | true |
| transaction | Transaction metrics | true |
| wait | Wait event metrics | true |
| bufmgr | Metrics of memory management mechanisms | true |
| lmgr | Lock (main and predicate) mechanisms metrics | true |
| queryid | SQL statement planning and execution metrics | false |
Note
To collect operating system metrics, you must also set the qhb_os_monitoring parameter to on.
To view the list of metric groups and their current state, you can use the SQL function metrics_config_show. An example of calling this function:
SELECT * FROM metrics_config_show();
group_name | glob | dbid | pid
-------------+-------+------+-----
bgwr | t | f | f
bufmgr | t | f | f
db_stat | t | f | f
default | t | f | f
lmgr | t | f | f
mem_stat | f | f | f
queryid | f | f | f
sys | t | f | f
temp_stat | t | f | f
transaction | t | f | f
wait | t | f | f
wal | t | f | f
(12 rows)
To view the current state of metrics for a specific process, you can use the SQL function metrics_transient_config_show. An integer representing the PID is passed as an argument. An example of calling this function:
SELECT * FROM metrics_transient_config_show(1843);
If the metrics configuration for this process has not been changed, the output is the same as the output of the metrics_config_show function.
In addition to editing the configuration file, these values can be changed via functions. The functions can be called with different numbers of arguments. If the kind argument is omitted, the function applies to all metric categories. If the respect_transient_settings argument is omitted, it defaults to false.
Changes are immediately reflected in the configuration file.
Functions for Enabling and Disabling Sending Metrics
metrics_config_enable_group ( group_name: cstring ) → void
metrics_config_enable_group ( group_name: cstring, respect_transient_settings: boolean ) → void
metrics_config_enable_group ( group_name: cstring, kind: cstring, respect_transient_settings: boolean ) → void
Enables sending values for metrics in a given group/category.
metrics_config_disable_group ( group_name: cstring ) → void
metrics_config_disable_group ( group_name: cstring, respect_transient_settings: boolean ) → void
metrics_config_disable_group ( group_name: cstring, kind: cstring, respect_transient_settings: boolean ) → void
Disables sending values for metrics of the specified group/category.
metrics_config_transient_set ( backend_pid: int4, group_name: cstring, option_value: boolean ) → void
metrics_config_transient_set ( backend_pid: int4, group_name: cstring, kind: cstring, option_value: boolean ) → void
Sets a transient (per-backend) value for sending metrics of the given group/category for the specified backend.
Metric categories are expressed by the following values of the kind parameter: 'basic', 'dbid', 'pid'.
The parameter respect_transient_settings (second or third depending on the function option) in the functions metrics_config_enable_group and metrics_config_disable_group can be used to specify whether to give priority to the metric group sending parameters set at the backend level.
Using the metrics_config_transient_set function, you can enable or disable sending the values of a specified metric group and category for a specific backend. The setting lasts for the lifetime of the backend, or until the metrics_config_enable_group or metrics_config_disable_group function is called with respect_transient_settings = false, which extends the effect of the command to backends where per-backend values were set.
When a backend starts with a PID that was previously used, all transient settings for that PID are reset.
Note
Everything said above about disabling the recording of metric groups does not apply to metrics retrieved through SQL functions. When these functions are used, metrics are always recorded.
QHB Metric Dashboards for Grafana
The QHB dashboards for Grafana are located in the repository at the following link.
QHB is supplied with a metric server that records metric data to Graphite. Grafana serves as the interface to these metrics. The current set of dashboards for Grafana are provided as self-documenting samples, from which users can build dashboards that better suit their needs. Of course, the provided dashboards can also be used as is.
Importing Dashboards
The JSON dashboard descriptions were exported from Grafana 6.7.2.
Before importing the JSON descriptions, decide whether the metric names will contain the host name as a prefix. The panels are built with metric names structured this way, and it is recommended to keep this option. Metric names begin with the $server_name variable, whose default value is your_host_name. Before importing, you can replace this value in the JSON files with the name of one of your hosts. Later, via the Grafana interface, you can list in this variable all the host names (comma-separated) from which metrics will be collected; this lets you quickly switch from one host to another when viewing metrics. If this scheme will not be used (metrics are viewed from a single host only), you can instead remove the $server_name prefix from all metric names in the JSON files before importing. However, this is a more laborious option and is not recommended.
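If you choose to replace the default value before importing, a quick way to do it is a simple text substitution (a sketch: the file name dashboard.json and the host name db1 are illustrative, not from the dashboard set):

```shell
# Stand-in for an exported dashboard JSON description (illustrative content):
printf '{"current": {"text": "your_host_name", "value": "your_host_name"}}\n' > dashboard.json

# Replace the default $server_name value with a real host name:
sed -i 's/your_host_name/db1/g' dashboard.json
```

On the real files you would run the sed command over every exported JSON description.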
To import dashboard descriptions, you need to follow these steps:
- In the Dashboards menu of your Grafana site, select Manage.
- In the opened folder and panel list, select an existing folder or create a new one.
- In the selected folder, select Import in the upper right part of the page.
- On the opened page, either click the Upload json file button in the upper right corner and upload the file, or paste the contents of the JSON file into the field under the Or paste JSON heading and click the Load button.
- Then fill in the required parameters and import the JSON description.
Operating System Dashboard
The dashboard presents the main system indicators:
- QHB instance running time;
- Average load;
- RAM usage;
- Memory usage;
- Usage of the disk system where the database directory is located.
QHB Dashboard
The dashboard contains several sections:
- Transactions;
- Reading and writing blocks;
- Wait events;
- Checkpoints and buffer operations;
- WAL archiving.
Each section contains sets of thematic panels that display key indicators.
Note
By default, tags for various events are not set, so you need to manually configure the output of additional annotations:
- add the necessary comments to the metric data using the qhb_annotation function (see Section Annotations);
- configure the output of available annotations in the Grafana dashboard.
QHB Monitoring for Zabbix
This section gives an example of using a Zabbix 6.2 server to monitor a host running QHB and the Zabbix agent.
Required files:
- template_db_qhb.yaml is the "QHB by Zabbix agent" template file, which needs to be imported to the Zabbix 6.2 server;
- template_db_qhb.conf is the file with user parameters of Zabbix agent for polling QHB;
- the qhb/ directory containing the SQL files accessed by the user parameters.
The archive for Zabbix 6.2 or higher is located in the repository at the following link.
Installation
Note
For more information, see the Zabbix documentation on working with agent templates.
- Install the Zabbix agent on the host with QHB.
- Copy the qhb/ directory to the Zabbix agent home directory /var/lib/zabbix. If the zabbix/ directory does not exist in /var/lib/, create it. The qhb/ directory contains the SQL files required for getting metrics from QHB:
# mkdir -p /var/lib/zabbix/qhb
# cd /var/lib/zabbix/qhb
# wget <server>/zabbix/qhb/qhb.tar
# tar -xvf qhb.tar
# chmod -R 707 /var/lib/zabbix/qhb
# rm -rf qhb.tar
- Copy the template_db_qhb.conf file to the Zabbix agent configuration directory /etc/zabbix/zabbix_agentd.d/:
# wget <server>/zabbix/template_db_qhb.conf
- Create a user zbx_monitor with read-only privileges and access to the QHB cluster:
CREATE USER zbx_monitor WITH PASSWORD '<PASSWORD>' INHERIT;
GRANT pg_monitor TO zbx_monitor;
- Edit the qhb_hba.conf file to allow the Zabbix agent to connect. Open the file in a text editor and add the following line:
host all zbx_monitor 127.0.0.1/32 trust
- Restart QHB and Zabbix agent:
# systemctl restart qhb
# systemctl restart zabbix_agentd
- Import the template_db_qhb.yaml file of the "QHB by Zabbix agent" template on the Zabbix server. For more information, see the Zabbix documentation on importing templates.
- Set the macro parameters {$PG.HOST}, {$PG.PORT}, {$PG.USER}, {$PG.PASSWORD}, {$PG.DB} for the host with QHB.
- Attach the "QHB by Zabbix agent" template to the host with QHB.
Collected Parameters
Table 65. Collected Parameters
| Group | Name | Description |
|---|---|---|
| QHB | Bgwriter: Buffers allocated per second | Number of buffers allocated per second |
| QHB | Bgwriter: Buffers written directly by a backend per second | Number of buffers written directly by a backend per second |
| QHB | Bgwriter: Buffers backend fsync per second | Number of times per second the backend had to call fsync itself (usually it is handled by the background writer even when the backend is writing itself) |
| QHB | Bgwriter: Buffers written during checkpoints per second | Number of buffers written during checkpoints per second |
| QHB | Bgwriter: Buffers written by the background writer per second | Number of buffers written by the background writer per second |
| QHB | Bgwriter: Requested checkpoints per second | Number of requested checkpoints that were executed per second |
| QHB | Bgwriter: Scheduled checkpoints per second | The number of scheduled checkpoints that were executed per second |
| QHB | Bgwriter: Checkpoint sync time | Total time spent synchronizing files to disk while processing checkpoints |
| QHB | Bgwriter: Checkpoint write time | Total time spent writing files to disk during checkpoint processing, in milliseconds |
| QHB | Bgwriter: Max written per second | How many times per second the background writer stopped the flushing scan because it had written too many buffers |
| QHB | Status: Cache hit ratio % | Cache hit rate |
| QHB | Status: Config hash | QHB configuration hash |
| QHB | Connections sum: Active | Total number of connections executing queries |
| QHB | Connections sum: Idle | Total number of connections waiting for a new command from the client |
| QHB | Connections sum: Idle in transaction | Total number of connections in transaction state but not executing a query |
| QHB | Connections sum: Prepared | Total number of prepared transactions |
| QHB | Connections sum: Total | Total number of connections |
| QHB | Connections sum: Total % | Total number of connections in percent |
| QHB | Connections sum: Waiting | Total number of waiting transactions |
| QHB | Status: Ping time | Ping time |
| QHB | Status: Ping | Ping |
| QHB | Replication: standby count | Number of standby servers |
| QHB | Replication: lag in seconds | Replication lag from primary server in seconds |
| QHB | Replication: recovery role | Replication role: 1 - recovery still ongoing (standby server mode), 0 - primary server mode |
| QHB | Replication: status | Replication status: 0 - streaming is down, 1 - streaming is up, 2 - primary server mode |
| QHB | Transactions: Max active transaction time | Current maximum duration of an active transaction |
| QHB | Transactions: Max idle transaction time | Current maximum duration of an idle transaction |
| QHB | Transactions: Max prepared transaction time | Current maximum duration of a prepared transaction |
| QHB | Transactions: Max waiting transaction time | Current maximum duration of a waiting transaction |
| QHB | Status: Uptime | Total system uptime |
| QHB | Status: Version | QHB version |
| QHB | WAL: Segments count | Number of WAL segments |
| QHB | WAL: Bytes written | Size of WAL records in bytes |
| QHB | DB {#DBNAME}: Database size | This database size |
| QHB | DB {#DBNAME}: Blocks hit per second | Number of times disk blocks were found already in the buffer cache, so there was no need to read them |
| QHB | DB {#DBNAME}: Disk blocks read per second | Total number of disk blocks read in this database |
| QHB | DB {#DBNAME}: Detected conflicts per second | Total number of queries cancelled due to recovery conflicts in this database |
| QHB | DB {#DBNAME}: Detected deadlocks per second | Total number of deadlocks detected in this database |
| QHB | DB {#DBNAME}: Temp_bytes written per second | The total amount of data written to temporary files by queries in this database |
| QHB | DB {#DBNAME}: Temp_files created per second | The total number of temporary files created by queries in this database |
| QHB | DB {#DBNAME}: Tuples deleted per second | The total number of tuples deleted by queries in this database |
| QHB | DB {#DBNAME}: Tuples fetched per second | The total number of tuples fetched by queries in this database |
| QHB | DB {#DBNAME}: Tuples inserted per second | The total number of tuples inserted by queries in this database |
| QHB | DB {#DBNAME}: Tuples returned per second | The total number of tuples returned by queries in this database |
| QHB | DB {#DBNAME}: Tuples updated per second | The total number of tuples updated by queries in this database |
| QHB | DB {#DBNAME}: Commits per second | Number of committed transactions in this database |
| QHB | DB {#DBNAME}: Rollbacks per second | Total number of rolled back transactions in this database |
| QHB | DB {#DBNAME}: Frozen XID before autovacuum % | Percentage of frozen XIDs before autovacuum |
| QHB | DB {#DBNAME}: Frozen XID before stop % | Percentage of frozen XIDs before stop |
| QHB | DB {#DBNAME}: Locks total | Total number of locks in this database |
| QHB | DB {#DBNAME}: Queries slow maintenance count | Slow maintenance query counter |
| QHB | DB {#DBNAME}: Queries max maintenance time | Maximum duration of a maintenance query |
| QHB | DB {#DBNAME}: Queries sum maintenance time | Total duration of maintenance queries |
| QHB | DB {#DBNAME}: Queries slow query count | Slow query counter |
| QHB | DB {#DBNAME}: Queries max query time | Maximum duration of a query |
| QHB | DB {#DBNAME}: Queries sum query time | Total duration of queries |
| QHB | DB {#DBNAME}: Queries slow transaction count | Slow transaction query counter |
| QHB | DB {#DBNAME}: Queries max transaction time | Maximum duration of a transaction query |
| QHB | DB {#DBNAME}: Queries sum transaction time | Total duration of transaction queries |
| QHB | DB {#DBNAME}: Index scans per second | Number of index scans in this database |
| QHB | DB {#DBNAME}: Sequential scans per second | Number of sequential scans in this database |
| Zabbix source items | QHB: Get bgwriter | Background writer process activity statistics |
| Zabbix source items | QHB: Get connections sum | Collect all metrics from pg_stat_activity |
| Zabbix source items | QHB: Get dbstat | Collect all metrics from pg_stat_database for each database |
| Zabbix source items | QHB: Get locks | Collect all metrics from pg_locks for each database |
| Zabbix source items | QHB: Get queries | Collect all metrics on query execution time |
| Zabbix source items | QHB: Get transactions | Collect transaction execution time metrics |
| Zabbix source items | QHB: Get WAL | The main item for collecting WAL metrics |
| Zabbix source items | DB {#DBNAME}: Get frozen XID | Number of frozen XIDs |
| Zabbix source items | DB {#DBNAME}: Get scans | The number of scans performed on a table/index in this database |
Triggers
Table 66. Triggers
| Name | Description |
|---|---|
| QHB: Required checkpoints occurs too frequently | Required checkpoints occur too frequently |
| QHB: Cache hit ratio too low | Too low cache hit ratio |
| QHB: Configuration has changed | Configuration has changed |
| QHB: Total number of connections is too high | Total number of connections is too high |
| QHB: Response too long | Response time too long |
| QHB: Service is down | QHB is down |
| QHB: Streaming lag with {#MASTER} is too high | Replication lag from primary server is too high |
| QHB: Replication is down | Replication is down |
| QHB: Service has been restarted | QHB uptime is less than 10 minutes |
| QHB: Version has changed | QHB version has changed |
| DB {#DBNAME}: Too many recovery conflicts | Too many conflicts between primary and standby servers during recovery |
| DB {#DBNAME}: Deadlock occurred | Deadlock occurred |
| DB {#DBNAME}: VACUUM FREEZE is required to prevent wraparound | VACUUM FREEZE is required to prevent XID wraparound |
| DB {#DBNAME}: Number of locks is too high | Number of locks is too high |
| DB {#DBNAME}: Too many slow queries | Too many slow queries |
| QHB: Failed to get items | Zabbix has not received data for items in the last 30 minutes |
Setting up Metric Collection
For the dashboards to display metric data, you need to configure a few settings.
Metric Server Configuration
Metric server configuration is described in Chapter Metric Server.
It is recommended to specify the host name on which the metric server is running in the prefix parameter of the metric server's configuration file /etc/metricsd/config.yaml. If you do this for each server, all metrics will be organized hierarchically, and the first level of the hierarchy will be the server level. That is why the metric names in the proposed dashboards contain the $server_name variable. It is assumed that only one database cluster is running on the host.
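For example, the relevant fragment of the metric server configuration might look like this (a sketch only: the exact structure of config.yaml is defined in Chapter Metric Server, and the host name db1.example.com is illustrative):

```yaml
# /etc/metricsd/config.yaml (fragment)
prefix: db1.example.com   # host name becomes the first level of the metric hierarchy
```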
Configuring Database Settings
To configure sending metrics before QHB 1.3.0, you need to set the metrics_collector_id parameter in qhb.conf to the value with which the metric collector is launched, for example 1001. Starting with QHB 1.3.0, instead of metrics_collector_id, the metrics_collector_path parameter is used, which by default has the value @metrics-collector (represents the path to the Unix domain socket); the metric server by default starts at this address.
To configure sending annotations, you need to specify the following parameters in qhb.conf:
- grafana.address — the Grafana address, for example http://localhost:3000
- grafana.token — the token obtained in Grafana at http://localhost:3000/org/apikeys
Example of settings in qhb.conf for sending metrics and annotations:
# Before QHB 1.3.0
# metrics_collector_id = 1001
# Starting with QHB 1.3.0:
metrics_collector_path = '@metrics-collector'
grafana.address = 'http://localhost:3000'
grafana.token = 'eyJrIjoiNGxTaloxMUNTQkFUMTN0blZqUTN6REN6OWI5YjM1MzMiLCJuIjoidGVzdCIsImlkIjoxfQ=='
Note
If you need to write metric data to CSV files, the path to the Unix domain socket file must be specified as the value of the metrics_collector_path parameter, for example, /tmp/metrics-collector.sock. The same value must be specified in the bind_addr parameter of the collection section in the metric server configuration (/etc/metricsd/config.yaml).
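For example, the two settings might be paired as follows (a sketch: the socket path is the example from above, and the exact YAML layout follows Chapter Metric Server):

```
# qhb.conf:
metrics_collector_path = '/tmp/metrics-collector.sock'

# /etc/metricsd/config.yaml, "collection" section:
#   collection:
#     bind_addr: /tmp/metrics-collector.sock
```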
To collect system metrics (the "Operating System" dashboard), you need to set qhb_os_monitoring to on (enabled). You can also set the system statistics collection period qhb_os_stat_period (the default is 30 seconds). It is not recommended to set this parameter too low, since collecting system statistics incurs some overhead.
You can specify the following in the parameters file:
qhb_os_monitoring = on
qhb_os_stat_period = 60 # if the default period of 30 seconds does not work for you
Or execute the commands:
ALTER SYSTEM SET qhb_os_monitoring = ON;
ALTER SYSTEM SET qhb_os_stat_period = 60;
SELECT pg_reload_conf();
Examples of Using Metrics in SQL Functions
In addition to the built-in metrics, users can record their own metrics via the following SQL functions.
Timer Metric Type
It is used to record a period of time; the unit of measurement is nanoseconds.
SELECT qhb_timer_report('qhb.timer.nano',10000000000 /* 10 seconds in nanoseconds */);
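One way to obtain a nanosecond duration for reporting is to measure it in PL/pgSQL (a sketch, assuming PL/pgSQL and the standard timing functions are available in QHB; the metric name and the timed operation are illustrative):

```sql
DO $$
DECLARE
  t0 timestamptz := clock_timestamp();
  elapsed_ns bigint;
BEGIN
  PERFORM pg_sleep(0.5);  -- the operation being timed
  elapsed_ns := (EXTRACT(EPOCH FROM clock_timestamp() - t0) * 1e9)::bigint;
  PERFORM qhb_timer_report('qhb.example.duration', elapsed_ns);
END $$;
```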
Counter Metric Type
It is used when it is necessary to record the number of events that occurred over a period of time.
SELECT qhb_counter_increase_by('qhb.example.counter',10);
Gauge Metric Type
It is used when you need to set an indicator to a certain value or change it by a given amount.
SELECT qhb_gauge_update('qhb.gauge_example.value', 10); /* Set the value */
SELECT qhb_gauge_add('qhb.gauge_example.value',1); /* Increase the value */
SELECT qhb_gauge_sub('qhb.gauge_example.value',1); /* Decrease the value */
Annotations
Annotations are used when you need to add a comment to the metric data. The first parameter of the function is the comment text; the following parameters are tags.
SELECT qhb_annotation('Test execution started', 'test','billing'); /* Annotation text and two tags */