QHB Metrics
QHB metrics are currently accumulated at the database cluster level and apply to all QHB processes, regardless of whether they are background processes or processes serving user connections. Metric data is sent to metricsd, the metric collector and aggregator; a short description of metricsd can be found in the chapter Metric Server. The metrics are then aggregated and stored in Graphite, with Grafana serving as the interface to them.
Metric Types
- Gauge — a non-negative integer value. It can be set, increased, or decreased by a given amount.
- Counter — a non-negative integer value. It can only be increased by a given amount.
- Timer — a non-negative integer value representing a duration in nanoseconds. Values can only be stored; the following statistical characteristics are computed during aggregation:
  - sum — sum of values
  - count — number of stored values
  - max — maximum of stored values
  - min — minimum of stored values
  - mean — arithmetic mean
  - median — median value
  - std — standard deviation
  - a list of percentiles, if configured
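As a sketch, the aggregate characteristics listed above can be computed from a set of raw timer samples like this (the sample values are hypothetical):

```python
import statistics

# Hypothetical raw timer samples (durations in nanoseconds)
# collected during one aggregation interval.
samples = [120, 340, 150, 980, 450, 230, 310]

# The aggregate characteristics described above:
aggregates = {
    "sum": sum(samples),
    "count": len(samples),
    "max": max(samples),
    "min": min(samples),
    "mean": statistics.mean(samples),
    "median": statistics.median(samples),
    "std": statistics.pstdev(samples),  # population standard deviation
}
print(aggregates)
```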
Metric Groups
sys
The sys group includes various system metrics.
Two global configuration parameters control the collection of system metrics:
- qhb_os_monitoring — Boolean parameter, off by default. When set to on, it enables periodic collection of system metrics.
- qhb_os_stat_period — the interval in seconds between collections of system statistics; the default is 30 seconds.
Table 48. Load and CPU Metrics
| Name | Description | Type |
|---|---|---|
| sys.load_average.1min | Load average over the last minute | gauge |
| sys.load_average.5min | The same, but over the last 5 minutes | gauge |
| sys.load_average.15min | The same, but over the last 15 minutes | gauge |
| sys.cpu.1min | CPU load percentage over the last minute | gauge |
| sys.cpu.5min | The same, but over the last 5 minutes | gauge |
| sys.cpu.15min | The same, but over the last 15 minutes | gauge |
Note
Load average values have two decimal places of precision; however, due to technical constraints of data transmission, the raw values are multiplied by 100 and sent as integers. When displaying the data, take this into account and divide the received values by 100.
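For example, converting a raw transmitted value back to a load average (the raw value here is hypothetical):

```python
# The collector sends load average values multiplied by 100 as integers.
# To display the real value, divide by 100:
raw_value = 245               # hypothetical integer received from the metric store
load_average = raw_value / 100
print(load_average)           # 2.45
```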
Load average is the average number of running plus waiting threads over the specified interval (1, 5, and 15 minutes). These values are normally printed by the uptime, top, or cat /proc/loadavg commands. See the article Load Average in Linux for details about the features of this metric.
If the 1-minute average is higher than the 5- or 15-minute averages, load is increasing; if it is lower, load is decreasing. However, this metric has little meaning on its own without the total CPU count. The additional sys.cpu.<N>min metrics (derivatives of the load average) show an approximate CPU load percentage that takes the number of CPUs into account; they are calculated using the following simplified formula:
sys.cpu.<N>min = sys.load_average.<N>min / cpu_count * 100
where cpu_count is the number of CPUs in the system and N takes the values 1, 5, or 15.
The CPU count is calculated as the product of the number of physical sockets, cores per socket, and threads per core. The lscpu command shows all the necessary data. For example:
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
In this case, cpu_count = 2 * 4 * 1 = 8.
An alternative and simpler method is to obtain this value with the nproc command.
Thus, in this example, 100% load is reached when the load average approaches 8. However, these calculations are rather approximate, as explained in the article linked above.
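The calculation above can be sketched as follows, using the lscpu example values; the load average input is hypothetical:

```python
# cpu_count, as described above, is the product of sockets, cores per
# socket, and threads per core (values from the lscpu example):
threads_per_core = 2
cores_per_socket = 4
sockets = 1
cpu_count = sockets * cores_per_socket * threads_per_core  # 8

# sys.cpu.1min derived from a hypothetical sys.load_average.1min value:
load_average_1min = 4.0
cpu_1min = load_average_1min / cpu_count * 100
print(cpu_1min)  # 50.0 -> roughly half of the CPUs are busy
```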
Table 49. Memory Usage Metrics
| Name | Description | Type |
|---|---|---|
| sys.mem.total | Total amount of installed RAM | gauge |
| sys.mem.used | Used memory | gauge |
| sys.mem.free | Free memory | gauge |
| sys.mem.available | Memory available for running applications (not including swap, but taking into account potentially freed memory occupied by the page cache) | gauge |
| sys.swap.total | Total size of the swap file | gauge |
| sys.swap.free | Unused swap file memory | gauge |
The metric values correspond to the output fields of the free utility (the values are displayed in kilobytes, i.e. they match the output of free -k).
Table 50. Matching of metric values to output fields of the free utility
| Metric | Utility free field |
|---|---|
| sys.mem.total | Mem:total |
| sys.mem.used | Mem:used |
| sys.mem.free | Mem:free |
| sys.mem.available | Mem:available |
| sys.swap.total | Swap:total |
| sys.swap.free | Swap:free |
You can calculate the value corresponding to the Mem:buff/cache field of the free utility output using the formula:
Mem:buff/cache = Mem:total - Mem:used - Mem:free
Thus, in Grafana you can calculate and display this value based on other available data, using the diffSeries function.
The Mem:shared value (tmpfs virtual file system data) is not displayed via metrics.
The Swap:used value can be calculated using the formula:
Swap:used = Swap:total - Swap:free
This value can also be displayed in Grafana as a calculated value via diffSeries function.
A more detailed description of these indicators is available in the OS manual page for the free utility (run man free).
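The derived values described above (Mem:buff/cache and Swap:used) can be sketched as follows, with hypothetical gauge values in kilobytes:

```python
# Hypothetical sys.mem.* and sys.swap.* gauge values, in kilobytes:
mem_total, mem_used, mem_free = 16_384_000, 6_144_000, 4_096_000
swap_total, swap_free = 8_192_000, 7_168_000

# The two formulas from the text above:
mem_buff_cache = mem_total - mem_used - mem_free   # Mem:buff/cache
swap_used = swap_total - swap_free                 # Swap:used
print(mem_buff_cache, swap_used)
```

In Grafana, the same subtraction is what the diffSeries function performs over the corresponding metric series.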
Table 51. Disk Usage Metrics
| Name | Description | Type |
|---|---|---|
| sys.disk_space.total | Total space of the disk system where the data directory is located, in bytes | gauge |
| sys.disk_space.free | Free space on the disk system where the data directory is located, in bytes | gauge |
Metrics are applied to the disk system where the directory with the database files is located. This directory is determined by the -D command line parameter when starting the database or by the $PGDATA environment variable. The data_directory parameter in the qhb.conf configuration file can override the location of the data directory.
Table 52. Other System Metrics
| Name | Description | Type |
|---|---|---|
| sys.processes | The total number of processes running in the system | gauge |
| sys.uptime | The number of seconds that have passed since the system started | gauge |
db_stat
The db_stat group contains metrics for block reads and writes at the QHB instance level.
Table 53. Block Read and Write Metrics
| Name | Description | Type |
|---|---|---|
| qhb.db_stat.numbackends | Number of active server processes | gauge |
| qhb.db_stat.blocks_fetched | Number of blocks received during reading | counter |
| qhb.db_stat.blocks_hit | Number of blocks found in cache during reading | counter |
| qhb.db_stat.blocks_read_time | Blocks read time, in milliseconds | counter |
| qhb.db_stat.blocks_write_time | Blocks write time, in milliseconds | counter |
| qhb.db_stat.conflicts.tablespace | Number of queries cancelled due to dropped tablespaces | counter |
| qhb.db_stat.conflicts.lock | Number of queries cancelled due to lock timeouts | counter |
| qhb.db_stat.checksum_failures | Number of data page checksum failures | counter |
| qhb.db_stat.conflicts.snapshot | Number of queries cancelled due to old snapshots | counter |
| qhb.db_stat.conflicts.bufferpin | Number of queries cancelled due to pinned buffers | counter |
| qhb.db_stat.conflicts.startup_deadlock | Number of queries cancelled due to deadlocks | counter |
| qhb.db_stat.tuples.returned | Number of tuples fetched via sequential scan | counter |
| qhb.db_stat.tuples.fetched | Number of tuples fetched via index scan | counter |
| qhb.db_stat.tuples.inserted | Number of tuples inserted to the database | counter |
| qhb.db_stat.tuples.updated | Number of tuples updated in the database | counter |
| qhb.db_stat.tuples.deleted | Number of tuples deleted from the database | counter |
Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.db_stat.db_16384.numbackends.
The cache hit rate is calculated based on the qhb.db_stat.blocks_fetched and qhb.db_stat.blocks_hit metrics:
k = blocks_hit / blocks_fetched * 100%
A value above 90% is usually considered good. If the ratio is significantly below this mark, consider increasing the buffer cache size.
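A sketch of the cache hit rate calculation, with hypothetical counter values:

```python
# Hypothetical counter values:
blocks_fetched = 125_000   # qhb.db_stat.blocks_fetched
blocks_hit = 118_750       # qhb.db_stat.blocks_hit

# k = blocks_hit / blocks_fetched * 100%
hit_rate = blocks_hit / blocks_fetched * 100
print(f"{hit_rate:.1f}%")  # 95.0% -- above the 90% guideline
```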
bgwr
The bgwr group contains background writing process metrics.
Table 54. Background Write Process Metrics
| Name | Description | Type |
|---|---|---|
| qhb.bgwr.checkpoints_timed | Number of planned checkpoints that were executed | counter |
| qhb.bgwr.checkpoints_req | Number of requested checkpoints performed out of the scheduled order | counter |
| qhb.bgwr.checkpoint_write_time | Time spent in the checkpoint processing stage when files are written to disk, in milliseconds | counter |
| qhb.bgwr.checkpoint_sync_time | Time spent in the checkpoint processing stage when files are synchronized to disk, in milliseconds | counter |
| qhb.bgwr.buffers_checkpoint | Number of buffers written during checkpoints | counter |
| qhb.bgwr.buffers_clean | Number of buffers written by the background writer | counter |
| qhb.bgwr.maxwritten_clean | Number of times the background writer stopped flushing dirty pages because it was writing too many buffers | counter |
| qhb.bgwr.buffers_backend | Number of buffers written directly by the backend | counter |
| qhb.bgwr.buffers_backend_fsync | Number of times the backend called fsync itself (usually it is handled by the background writer even when the backend is writing itself) | counter |
| qhb.bgwr.buffers_alloc | Number of allocated buffers | counter |
Data for the listed metrics is displayed in the "Checkpoints and Buffer Operations" section of the QHB dashboard. Typically, for a scheduled checkpoint, the start of the checkpoint is recorded first, blocks are then flushed to disk for some time, and at the end of the checkpoint the write duration and data synchronization information are recorded. For scheduled checkpoints, block writes are spread evenly over time, according to the settings, to reduce the impact of this process on overall I/O. For checkpoints requested via a command, block writes happen immediately, without artificial delay.
Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.bgwr.db_16384.buffers_backend.
mem_stat
The mem_stat group contains metrics for tracking the size of the work area memory.
The work area memory can be roughly described as the memory used by internal sorting operations and hash tables. The size of this memory is controlled by the global configuration parameter work_mem: when the limit is exceeded, temporary files on the disk are used instead of RAM. The work_mem parameter limits the memory used by a single operation.
The mem_stat group includes three types of metrics with the common name work_area:
- qhb.mem_stat.work_area shows the total size of the work area for the entire cluster, in bytes
- qhb.mem_stat.db_<database ID>.work_area shows the size of the work area for one database, in bytes
- qhb.mem_stat.<process ID>.work_area shows the size of the work area for one worker process, in bytes
Metrics of the second and third types are created dynamically.
All metrics are of type gauge, meaning their values may decrease as previously occupied memory is freed.
temp_stat
The temp_stat group includes metrics for QHB temporary files and tables.
Table 55. Metrics for temporary files and tables
| Name | Description | Type |
|---|---|---|
| qhb.temp_stat.temp_files | Number of temporary files created during the aggregation period | counter |
| qhb.temp_stat.temp_bytes | Total size of temporary files created during the aggregation period, in bytes | counter |
| qhb.temp_stat.temp_tables | Number of temporary tables created during the aggregation period | counter |
| qhb.temp_stat.temp_table_bytes | Total size of temporary tables created during the aggregation period, in bytes | counter |
Data is collected over the aggregation period of the metric collector. The data is updated after the corresponding operation completes; for example, for a temporary table created with the ON COMMIT DROP clause, statistics are updated after the COMMIT command that deletes the temporary table. In some cases, the space occupied by temporary files and tables can be quite significant.
Database versions of these metrics are also generated. The metric names include the database ID, for example: qhb.temp_stat.db_16384.temp_bytes.
wal
The wal group contains metrics for the write-ahead log archiving process.
Table 56. WAL Archiving Process Metrics
| Name | Description | Type |
|---|---|---|
| qhb.wal.archived | Number of successful write-ahead log file archiving operations. | counter |
| qhb.wal.failed | Number of failed archiving attempts | counter |
| qhb.wal.archive_time | Time spent copying log files, in nanoseconds | counter |
These metrics work if write-ahead log archiving is configured by setting archive_mode to on and defining the archiving command in archive_command.
transaction
The transaction group includes transaction metrics.
Table 57. Transaction Metrics
| Name | Description | Type |
|---|---|---|
| qhb.transaction.commit | Number of transaction commits | counter |
| qhb.transaction.rollback | Number of transaction rollbacks | counter |
| qhb.transaction.deadlocks | Number of deadlocks | counter |
Metric data is collected directly when committing and rolling back transactions at the database cluster level.
The transaction commit ratio is calculated as a percentage of transaction commits to the sum of transaction commits and rollbacks:
k = commit/(commit + rollback)*100%
Typically, the value tends to 100%, since transactions are usually completed successfully. A significant portion of rolled-back transactions may indicate that there are problems in the system.
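A sketch of the commit ratio calculation, with hypothetical counter values:

```python
# Hypothetical counter values:
commit = 9_900      # qhb.transaction.commit
rollback = 100      # qhb.transaction.rollback

# k = commit / (commit + rollback) * 100%
k = commit / (commit + rollback) * 100
print(f"{k:.1f}%")
```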
Deadlocks occur when different sessions mutually wait for locked data to be released. In this case, after the deadlock is automatically detected, one of the transactions is rolled back.
wait
The wait group contains wait event metrics.
This set of metrics fully corresponds to the set of standard wait events. The metrics have the prefix qhb.wait, followed by the wait event class and the wait event name, separated by dots. In the current release, metric names are limited to a maximum of 31 characters. All wait event metrics have the counter type, but the value reflects the total time, in microseconds, that these waits took across all sessions during the collection period.
Note
If there were many user connections in the waiting state during the observation period, the total waiting time could be many times greater than the observation period. For example, if 1,000 sessions spent one second waiting over 10 seconds, the total waiting time would be 1,000 seconds.
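The arithmetic from the note above can be sketched as:

```python
# Total wait time can exceed the observation period when many
# sessions wait concurrently (values from the example in the note):
sessions = 1_000
wait_per_session_s = 1       # each session waited 1 second
observation_period_s = 10

total_wait_s = sessions * wait_per_session_s
print(total_wait_s, total_wait_s > observation_period_s)  # 1000 True
```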
Metrics with the wait event types Lock, LWLock, IO, and IPC are displayed in the "Wait Events" section of the QHB dashboard. The metric values are displayed in microseconds (automatically converted to other units as the values grow). The graphs do not display all wait events, only the five most significant by value on each graph. Different wait events may have very different durations; high variability in the values may indicate emerging problems.
Table 58. The Most Significant Wait Events
| Name | Description |
|---|---|
| Lock.extend | Waiting for relation to be extended. Becomes noticeable when tables are actively growing. The need to allocate new blocks leads to some delays, which are reflected in this metric. |
| Lock.transactionid | Waiting for transaction to be completed. It occurs when a transaction is forced to wait for previous transactions to complete their processing, which also receive confirmation of their completion. |
| Lock.tuple | Waiting for acquiring a lock on a tuple. It occurs when multiple transactions are working on the same data at the same time. |
| LWLock.WALWriteLock | Waiting for WAL buffers to be written to disk. Often it is the leader of this type of wait event, as disk operations are the slowest of this wait event group. |
| LWLock.wal_insert | Waiting for WAL to be inserted into memory buffer. |
| LWLock.buffer_content | Waiting for a data page to be read or written to memory. It occurs and becomes significant during intensive I/O. |
| LWLock.buffer_mapping | Waiting for a data block to be associated with a buffer in the buffer pool. |
| LWLock.lock_manager | Waiting for locks for backend processes to be added or checked. Event becomes significant when transactions are frequent. |
bufmgr
The bufmgr group contains buffer manager metrics related to memory management mechanisms. Metrics are collected during data read operations.
Table 59. Buffer Manager Metrics
| Name | Description | Type | Aggregate |
|---|---|---|---|
| qhb.bufmgr.BufferAlloc | Number of times the buffer was searched | timer | count |
| qhb.bufmgr.BufferAlloc | Total time spent searching for the buffer | timer | sum |
| qhb.bufmgr.happy_path | Number of times the buffer was found immediately | timer | count |
| qhb.bufmgr.happy_path | Total time spent searching when the buffer was found immediately | timer | sum |
| qhb.bufmgr.cache_miss | Number of buffer cache misses | timer | count |
| qhb.bufmgr.cache_miss | Total time spent processing buffer cache misses | timer | sum |
| qhb.bufmgr.disk_read | Number of tuple reads from disk (asynchronous) | timer | count |
| qhb.bufmgr.flush_dirty | Number of times the tuple was flushed to disk (asynchronous) | timer | count |
| qhb.bufmgr.retry_counter | Number of miss re-processings | counter | |
| qhb.bufmgr.strategy_pop_cnt | Number of times a specific buffer popping or eviction strategy was triggered | counter | |
| qhb.bufmgr.strategy_reject_cnt | Number of rejected buffers proposed by the special strategy | counter | |
| tarq_cache.allocate | Number of times the TARQ search was performed | timer | count |
| tarq_cache.allocate | Total time spent on the TARQ search | timer | sum |
| tarq_cache.allocate_new | Number of times the block to be excluded was selected in TARQ | timer | count |
| tarq_cache.rollback | Number of eviction rollbacks in TARQ | timer | count |
| tarq_cache.rollback | Total time spent on eviction rollbacks in TARQ | timer | sum |
| tarq_cache.touch | Total time spent counting popular tuples in TARQ | timer | sum |
lmgr
The lmgr group contains lock manager metrics. Locks are either primary or predicate. Predicate locks are called "locks" for historical reasons; they do not currently lock anything, but are used to track data dependencies between transactions.
The lock manager metrics track the number of busy locks and serializable transactions (transactions with the highest isolation level, Serializable), as well as the number of free slots in the hash tables for each parameter. These metrics are necessary because the sizes of the hash tables storing locks and serializable transactions are limited, and exceeding the limit may cause an out-of-memory message.
The metrics are collected during data read operations.
Table 60. Lock Manager Metrics
| Name | Description | Type |
|---|---|---|
| qhb.lmgr.locks | Number of busy locks | counter |
| qhb.lmgr.locks_available | Number of locks that can still be created | counter |
| qhb.lmgr.proc_locks | Number of busy PGPROC locks | counter |
| qhb.lmgr.proc_locks_available | Number of PGPROC locks that can still be created | counter |
| qhb.lmgr.serializable_xids | Number of active serializable transactions | counter |
| qhb.lmgr.serializable_xids_available | The number of serializable transactions that can still be created | counter |
| qhb.lmgr.pred_locks | Number of busy predicate "locks" | counter |
| qhb.lmgr.pred_locks_available | The number of predicate "locks" that can still be created | counter |
| qhb.lmgr.pred_lock_targets | Number of busy predicate target "locks" | counter |
| qhb.lmgr.pred_lock_targets_available | The number of predicate target "locks" that can still be created | counter |
queryid
The queryid group contains metrics that provide the ability to track the planning and execution statistics of (almost) all SQL statements on the server.
This group is distinctive in that its metrics are created dynamically when the
engine starts processing a new operator. A unique name or query ID is added to
the names of new metrics. For example, the metric qhb.queryid.c92e145f160e7b9e.exec_calls
reflects the execution number for a certain SQL statement. The text of the
statement itself can be obtained from the new system table qhb_queryid
using a query like SELECT * FROM qhb_queryid WHERE qid = 'c92e145f160e7b9e'.
More details on the queryid group metric configuration are described in the Section queryid Metric Configuration.
Table 61. queryid Metrics
| Name | Description | Type |
|---|---|---|
| qhb.queryid.id.plan_calls | Number of times the operator was planned | counter |
| qhb.queryid.id.total_plan_time | Total time spent planning the operator, in nanoseconds | counter |
| qhb.queryid.id.min_plan_time | Minimum time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.id.max_plan_time | Maximum time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.id.mean_plan_time | Mean time spent planning the operator, in nanoseconds | gauge |
| qhb.queryid.id.stddev_plan_time | Standard deviation of operator planning time, in nanoseconds | gauge |
| qhb.queryid.id.exec_calls | Number of times the operator was executed | counter |
| qhb.queryid.id.total_exec_time | Total time spent executing the operator, in nanoseconds | counter |
| qhb.queryid.id.min_exec_time | Minimum time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.id.max_exec_time | Maximum time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.id.mean_exec_time | Mean time spent executing the operator, in nanoseconds | gauge |
| qhb.queryid.id.stddev_exec_time | Standard deviation of operator executing time, in nanoseconds | gauge |
| qhb.queryid.id.shared_blks_hit | Total number of shared block cache hits for this operator | counter |
| qhb.queryid.id.shared_blks_read | Total number of shared block reads for this operator | counter |
| qhb.queryid.id.shared_blks_dirtied | Total number of shared blocks dirtied by this operator | counter |
| qhb.queryid.id.shared_blks_written | Total number of shared blocks written by this operator | counter |
| qhb.queryid.id.local_blks_hit | Total number of local block cache hits for this operator | counter |
| qhb.queryid.id.local_blks_read | Total number of local block reads for this operator | counter |
| qhb.queryid.id.local_blks_dirtied | Total number of local blocks dirtied by this operator | counter |
| qhb.queryid.id.local_blks_written | Total number of local blocks written by this operator | counter |
| qhb.queryid.id.temp_blks_read | Total number of temporary block reads for this operator | counter |
| qhb.queryid.id.temp_blks_written | Total number of temporary blocks written by this operator | counter |
| qhb.queryid.id.exec_rows | The total number of rows retrieved or affected by this operator | counter |
| qhb.queryid.id.blk_read_time | Total time spent by the operator reading blocks, in milliseconds (if track_io_timing is enabled; otherwise 0) | counter |
| qhb.queryid.id.blk_write_time | Total time spent by the operator writing blocks, in milliseconds (if track_io_timing is enabled; otherwise 0) | counter |
| qhb.queryid.id.wal_records | Total number of WAL records generated by this operator | counter |
| qhb.queryid.id.wal_fpi | Total number of full WAL page images generated by this operator | counter |
| qhb.queryid.id.wal_bytes | Total amount of WAL generated by this operator, in bytes | counter |
qss
The qss group contains metrics related to encryption/decryption processes when working with tables processed by QSS.
Table 62. Metrics for qss tables
| Name | Description | Type |
|---|---|---|
| qhb.qss2.wals.encrypt.count | Number of encrypted WAL messages | counter |
| qhb.qss2.wals.encrypt.bytes | Number of encrypted WAL message bytes | counter |
| qhb.qss2.wals.encrypt.time | Time spent encrypting a WAL message | timer |
| qhb.qss2.wals.decrypt.count | Number of decrypted WAL messages | counter |
| qhb.qss2.wals.decrypt.bytes | Number of decrypted WAL message bytes | counter |
| qhb.qss2.wals.decrypt.time | Time spent decrypting a WAL message | timer |
| qhb.qss2.pages.encrypt.count | Number of encrypted data pages | counter |
| qhb.qss2.pages.encrypt.bytes | The number of encrypted data page bytes | counter |
| qhb.qss2.pages.encrypt.time | Time spent encrypting a data page | timer |
| qhb.qss2.pages.decrypt.count | Number of decrypted data pages | counter |
| qhb.qss2.pages.decrypt.bytes | Number of decrypted data page bytes | counter |
| qhb.qss2.pages.decrypt.time | Time spent decrypting a data page | timer |
| qhb.qss2.blobs.encrypt.count | Number of encrypted non-fixed size objects (Rbytea) | counter |
| qhb.qss2.blobs.encrypt.bytes | The number of encrypted non-fixed size object bytes (Rbytea) | counter |
| qhb.qss2.blobs.encrypt.time | Time spent encrypting non-fixed size objects (Rbytea) | timer |
| qhb.qss2.blobs.decrypt.count | Number of decrypted non-fixed size objects (Rbytea) | counter |
| qhb.qss2.blobs.decrypt.bytes | Number of decrypted non-fixed size object bytes (Rbytea) | counter |
| qhb.qss2.blobs.decrypt.time | Time spent decrypting non-fixed size objects (Rbytea) | timer |
These metrics are generated when QSS is running on the server and an
encrypted table is created (CREATE TABLE <table name> USING qss).
QCP Connection Pool Metrics
Connection pool metrics that characterize the pool's operation are described below. The metrics are collected during query execution operations.
Table 63. QCP Connection Pool Metrics
| Name | Description | Type |
|---|---|---|
| qcp.queue | The number of queries in the queue at the moment | gauge |
| qcp.obtain_backend | Time to assign a backend to serve a client query | timer |
| qcp.obtain_backend_failed | Time spent when the maximum backend assignment time for a client query was exceeded | timer |
| <dbms address>.in_use | Number of used connections (backends) to the DBMS | gauge |
Enabling and Disabling Value Recording for Metric Groups and Categories
QHB has a mechanism that allows you to enable and disable metric sending separately for groups and categories. The data directory contains a configuration file, metrics_config.toml, in which each line corresponds to a metric group:
[is_enabled]
bgwr = { basic = true, dbid = false, pid = false }
bufmgr = { basic = true, dbid = false, pid = false }
db_stat = { basic = true, dbid = false, pid = false }
default = { basic = true, dbid = false, pid = false }
lmgr = { basic = true, dbid = false, pid = false }
mem_stat = { basic = false, dbid = false, pid = false }
queryid = { basic = false, dbid = false, pid = false }
sys = { basic = true, dbid = false, pid = false }
temp_stat = { basic = true, dbid = false, pid = false }
transaction = { basic = true, dbid = false, pid = false }
wait = { basic = true, dbid = false, pid = false }
wal = { basic = true, dbid = false, pid = false }
The variables take the values true (sending group metrics is allowed) or false (sending is disallowed).
Note
The qss group is included in the bgwr group.
Table 64. Setting up Value Recording for Metric Groups
| Metric Group | Description | Default |
|---|---|---|
| default | Default group (various metrics not assigned to separate groups) | true |
| sys | Operating system metrics | true |
| db_stat | Block read and write metrics | true |
| bgwr | Background writer metrics | true |
| mem_stat | "Work area" memory size metrics | false |
| temp_stat | Metrics for temporary files and tables | true |
| wal | WAL file archiving metrics | true |
| transaction | Transaction metrics | true |
| wait | Wait event metrics | true |
| bufmgr | Metrics of memory management mechanisms | true |
| lmgr | Lock (main and predicate) mechanisms metrics | true |
| queryid | SQL statement planning and execution metrics | false |
Note
To collect operating system metrics, you must also set the qhb_os_monitoring parameter to on.
To view the list of metric groups and their current state, you can use the SQL function metrics_config_show. An example of calling this function:
SELECT * FROM metrics_config_show();
group_name | glob | dbid | pid
-------------+-------+------+-----
bgwr | t | f | f
bufmgr | t | f | f
db_stat | t | f | f
default | t | f | f
lmgr | t | f | f
mem_stat | f | f | f
queryid | f | f | f
sys | t | f | f
temp_stat | t | f | f
transaction | t | f | f
wait | t | f | f
wal | t | f | f
(12 rows)
To view the current state of metrics for a specific process, you can use the SQL function metrics_transient_config_show. An integer representing the PID is passed as an argument. An example of calling this function:
SELECT * FROM metrics_transient_config_show(1843);
If the metrics configuration for this process has not been changed, the output is the same as the output of the metrics_config_show function.
In addition to editing the configuration file, these values can be changed via functions. The functions can be called with different numbers of arguments. If the kind argument is omitted, the function applies to all metric categories. If the respect_transient_settings argument is omitted, it defaults to false.
Changes are immediately reflected in the configuration file.
Functions for Enabling and Disabling Sending Metrics
metrics_config_enable_group ( group_name: cstring ) → void
metrics_config_enable_group ( group_name: cstring, respect_transient_settings: boolean ) → void
metrics_config_enable_group ( group_name: cstring, kind: cstring, respect_transient_settings: boolean ) → void
Enables sending values for metrics in a given group/category.
metrics_config_disable_group ( group_name: cstring ) → void
metrics_config_disable_group ( group_name: cstring, respect_transient_settings: boolean ) → void
metrics_config_disable_group ( group_name: cstring, kind: cstring, respect_transient_settings: boolean ) → void
Disables sending values for metrics of the specified group/category.
metrics_config_transient_set ( backend_pid: int4, group_name: cstring, option_value: boolean ) → void
metrics_config_transient_set ( backend_pid: int4, group_name: cstring, kind: cstring, option_value: boolean ) → void
Sets a transient (per-backend) value for sending metrics of the given group/category for the specified backend.
Metric categories are expressed by the following values of the kind parameter: 'basic', 'dbid', 'pid'.
The parameter respect_transient_settings (second or third depending on the function option) in the functions metrics_config_enable_group and metrics_config_disable_group can be used to specify whether to give priority to the metric group sending parameters set at the backend level.
Using the metrics_config_transient_set function, you can enable or disable sending the values of a specified metric group and category for a specific backend. The setting lasts for the lifetime of the backend, or until the metrics_config_enable_group or metrics_config_disable_group function is called with respect_transient_settings = false, which extends the effect of the command to backends where per-backend values were set.
When a backend starts with a PID that was previously used, all transient settings for that PID are reset.
Note
Everything said above about disabling the recording of metric groups does not apply to metrics retrieved through SQL functions. When these functions are used, metrics are always recorded.
QHB Metric Dashboards for Grafana
The QHB dashboards for Grafana are located in the repository at the following link.
QHB is supplied with a metric server that records metric data to Graphite. Grafana serves as the interface to these metrics. The current set of dashboards for Grafana are provided as self-documenting samples, from which users can build dashboards that better suit their needs. Of course, the provided dashboards can also be used as is.
Importing Dashboards
The JSON dashboard descriptions were exported from Grafana 6.7.2.
Before importing the JSON descriptions, decide whether the metric names will contain the host name as a prefix. The panels are built with metric names structured this way, and it is recommended to keep this option. Metric names begin with the $server_name variable, whose default value is your_host_name. Before importing, you can replace this value in the JSON files with the name of one of your hosts. Later, via the Grafana interface, you can list in this variable all the host names (comma-separated) from which metrics will be collected; this lets you quickly switch from one host to another when viewing metrics. If this scheme will not be used (metrics are viewed from a single host only), you can instead remove the $server_name prefix from all metric names in the JSON files before importing. However, this is a more laborious option and is not recommended.
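If you choose to replace the default value before importing, a quick way to do it is a simple text substitution (a sketch: the file name dashboard.json and the host name db1 are illustrative, not from the dashboard set):

```shell
# Stand-in for an exported dashboard JSON description (illustrative content):
printf '{"current": {"text": "your_host_name", "value": "your_host_name"}}\n' > dashboard.json

# Replace the default $server_name value with a real host name:
sed -i 's/your_host_name/db1/g' dashboard.json
```

On the real files you would run the sed command over every exported JSON description.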
To import dashboard descriptions, you need to follow these steps:
- In the Dashboards menu of your Grafana site, select Manage.
- In the opened folder and panel list, select an existing folder or create a new one.
- In the selected folder, select Import in the upper right part of the page.
- On the opened page, either click the Upload json file button in the upper right corner and upload the file, or paste the contents of the JSON file into the field under the Or paste JSON heading and click the Load button.
- Then fill in the required parameters and import the JSON description.
Operating System Dashboard
The dashboard presents the main system indicators:
- QHB instance running time;
- Average load;
- RAM usage;
- Memory usage;
- Usage of the disk system where the database directory is located.
QHB Dashboard
The dashboard contains several sections:
- Transactions;
- Reading and writing blocks;
- Wait events;
- Checkpoints and buffer operations;
- WAL archiving.
Each section contains sets of thematic panels that display key indicators.
Note
By default, tags for various events are not set, so you need to manually configure the output of additional annotations:
- add the necessary comments to the metric data using the qhb_annotation function (see Section Annotations);
- configure the output of available annotations in the Grafana dashboard.
QHB Monitoring for Zabbix
This section gives an example of using a Zabbix 6.2 server to monitor a host running QHB and the Zabbix agent.
Required files:
- template_db_qhb.yaml is the "QHB by Zabbix agent" template file, which needs to be imported to the Zabbix 6.2 server;
- template_db_qhb.conf is the file with user parameters of Zabbix agent for polling QHB;
- the qhb/ directory containing the SQL files accessed by the user parameters.
The archive for Zabbix 6.2 or higher is located in the repository at the following link.
Installation
Note
For more information, see the Zabbix documentation on working with agent templates.
- Install the Zabbix agent on the host with QHB.
- Copy the qhb/ directory to the Zabbix agent home directory /var/lib/zabbix. If the zabbix/ directory does not exist in /var/lib/, create it. The qhb/ directory contains the SQL files required for getting metrics from QHB:
# mkdir -p /var/lib/zabbix/qhb
# cd /var/lib/zabbix/qhb
# wget <server>/zabbix/qhb/qhb.tar
# tar -xvf qhb.tar
# chmod -R 707 /var/lib/zabbix/qhb
# rm -rf qhb.tar
- Copy the template_db_qhb.conf file to the Zabbix agent configuration directory /etc/zabbix/zabbix_agentd.d/:
# wget <server>/zabbix/template_db_qhb.conf
- Create a user zbx_monitor with read-only privileges and access to the QHB cluster:
CREATE USER zbx_monitor WITH PASSWORD '<PASSWORD>' INHERIT;
GRANT pg_monitor TO zbx_monitor;
- Edit the qhb_hba.conf file to allow the Zabbix agent to connect. Open the file in a text editor and add the following line:
host all zbx_monitor 127.0.0.1/32 trust
- Restart QHB and Zabbix agent:
# systemctl restart qhb
# systemctl restart zabbix_agentd
- Import the template_db_qhb.yaml file of the "QHB by Zabbix agent" template on the Zabbix server. For more information, see the Zabbix documentation on importing templates.
- Set the macro parameters {$PG.HOST}, {$PG.PORT}, {$PG.USER}, {$PG.PASSWORD}, {$PG.DB} for the host with QHB.
- Attach the "QHB by Zabbix agent" template to the host with QHB.
Collected Parameters
Table 65. Collected Parameters
| Group | Name | Description |
|---|---|---|
| QHB | Bgwriter: Buffers allocated per second | Number of buffers allocated per second |
| QHB | Bgwriter: Buffers written directly by a backend per second | Number of buffers written directly by a backend per second |
| QHB | Bgwriter: Buffers backend fsync per second | Number of times per second the backend had to call fsync itself (usually it is handled by the background writer even when the backend is writing itself) |
| QHB | Bgwriter: Buffers written during checkpoints per second | Number of buffers written during checkpoints per second |
| QHB | Bgwriter: Buffers written by the background writer per second | Number of buffers written by the background writer per second |
| QHB | Bgwriter: Requested checkpoints per second | Number of requested checkpoints that were executed per second |
| QHB | Bgwriter: Scheduled checkpoints per second | The number of scheduled checkpoints that were executed per second |
| QHB | Bgwriter: Checkpoint sync time | Total time spent synchronizing files to disk while processing checkpoints |
| QHB | Bgwriter: Checkpoint write time | Total time spent writing files to disk during checkpoint processing, in milliseconds |
| QHB | Bgwriter: Max written per second | How many times per second the background writer stopped the flushing scan because it had written too many buffers |
| QHB | Status: Cache hit ratio % | Cache hit rate |
| QHB | Status: Config hash | QHB configuration hash |
| QHB | Connections sum: Active | Total number of connections executing queries |
| QHB | Connections sum: Idle | Total number of connections waiting for a new command from the client |
| QHB | Connections sum: Idle in transaction | Total number of connections in transaction state but not executing a query |
| QHB | Connections sum: Prepared | Total number of prepared transactions |
| QHB | Connections sum: Total | Total number of connections |
| QHB | Connections sum: Total % | Total number of connections in percent |
| QHB | Connections sum: Waiting | Total number of waiting transactions |
| QHB | Status: Ping time | Ping time |
| QHB | Status: Ping | Ping |
| QHB | Replication: standby count | Number of standby servers |
| QHB | Replication: lag in seconds | Replication lag from primary server in seconds |
| QHB | Replication: recovery role | Replication role: 1 - recovery still ongoing (standby server mode), 0 - primary server mode |
| QHB | Replication: status | Replication status: 0 - streaming is down, 1 - streaming is up, 2 - primary server mode |
| QHB | Transactions: Max active transaction time | Current maximum duration of an active transaction |
| QHB | Transactions: Max idle transaction time | Current maximum duration of an idle transaction |
| QHB | Transactions: Max prepared transaction time | Current maximum duration of a prepared transaction |
| QHB | Transactions: Max waiting transaction time | Current maximum duration of a waiting transaction |
| QHB | Status: Uptime | Total system uptime |
| QHB | Status: Version | QHB version |
| QHB | WAL: Segments count | Number of WAL segments |
| QHB | WAL: Bytes written | Size of WAL records in bytes |
| QHB | DB {#DBNAME}: Database size | This database size |
| QHB | DB {#DBNAME}: Blocks hit per second | Number of times disk blocks were found already in the buffer cache, so there was no need to read them |
| QHB | DB {#DBNAME}: Disk blocks read per second | Total number of disk blocks read in this database |
| QHB | DB {#DBNAME}: Detected conflicts per second | Total number of queries cancelled due to recovery conflicts in this database |
| QHB | DB {#DBNAME}: Detected deadlocks per second | Total number of deadlocks detected in this database |
| QHB | DB {#DBNAME}: Temp_bytes written per second | The total amount of data written to temporary files by queries in this database |
| QHB | DB {#DBNAME}: Temp_files created per second | The total number of temporary files created by queries in this database |
| QHB | DB {#DBNAME}: Tuples deleted per second | The total number of tuples deleted by queries in this database |
| QHB | DB {#DBNAME}: Tuples fetched per second | The total number of tuples fetched by queries in this database |
| QHB | DB {#DBNAME}: Tuples inserted per second | The total number of tuples inserted by queries in this database |
| QHB | DB {#DBNAME}: Tuples returned per second | The total number of tuples returned by queries in this database |
| QHB | DB {#DBNAME}: Tuples updated per second | The total number of tuples updated by queries in this database |
| QHB | DB {#DBNAME}: Commits per second | Number of committed transactions in this database |
| QHB | DB {#DBNAME}: Rollbacks per second | Total number of rolled back transactions in this database |
| QHB | DB {#DBNAME}: Frozen XID before autovacuum % | Percentage of frozen XIDs before autovacuum |
| QHB | DB {#DBNAME}: Frozen XID before stop % | Percentage of frozen XIDs before stop |
| QHB | DB {#DBNAME}: Locks total | Total number of locks in this database |
| QHB | DB {#DBNAME}: Queries slow maintenance count | Slow maintenance query counter |
| QHB | DB {#DBNAME}: Queries max maintenance time | Maximum duration of a maintenance query |
| QHB | DB {#DBNAME}: Queries sum maintenance time | Total duration of maintenance queries |
| QHB | DB {#DBNAME}: Queries slow query count | Slow query counter |
| QHB | DB {#DBNAME}: Queries max query time | Maximum duration of a query |
| QHB | DB {#DBNAME}: Queries sum query time | Total duration of queries |
| QHB | DB {#DBNAME}: Queries slow transaction count | Slow transaction query counter |
| QHB | DB {#DBNAME}: Queries max transaction time | Maximum duration of a transaction query |
| QHB | DB {#DBNAME}: Queries sum transaction time | Total duration of transaction queries |
| QHB | DB {#DBNAME}: Index scans per second | Number of index scans in this database |
| QHB | DB {#DBNAME}: Sequential scans per second | Number of sequential scans in this database |
| Zabbix source items | QHB: Get bgwriter | Background writer process activity statistics |
| Zabbix source items | QHB: Get connections sum | Collect all metrics from pg_stat_activity |
| Zabbix source items | QHB: Get dbstat | Collect all metrics from pg_stat_database for each database |
| Zabbix source items | QHB: Get locks | Collect all metrics from pg_locks for each database |
| Zabbix source items | QHB: Get queries | Collect all metrics on query execution time |
| Zabbix source items | QHB: Get transactions | Collect transaction execution time metrics |
| Zabbix source items | QHB: Get WAL | The main item for collecting WAL metrics |
| Zabbix source items | DB {#DBNAME}: Get frozen XID | Number of frozen XIDs |
| Zabbix source items | DB {#DBNAME}: Get scans | The number of scans performed on a table/index in this database |
Triggers
Table 66. Triggers
| Name | Description |
|---|---|
| QHB: Required checkpoints occurs too frequently | Required checkpoints occur too frequently |
| QHB: Cache hit ratio too low | Too low cache hit ratio |
| QHB: Configuration has changed | Configuration has changed |
| QHB: Total number of connections is too high | Total number of connections is too high |
| QHB: Response too long | Response time too long |
| QHB: Service is down | QHB is down |
| QHB: Streaming lag with {#MASTER} is too high | Replication lag from primary server is too high |
| QHB: Replication is down | Replication is down |
| QHB: Service has been restarted | QHB uptime is less than 10 minutes |
| QHB: Version has changed | QHB version has changed |
| DB {#DBNAME}: Too many recovery conflicts | Too many conflicts between primary and standby servers during recovery |
| DB {#DBNAME}: Deadlock occurred | Deadlock occurred |
| DB {#DBNAME}: VACUUM FREEZE is required to prevent wraparound | VACUUM FREEZE is required to prevent XID wraparound |
| DB {#DBNAME}: Number of locks is too high | Number of locks is too high |
| DB {#DBNAME}: Too many slow queries | Too many slow queries |
| QHB: Failed to get items | Zabbix has not received data for items in the last 30 minutes |
Setting up Metric Collection
For the dashboards to display metric data, you need to configure a few settings.
Metric Server Configuration
Metric server configuration is described in Chapter Metric Server.
It is recommended to specify the host name on which the metric server is running in the prefix parameter of the metric server's configuration file /etc/metricsd/config.yaml. If you do this for each server, all metrics will be organized hierarchically, and the first level of the hierarchy will be the server level. That is why the metric names in the proposed dashboards contain the $server_name variable. It is assumed that only one database cluster is running on the host.
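For example, the relevant fragment of the metric server configuration might look like this (a sketch only: the exact structure of config.yaml is defined in Chapter Metric Server, and the host name db1.example.com is illustrative):

```yaml
# /etc/metricsd/config.yaml (fragment)
prefix: db1.example.com   # host name becomes the first level of the metric hierarchy
```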
Configuring Database Settings
To configure sending metrics before QHB 1.3.0, you need to set the metrics_collector_id parameter in qhb.conf to the value with which the metric collector is launched, for example 1001. Starting with QHB 1.3.0, instead of metrics_collector_id, the metrics_collector_path parameter is used, which by default has the value @metrics-collector (represents the path to the Unix domain socket); the metric server by default starts at this address.
To configure sending annotations, you need to specify the following parameters in qhb.conf:
- grafana.address — the Grafana address, for example http://localhost:3000
- grafana.token — the token obtained in Grafana at http://localhost:3000/org/apikeys
Example of settings in qhb.conf for sending metrics and annotations:
# Before QHB 1.3.0
# metrics_collector_id = 1001
# Starting with QHB 1.3.0:
metrics_collector_path = '@metrics-collector'
grafana.address = 'http://localhost:3000'
grafana.token = 'eyJrIjoiNGxTaloxMUNTQkFUMTN0blZqUTN6REN6OWI5YjM1MzMiLCJuIjoidGVzdCIsImlkIjoxfQ=='
Note
If you need to write metric data to CSV files, the path to the Unix domain socket file must be specified as the value of the metrics_collector_path parameter, for example, /tmp/metrics-collector.sock. The same value must be specified in the bind_addr parameter of the collection section in the metric server configuration (/etc/metricsd/config.yaml).
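For example, the two settings might be paired as follows (a sketch: the socket path is the example from above, and the exact YAML layout follows Chapter Metric Server):

```
# qhb.conf:
metrics_collector_path = '/tmp/metrics-collector.sock'

# /etc/metricsd/config.yaml, "collection" section:
#   collection:
#     bind_addr: /tmp/metrics-collector.sock
```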
To collect system metrics (the "Operating System" dashboard), you need to set qhb_os_monitoring to on (enabled). You can also set the system statistics collection period qhb_os_stat_period (the default is 30 seconds). It is not recommended to set this parameter too low, since collecting system statistics incurs some overhead.
You can specify the following in the parameters file:
qhb_os_monitoring = on
qhb_os_stat_period = 60 # if the default period of 30 seconds does not work for you
Or execute the commands:
ALTER SYSTEM SET qhb_os_monitoring = ON;
ALTER SYSTEM SET qhb_os_stat_period = 60;
SELECT pg_reload_conf();
Examples of Using Metrics in SQL Functions
In addition to the built-in metrics, users can record their own metrics via the following SQL functions.
Timer Metric Type
It is used to record a period of time; the unit of measurement is nanoseconds.
SELECT qhb_timer_report('qhb.timer.nano',10000000000 /* 10 seconds in nanoseconds */);
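One way to obtain a nanosecond duration for reporting is to measure it in PL/pgSQL (a sketch, assuming PL/pgSQL and the standard timing functions are available in QHB; the metric name and the timed operation are illustrative):

```sql
DO $$
DECLARE
  t0 timestamptz := clock_timestamp();
  elapsed_ns bigint;
BEGIN
  PERFORM pg_sleep(0.5);  -- the operation being timed
  elapsed_ns := (EXTRACT(EPOCH FROM clock_timestamp() - t0) * 1e9)::bigint;
  PERFORM qhb_timer_report('qhb.example.duration', elapsed_ns);
END $$;
```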
Counter Metric Type
It is used when it is necessary to record the number of events that occurred over a period of time.
SELECT qhb_counter_increase_by('qhb.example.counter',10);
Gauge Metric Type
It is used when you need to set an indicator to a certain value or change it by a given amount.
SELECT qhb_gauge_update('qhb.gauge_example.value', 10); /* Set the value */
SELECT qhb_gauge_add('qhb.gauge_example.value',1); /* Increase the value */
SELECT qhb_gauge_sub('qhb.gauge_example.value',1); /* Decrease the value */
Annotations
Annotations are used when you need to add a comment to the metric data. The first parameter of the function is the comment text; the following parameters are tags.
SELECT qhb_annotation('Test execution started', 'test','billing'); /* Annotation text and two tags */