Qbayes

Qbayes — methods of statistical analysis based on Bayes' conditional probabilities.


Extension Purpose

The QHB Qbayes extension is a toolkit created by the QHB team for the tasks of calculating probabilistic models using Bayesian networks.

The source data for various calculations can be stored in tables and views within the QHB database. Such data includes historical facts (hypotheses that were or were not triggered) and the corresponding causes (parameters). The parameters can be either logical (binary) values or scalar quantities, which the extension methods cast to discrete values by dividing them into user-defined ranges. The values of hypotheses are always assumed to be logical (binary).

After preliminary statistical processing of historical facts (training), the model can calculate the conditional probabilities of the same hypotheses under new conditions (new parameter values).


Bayesian Prediction Algorithms

The following algorithms are currently used in the extension.

Naive Bayes Classifier

A simple classifier based on the assumption of independence of parameters.

This type of model is used in training with the parameter model_type = 'naive' (naive Bayes classifier).
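
The underlying math can be illustrated with a short Python sketch. This is not the extension's API or implementation; the data layout and function name are hypothetical, and Laplace smoothing is added here as an assumption to avoid zero probabilities:

```python
def naive_bayes_predict(train, event, hypothesis):
    """Estimate P(hypothesis = True | event) assuming parameter independence.

    train: list of dicts holding parameter values plus a boolean hypothesis field.
    event: dict of parameter values to condition on.
    """
    score = {}
    for h in (True, False):
        rows = [r for r in train if r[hypothesis] is h]
        p = len(rows) / len(train)                       # prior P(H = h)
        for name, value in event.items():
            hits = sum(1 for r in rows if r[name] == value)
            # Laplace smoothing so an unseen value does not zero the product
            p *= (hits + 1) / (len(rows) + 2)
        score[h] = p
    return score[True] / (score[True] + score[False])    # normalize over h
```

Each parameter contributes an independent likelihood factor, which is exactly the "naive" independence assumption named above.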

Classifier Based on Mutual Information Function

Mutual information is a statistical function of two random variables, describing the amount of information contained in one random variable relative to the other.

The method makes it possible to determine the dependent parameters of hypotheses and then use these dependencies to predict the hypotheses.

This type of model is used in training with the parameter model_type = 'mi' (Mutual Information).
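
Mutual information itself is a standard statistical function; a minimal Python sketch of I(X; Y) for two discrete samples (purely illustrative, not the extension's implementation) looks like this:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """I(X; Y) in bits, estimated from two equal-length discrete samples."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)          # marginal counts
    pxy = Counter(zip(xs, ys))                 # joint counts
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Parameters with the highest mutual information against a hypothesis are its natural candidates for dependency edges: I(X; Y) is 0 for independent samples and reaches 1 bit when one binary variable fully determines the other.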

Markov Chain Monte Carlo Methods

A class of algorithms that sample from a probability distribution. In Qbayes, one of these sampling algorithms is used to establish the dependent parameters of hypotheses.

Qbayes currently uses one implementation of the Metropolis-Hastings algorithm; other algorithms may be added in the future.

This type of model is used in training with the parameter model_type = 'mcmc' (Markov chain Monte Carlo).
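
For intuition about how such samplers converge (relevant to the iteration_number parameter described later), here is a minimal Metropolis-Hastings sampler in Python. It is an illustrative sketch, unrelated to the extension's internal implementation: with a symmetric proposal, a candidate y is accepted with probability min(1, w(y)/w(x)), and the sample frequencies approach the normalized distribution as the number of steps grows.

```python
import random

def metropolis_hastings(weight, states, steps, seed=0):
    """Sample from an unnormalized distribution `weight` over discrete `states`.

    Uses a symmetric proposal (uniform over states); weights must be positive.
    """
    rng = random.Random(seed)
    x = rng.choice(states)
    samples = []
    for _ in range(steps):
        y = rng.choice(states)                            # symmetric proposal
        if rng.random() < min(1.0, weight(y) / weight(x)):
            x = y                                         # accept the candidate
        samples.append(x)
    return samples
```

With weights 1 : 3 over two states, the frequency of the heavier state tends to 3/4 as the number of steps grows, which is why a sufficient iteration count matters for an acceptable error.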


Extension Methods

When creating a model, a model description table is created:

CREATE TABLE IF NOT EXISTS qbayes_model_description (
    model_name VARCHAR NOT NULL,
    model_parameters json NOT NULL,
    CONSTRAINT qbayes_model_un UNIQUE (model_name)
);

Then, after training with a statistical sample, a table with data is created:

CREATE TABLE IF NOT EXISTS qbayes_model_data (
    model_name VARCHAR NOT NULL,
    graph_name VARCHAR NOT NULL,
    data json,
    CONSTRAINT qbayes_model_graph_un UNIQUE (model_name, graph_name)
);

Note
The specified tables are created automatically in the current default schema of the database when the corresponding methods are called. If the tables already exist, they are updated. Uniqueness constraints apply to the model name in the first table and to the combination of model name and graph name in the second.


Model Definition

The following method is used to define the model:

CREATE OR REPLACE FUNCTION qbayes_create_model(
    model_name cstring,
    description json)
returns BIGINT AS 'MODULE_PATHNAME' language c volatile security definer;

Return value

If successful, 0 is returned; if unsuccessful, an error is generated.

Options:

  • model_name specifies the model name that must be unique within the model description table.
  • description describes the parameters and hypotheses of the model. Each parameter or hypothesis is described by a json expression.

Fields:

  • "name" — symbolic name of a parameter or hypothesis,
  • "type" — the type of the parameter or hypothesis in the original representation; only the types bool (logical), discrete_integer (discrete, by enumeration), and float8 (floating point, discretized into ranges of values) are supported,
  • "distrib" — the type of a parameter or hypothesis in the model; only types bool or discrete are supported,
  • "ranges" — interval division boundaries for creating a discrete parameter, used only with a parameter of type float8,
  • "acceptable" — values allowed in the enumeration, used only with a parameter of type discrete_integer,
  • "is_hypothesis" — hypothesis flag, the default is false.

Types of Parameters and Hypotheses

All Bayesian algorithms in the extension use only binary and discrete types of model parameters, and hypotheses use only binary types.

To ensure the ability to operate with different types in the original representation, various combinations of the "type" and "distrib" fields in the model definition are allowed.

Binary Parameters
...
"type": "bool",
"distrib": "bool",
...

In this case, the parameter is considered binary in both the external and the internal representation of the model. Both when training the model and when calculating the predicted hypothesis values, the acceptable input values are true and false.

Discrete Parameters When Splitting into Ranges
...
"type": "float8",
"distrib": "discrete",
"ranges": [r0, r1, ... ,rN]

In this case, the parameter is considered discrete in the internal representation of the model. The values in the external representation are numbers of type float8; they are divided into ranges by the boundaries r0, r1, ..., rN, and the internal representation stores the number of the range into which the value falls.

The ranges are specified by the ranges field. The values in its array specify the boundaries of the ranges; the N+1 boundaries r0, ..., rN create N+2 ranges according to the following rules:

0: (-infinity; r0)
1: [r0; r1)
...
N: [r{N-1}; rN)
N+1: [rN; +infinity)

Both when training the model and when calculating the predicted values of hypotheses, values of type float8 are passed to the input parameters; the extension methods themselves determine the range number.
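
The numbering rules above amount to counting how many boundaries lie at or below the value. A Python sketch (illustrative only; the extension performs this mapping internally):

```python
import bisect

def range_index(value, ranges):
    """Map a float8 value to its range number given boundaries r0..rN.

    Range 0 is (-inf; r0), range i is [r(i-1); r(i)), and the last range
    is [rN; +inf), so N+1 boundaries define N+2 ranges.
    """
    # bisect_right counts boundaries <= value, matching the half-open intervals
    return bisect.bisect_right(ranges, value)
```

For example, with boundaries [0.1, 1, 10, 100] the value 18.35 falls into range 3, i.e. [10; 100), and a value equal to a boundary belongs to the range that starts at that boundary.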

Discrete Parameters in Enumeration
...
"type": "discrete_integer",
"distrib": "discrete",
"acceptable": [m0, m1, ... ,mN]

In this case, the parameter is considered discrete in the internal representation of the model, taking a value from 0 to N: the index of the external value in the acceptable field. The external values themselves can be arbitrary integers among the listed m0, m1, ..., mN, not necessarily consecutive or starting with 0.

Both when training the model and when calculating the predicted values of hypotheses, values from the acceptable field are passed to the input parameters; the extension methods themselves determine the discrete value matching the parameter value. If the external value does not match any of the listed m0, m1, ..., mN, an error is generated.
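
The mapping from external values to internal discrete values can be sketched in Python (illustrative only; the extension performs this lookup and error check internally):

```python
def discrete_index(value, acceptable):
    """Map an external integer to its internal discrete value: its index
    in the acceptable list. Raise an error for an unlisted value."""
    try:
        return acceptable.index(value)
    except ValueError:
        raise ValueError(f"value {value} is not in the acceptable list")
```

With "acceptable": [1, 2, 15], the external value 15 becomes internal value 2, while 7 raises an error, mirroring the behavior described above.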

External and Internal Representation of Hypotheses

Only one combination of "type" and "distrib" is allowed for hypotheses:

...
"type": "bool",
"distrib": "bool",
...

Model Training

Data (facts) with the selected parameters can be contained either in already prepared tables (relations) or in views created on top of these tables.

Limitations for tables and views:

  • They must contain a field named value of the json data type that includes at least the fields declared when defining the model. Other fields may exist, but they are ignored during training.
  • Binary or discrete parameters are supported (parameter independence is assumed; the model type used is Naive Bayes).

The training is performed by the following method:

CREATE OR REPLACE FUNCTION qbayes_learn(
  model_name cstring,
  graph_name cstring,
  model_type cstring,
  relation_name cstring,
  iteration_number BIGINT)
returns BIGINT AS 'MODULE_PATHNAME' language c volatile security definer;

Return value

If successful, 0 is returned; if unsuccessful, an error is generated.

Options:

  • model_name specifies the model name;
  • graph_name specifies the graph name;
  • model_type specifies the model type ('naive', 'mi', or 'mcmc'; see Bayesian Prediction Algorithms);
  • relation_name specifies prepared initial data with facts and hypotheses;
  • iteration_number specifies the number of algorithm steps needed to converge to a stationary distribution with an acceptable error.

Note
The iteration_number parameter is meaningful only for the mcmc model type; for the others it is simply ignored.


Additional Model Training

Note
It makes sense to perform additional training only for the mcmc type model; for other types, the additional training function simply does nothing.

CREATE OR REPLACE FUNCTION qbayes_learn_again(
  model_name cstring,
  graph_name cstring,
  model_type cstring,
  relation_name cstring,
  iteration_number BIGINT)
returns DOUBLE PRECISION AS 'MODULE_PATHNAME' language c volatile security definer;

Return value

A floating point number from 0 to 1 is returned, reflecting the degree to which the resulting distribution coincides with the ideal one; it is computed as the average over all hypotheses of the model on the training sample data.

If unsuccessful, an error is generated.

Options

The options are the same as for qbayes_learn, but the distribution is refined over iteration_number additional steps.


Prediction

The calculation of the probabilities of hypotheses for given parameters is performed as follows:

CREATE OR REPLACE FUNCTION qbayes_solve(
  model_name cstring,
  graph_name cstring,
  elementary_event json)
returns json AS 'MODULE_PATHNAME' language c volatile security definer;

Return value

If successful, a json is returned, which lists the pairs:

  • symbolic hypothesis name,
  • probability with which this hypothesis will come true.

If unsuccessful, an error is generated.

Options:

  • model_name specifies the model name;
  • graph_name specifies the graph name;
  • elementary_event specifies a vector of parameters for calculating hypotheses in the form of a json expression.

There is an additional method for calculating the probabilities of hypotheses for a set of parameter vectors. It is fed a set of parameters in the same form as when training the model:

CREATE OR REPLACE FUNCTION qbayes_solve_view(
  model_name cstring,
  graph_name cstring,
  relation_name cstring)
returns setof json AS 'MODULE_PATHNAME' language c volatile security definer;

Note
Additionally, a group of methods for processing a large number of hypotheses is provided: qbayes_learn_mt, qbayes_solve_mt, and qbayes_solve_view_mt. They run in multi-threaded mode and make it possible to speed up training and calculation. In terms of parameters, these methods are identical to the corresponding single-threaded versions.
It should be noted that for a small number of hypotheses there may be no speedup; moreover, due to the overhead of organizing multi-threaded processing, a slowdown may even be observed.


Cleaning

Additionally, methods for cleaning data and models are offered:

CREATE OR REPLACE FUNCTION qbayes_drop_graph(model_name cstring, graph_name cstring)
returns BIGINT AS 'MODULE_PATHNAME' language c volatile security definer;
CREATE OR REPLACE FUNCTION qbayes_drop_model(model_name cstring)
returns BIGINT AS 'MODULE_PATHNAME' language c volatile security definer;

Return value

If successful, 0 is returned; if unsuccessful, an error is generated.



Limitations and Parameters

Limitations

Normally, probabilistic models contain a large number of different parameters and hypotheses. With small models, where the number of parameters and hypotheses is in the tens, there will most likely be no difficulties. When working with large models, it is useful to know a few facts.

The json data type was deliberately chosen for the various data in Qbayes.

This type of description allows a larger number of parameters to be passed than, for example, DBMS tables: a table can have no more than 1600 columns (see Chapter QHB Features and Benefits), while a single json string can hold several thousand or even tens of thousands of parameter and hypothesis descriptions.

However, the json data type also has its limitations:

  • The length of a string must not exceed 1 GB if it is a character string (a varlena constraint).

  • This limit applies to the symbolic representation of the data, so if the data is binary, it is limited to half that size.

These constraints must be met by json strings with descriptions of model parameters, training data, trained models (data fields of the qbayes_model_data table), and prediction results.

In some cases, running the algorithms leads to exponential growth in the resources required for the calculations; such a situation is sometimes called a combinatorial explosion. It can be caused, for example, by too many variants of values for the dependent variables (parameters), even though for a good-quality determination of hypothesis probabilities their main values, which make the greatest contribution, are sufficient. To prevent such situations, Qbayes makes it possible to limit resource consumption using parameters.


Parameters

qbayes.max_dependencies
This parameter is applicable for the mi (Mutual Information) and mcmc (Markov chain Monte Carlo) algorithms. It specifies the maximum number of dependent variables that a hypothesis can have. The default is 5.

qbayes.max_variants
This parameter is applicable for the mi (Mutual Information) and mcmc (Markov chain Monte Carlo) algorithms. It is used to further limit the number of dependent variables that a hypothesis can have: if the number of combinations of dependent variable values exceeds qbayes.max_variants, the number of dependent variables is reduced until the number of combinations of their values fits into qbayes.max_variants. When choosing a value, keep in mind that qbayes.max_variants makes the largest contribution to the allocated memory, consuming qbayes.max_variants × "number of hypotheses" × 16 bytes. The default is 50,000.
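
The sizing formula can be checked with a short Python sketch (the hypothesis counts below are arbitrary examples, not recommended values):

```python
def max_variants_memory(max_variants, n_hypotheses):
    """Memory (in bytes) dominated by qbayes.max_variants, per the formula
    above: max_variants * number-of-hypotheses * 16 bytes."""
    return max_variants * n_hypotheses * 16
```

At the default of 50,000 variants, a model with 100 hypotheses would consume on the order of 80,000,000 bytes (about 76 MiB) for this structure alone, which is worth keeping in mind before raising the limit.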

Note
The parameters are set in the configuration file. See Section Setting Parameters.



Usage Example

You need to install the Qbayes extension:

CREATE EXTENSION IF NOT EXISTS qbayes cascade;

Let's create data for training the model (in the example only one vector of parameters and hypotheses is specified; in reality there should be many):

CREATE OR REPLACE VIEW view1 AS SELECT
'{"size": 15, "volume": 18.35, "is_roundy": true, "h1": false}'::json AS value;

Describe the model:

SELECT qbayes_create_model('MyModel',
  '[{
    "name": "size",
    "type": "discrete_integer",
    "distrib": "discrete",
    "acceptable": [1, 2, 15]
  }, {
    "name": "volume",
    "type": "float8",
    "distrib": "discrete",
    "ranges": [0.1, 1, 10, 100, 1000, 1e4, 1e5, 1e6, 1e7, 1e8]
  }, {
    "name": "is_roundy",
    "is_hypothesis": true,
    "type": "bool",
    "distrib": "bool"
  }, {
    "name": "h1",
    "is_hypothesis": true,
    "type": "bool",
    "distrib": "bool"
  }]'::json);

We can view the model description:

SELECT * FROM qbayes_model_description;

Training:

SELECT qbayes_learn('MyModel', 'MyGraph', 'naive', 'view1', 0);

We can view the model data:

SELECT * FROM qbayes_model_data;

Prediction of one event:

SELECT qbayes_solve('MyModel', 'MyGraph', '{ "size": 15, "volume": 1e6 }'::json );

Predicting multiple events (the example uses the same data array, in reality it may be different):

SELECT qbayes_solve_view('MyModel', 'MyGraph', 'view1');

Cleaning:

SELECT qbayes_drop_graph('MyModel', 'MyGraph');
SELECT qbayes_drop_model('MyModel');

Note
Qbayes comes with other examples with extended sets of parameters and hypotheses, using different methods. These examples can be found in the SQL scripts included with the extension.



Notes, Examples and Options for Working with Qbayes

Predictions on a Large Parameter Set

In terms of obtaining the result, the two ways of sampling hypothesis probabilities presented below are the same.

The first way:

select qbayes_solve_view('Model1', 'Graph1', 'input_data');

Here it is assumed that Model1 and Graph1 are some model and graph, respectively; that is, the training stage has already been completed. It does not matter at all which algorithm was used.

And input_data is a set of parameters by which the calculation of the hypothesis probabilities is made.

The second way:

create function solve_loop() returns setof json
language plpgsql
as $$
declare
  r record;
  rr json;
begin
  for r in (select value from input_data) loop
    select qbayes_solve('Model1', 'Graph1', r.value) into rr;
    return next rr;
  end loop;
end
$$;

select solve_loop();

But these two ways differ in load and execution speed. In terms of speed, the difference can go either way; much depends on how the results are fetched, if there really are a lot of them.

When fetching in the first way, the time spent reading the model description and deserializing the parameter and hypothesis descriptions is reduced: this happens only once. In the second case, reading and deserialization occur on every call of the qbayes_solve function. If the model is not very large and input_data is not huge, this saving may be insignificant.

But in the first case, you can take into account the peculiarities of some clients when working with large samples. They can limit the sampling by wrapping the query in LIMIT and OFFSET clauses, thus fetching the result in portions. Sometimes this leads to the probabilities being recalculated on each scroll to the next portion. Moreover, such restrictions do not work effectively with the process memory context on the server: scrolling may be accompanied by the message "transaction left non-empty SPI stack" in the QHB server syslog.