QDL Data Bulk Direct Loading Module
The direct data loading module (Quantum Data Loader, QDL) is a utility that
allows you to load data from a CSV file into a QHB table according
to the configuration. The operation speed is significantly higher than with
INSERT and COPY commands, due to the use of optimized multithreaded code,
the absence of table locks, and bypassing the transaction core.
Supported Field Types
| Type | Names |
|---|---|
| Integers | smallint, int2, integer, int4, int, bigint, int8 |
| Floating point numbers | float4, real, float8, double precision |
| Serial | smallserial, serial, bigserial |
| Text | character, char, varchar, character varying, text |
| Arbitrary precision numbers | decimal, numeric, dec |
| Boolean | bool, boolean |
| Identifiers | uuid, oid |
| Date/time | timestamp, date |
| Binary | bytea |
Note
The current version of QDL allows NaN (Not a Number) values in the float and numeric data types. In addition to ordinary numeric values and NaN, float also has the special values infinity and -infinity. Note that none of the values in a CSV file are quoted.
UUIDs are written as a sequence of lowercase hexadecimal digits, separated by hyphens into groups, in the following order: a group of 8 digits, followed by three groups of 4 digits, and finally a group of 12 digits, for a total of 32 digits. An example of a UUID in this standard form is: a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11.
Date/time types support the special values infinity, -infinity, and epoch.
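For illustration, a few hypothetical CSV lines showing how these special values appear unquoted (the column layout id;measure;uuid;event_date is an assumption made up for this example):

```csv
1;NaN;a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11;2024-01-15
2;infinity;a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a12;epoch
3;-infinity;a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a13;infinity
```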
Synopsis
qdl [flags] [options] [subcommands]
Flags:
-h
--help
Shows help about qdl command line arguments.
-v
--verbose
Prints qdl debugging information.
-V
--version
Prints the version of qdl.
Options:
-c
--config <configuration_file>
Path to the configuration file of qdl.
-d
--data <data_file>
Path to raw data. Use '-' to read from stdin.
Subcommands:
create_table
Generates a SQL script to create a table.
help
Displays a list of available subcommands or help for specified subcommands.
insert_values
Fills the table file with values.
validate
Validates the CSV file containing the data.
The create_table Command
Generates a SQL script that creates a DBMS table according to the configuration
and fills it with the first three tuples of the CSV file. This is required
to build the structure of the table that will be filled with data. It also lets
you obtain the paths of the database files and their names (the OID required
by the insert_values command).
qdl --config <configuration_file> --data <data_file> create_table [flags]
Flags:
-h
--help
Shows help about create_table command line arguments.
-V
--version
Prints the version of create_table.
Note
In CSV format, all characters are significant. A value in a table that is padded with spaces or any other characters besides the separator will include those characters. This can result in errors when populating the table with data from a system that pads CSV lines with whitespace up to a fixed width. If this problem occurs, process the CSV file and remove the trailing whitespace before loading the data into QHB.
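One possible way to strip such trailing whitespace before loading (a sketch using GNU sed; the file name data.csv and its contents are illustrative):

```shell
# Create a sample file whose lines are padded with trailing spaces
# (illustrative data; in practice data.csv comes from the exporting system).
printf '1;Alice   \n2;Bob  \n' > data.csv

# Remove trailing whitespace from every line, editing the file in place.
sed -i 's/[[:space:]]*$//' data.csv
```

After this, each line ends immediately after its last significant character, so the padding is not read into the last column.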
Usage example:
qdl --config config.yml \
    --data data.csv \
    create_table
The insert_values Command
Loads data from a CSV file into a table in the QHB format. This command is multithreaded; the number of CPU cores allocated for parallel processing is specified in the configuration file. You can also set the threshold percentage of RAM usage at which the cleanup mechanism for over-allocated internal memory is triggered.
The --out-dir <output_directory> parameter is mandatory and must be specified on the command line.
It is also recommended to verify in advance that the QHB page layout version is 4. This can be done, for example, using the page_header function of the pageinspect extension.
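A sketch of such a check, assuming the stock pageinspect extension is installed and the table from the example configuration (t_qdl) already contains at least one page:

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The "version" column of page_header() holds the page layout version;
-- it should report 4 before insert_values is used.
SELECT version FROM page_header(get_raw_page('t_qdl', 0));
```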
qdl --config <configuration_file> --data <data_file> insert_values [options] --out-dir <output_directory> [flags]
Flags:
-h
--help
Shows help about insert_values command line arguments.
-V
--version
Prints the version of insert_values.
Options:
-o
--out-dir <output_directory>
Path to the output directory.
-r
--ram-usage-percent <ram_%>
Percentage of RAM used before the garbage collector runs for internal buffers
(1 to 99; default is 80).
Usage example:
qdl --config config.yml \
--data data.csv \
insert_values \
--out-dir output/
The validate Command
Checks the logical integrity of the CSV file and the consistency of the data structure with the configuration file, reporting errors via the appropriate exit status.
qdl --config <configuration_file> --data <data_file> validate [flags]
Flags:
-h
--help
Shows help about validate command line arguments.
-V
--version
Prints the version of validate.
Usage example:
qdl --config config.yml \
    --data data.csv \
    validate
The Configuration File
The configuration file has the following sections:
- general — basic application configuration. It contains the following fields:
  - threads (integer) — the number of raw data processing threads.
  - chunk_size (integer, optional) — the size of the internal data transfer buffer; this parameter can be changed for optimization on custom hardware configurations.
  - checksum (boolean, optional, the default is true) — specifies the calculation of checksums.
- data_source — data source selection. In this version the only supported format is csv. It contains the following fields:
  - delimiter — column delimiter for CSV.
  - double_quote — if true, then two consecutive quotes are treated as one escaped quote.
  - escape — character used for escaping.
  - comment — character marking the beginning of a comment.
- output — output data configuration. It contains the following fields:
  - segment_size (integer, optional) — the maximum number of pages in a database segment.
  - oid (integer) — the main table ID.
  - toast_oid (integer, optional) — the TOAST table ID.
- table — table configuration. It contains the following fields:
  - name (string) — table name.
  - fields (repeating field) — a description of each field (for example, field_1 : varchar with an optional [nullable] mark).
Example Configuration File
# System configuration:
general:
  # Number of threads allocated for string parsing and serializing.
  # Note that in addition to these threads, two more are always allocated:
  # a thread for reading and a thread for writing to the file.
  threads: 2
  checksum: true

# Raw data configuration:
data_source:
  csv:
    # Column delimiter for CSV:
    delimiter: ";"

# Output data configuration:
output:
  # Object ID. Specifies an integer ID for the group of files in the target table.
  oid: 16385
  # Table ID of the table for storing data bulks
  toast_oid: 16386

# Target table configuration:
table:
  # Table name. It is required to specify the name of the table to be created.
  name: "t_qdl"
  # Columns in the "name": "type" format
  fields:
    id : integer
    social_id : integer [nullable]
    first_name : varchar
    last_name : varchar
    city : varchar (30)
    profession : varchar (30) [nullable]
    salary : double precision
    description : varchar [nullable]
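For the configuration above, the script produced by create_table might look roughly like this (a hypothetical sketch: the exact output of the real generator may differ, and the assumption that fields without the [nullable] mark become NOT NULL columns is ours):

```sql
-- Hypothetical output of "qdl --config config.yml --data data.csv create_table".
CREATE TABLE t_qdl (
    id          integer NOT NULL,
    social_id   integer,
    first_name  varchar NOT NULL,
    last_name   varchar NOT NULL,
    city        varchar(30) NOT NULL,
    profession  varchar(30),
    salary      double precision NOT NULL,
    description varchar
);
```

The real script also inserts the first three tuples of the CSV file and reports the table OID and file paths needed for insert_values.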
Usage Scenario
WARNING!
To avoid access rights issues, all subsequent operations must be performed by the user who runs the database instance.
STEP 0 (optional):
Check the CSV file for consistency with the configuration. This can save a
significant amount of time when converting a large CSV file: if an error is
detected during execution of the insert_values command, that command will have
to be run again after the causes of the error have been corrected.
qdl --config config.yml --data data.csv validate
STEP 1:
Generate a SQL script that creates the table into which you plan to load data,
and then run it in the DBMS. Running the script also returns the OID (required
for the insert_values configuration file) and the paths to the database files.
These operations can be performed manually, but to avoid possible errors it is
recommended to use the script.
qdl --config config.yml --data data.csv create_table
STEP 2:
Execute the SQL script. The OID obtained during execution must be placed in the config.yml configuration file.
STEP 3:
Load the table. This process may take a considerable amount of time, depending on the size of the loaded data. The threads that perform reading, converting, and writing have specific names; this allows the system administrator to assess the load on each individual kernel thread and tune their number to achieve optimal performance. Note also that the internal qdl buffers have no size limit, which means that with very slow disk writes these buffers may overflow the RAM.
qdl --config config.yml --data data.csv insert_values --out-dir ./output_dir/
STEP 4:
Replace DBMS files with those generated using qdl — copy files using operating system commands. After that, you need to execute the following query:
SELECT qhb_attach_qdl_data('table_name');
This resets the table data cache and the corresponding TOAST table (if any), reconnects the table data files, reindexes the TOAST table, and registers data for the CDC (Change Data Capture) feature.
Note
To read from stdin, use the vertical bar (|). Examples of using the commands:
The create_table command:
cat data.csv | qdl --config config.yml --data - create_table
The validate command:
cat data.csv | qdl --verbose --config config.yml --data - validate
The insert_values command:
cat data.csv | qdl --config config.yml --data - insert_values --out-dir output/
Print the help:
qdl -h