Part_Import
Part_Import — partial import of data files from an external database.
Description
This extension contains two functions:
- partial_import(connection_string, table_names) is the main function of the extension; it copies the files containing table data from an external database to a local one. Local tables are those accessible in the current connection session, while external tables are those reached via connection_string. The names of the tables to copy are passed in the table_names parameter, a string array in which each element holds the name of one table. A name may be fully qualified, i.e. include a namespace, or unqualified, i.e. without a namespace; in the latter case the server must be able to resolve the unqualified name to a specific table.
- table_script(oid) returns an SQL script that recreates the table whose OID is passed as the parameter. The script includes indexes, foreign keys, constraints, sequences, and comments.
partial_import
Partial Import Operation Features:
- No locks are taken on external tables; locks on local tables are taken only just before switching to the "new" data files, and they are extremely short. The migrated data of several tables remains mutually consistent.
- Since data files are copied as-is, the import verifies that these files are valid by comparing the table metadata of the external and local databases. The check ensures that the names, types, and order of the columns, including dropped columns, are identical. Column constraints and the column storage type must match. Either both tables or neither of them must have a TOAST table. The number of indexes, their names, and their column lists must match. A tablespace mismatch does not matter. Importing materialized views is also allowed, and partitions can be transferred without issues.
- If a foreign key constraint is involved, scripts will be offered to drop the keys in the local database and restore them after the import completes. If the import transfers not only the table with the foreign key but also the table it references, no conflicts arise and the data remains consistent.
- Since external tables are not locked, it is theoretically possible to perform DDL on an external table in the window between the completion of the metadata check and the start of the data file transfer. In that case, the result of the import is unpredictable.
- Only data files are transferred, so it is recommended to recompute statistics on the imported tables after the import completes.
- Although the import initiates a checkpoint on the local database when it completes, the database could theoretically crash between the switch to the new files and the completion of the checkpoint; in that case, crash recovery may produce unpredictable results for the imported tables' data.
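Recomputing statistics after an import is a plain ANALYZE; a minimal sketch (the table name here is a placeholder for any imported table):

```sql
-- Rebuild planner statistics for a freshly imported table
-- ('public.my_table' is a placeholder name).
ANALYZE public.my_table;
```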
table_script
This function is an adaptation of the pg_dump utility code: the part responsible exclusively for generating scripts for tables, indexes, constraints, and sequences was taken, while all other functionality, dedicated to preparing table contents for dumping and to the dumping itself, was discarded.
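As a usage sketch (assuming a table public.accounts exists), the table's OID can be obtained with a regclass cast:

```sql
-- Return the DDL script for the table with the given OID.
SELECT table_script('public.accounts'::regclass::oid);
```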
Test Cases
partial_import
You can test the extension within a single database instance, but full testing should be done on different instances; as the external one, pick a database that has been in active use for a long time. The latter can be verified by going to the database's data directory and looking at the number and size of files in the pg_xact subdirectory: if there are several files, database activity has been going on for a noticeable time. The instance with the local database (the one the import targets) is better created from scratch.
The local database should have part_import installed, while in the external one it is enough to have the correct binary files of the required version. You can install the extension using the command:
CREATE EXTENSION part_import;
Now we need to create tables in the external database and fill them with data. We will create three tables: type_a and type_b as regular dictionaries, and one more table, accounts, for testing TOAST and for verifying that a table updated with HOT optimization keeps working. After the first stage, a fourth table, many2many_ab, will be added as an "intersection" of type_a and type_b (emulating a many-to-many relationship). To create type_a, type_b, and accounts, we will use the following script (the tables should be created on both the external and the local database servers):
CREATE TABLE public.type_a (
id integer NOT NULL,
type_id smallint NOT NULL,
name_1 character varying(40) NOT NULL,
date_ins date,
name_hash_1 integer NOT NULL
);
ALTER TABLE ONLY public.type_a
ADD CONSTRAINT pk_type_a PRIMARY KEY (id);
CREATE INDEX idx_type_a_name_1 ON public.type_a USING btree (name_1);
CREATE INDEX idx_type_a_name_hash_1 ON public.type_a USING btree (name_hash_1);
CREATE TABLE public.type_b (
id integer NOT NULL,
name character varying(40) NOT NULL,
date_ins date,
name_hash integer NOT NULL
);
ALTER TABLE ONLY public.type_b
ADD CONSTRAINT pk_type_b PRIMARY KEY (id);
CREATE INDEX idx_type_b_name ON public.type_b USING btree (name);
CREATE INDEX idx_type_b_name_hash ON public.type_b USING btree (name_hash);
CREATE TABLE public.accounts (
id integer NOT NULL,
number text,
client text,
amount numeric,
CONSTRAINT amount_is_positive CHECK ((amount >= 0.0))
);
ALTER TABLE ONLY public.accounts ALTER COLUMN number SET STORAGE EXTERNAL;
ALTER TABLE public.accounts ALTER COLUMN id ADD GENERATED BY DEFAULT AS IDENTITY (
SEQUENCE NAME public.accounts_id_seq
START WITH 1
INCREMENT BY 1
NO MINVALUE
NO MAXVALUE
CACHE 1
);
ALTER TABLE ONLY public.accounts
ADD CONSTRAINT accounts_number_key UNIQUE (number);
ALTER TABLE ONLY public.accounts
ADD CONSTRAINT accounts_pkey PRIMARY KEY (id);
Then fill the external database tables with data. For type_a and type_b this can be almost any data; it is only desirable that the id values be sequential starting from 1, and that the data volume be large enough for the storage files of at least one of the tables to span several segments. A rough estimate for type_a is 10 million rows; for type_b, 10 thousand rows are quite enough. The accounts table must be filled so that its rows are certain to land in the TOAST table; there are no special requirements on the row count, since type_a already covers that aspect. To fill and modify the accounts data, we will use the following script (based on Egor Rogov's article):
INSERT INTO accounts(id, "number", client, amount) VALUES(1, '1000', 'masha', 1000.),
(2, '2001', 'vasya', 2000.), (3, '2002', 'misha', 1200.), (4, '2003', 'irina', 1300.);
UPDATE accounts SET client = (
SELECT string_agg( chr(trunc(65+random()*26)::integer), '') FROM generate_series(1,2000)
);
UPDATE accounts SET amount = amount + 100 + id;
UPDATE accounts SET amount = amount + 100 + id;
UPDATE accounts SET amount = amount + 100 + id;
UPDATE accounts SET amount = amount + 100 + id;
After this, the accounts table should contain 4 (visible) rows. The last four UPDATE commands do not change data in indexed columns, so they qualify for the HOT (heap-only tuple) update optimization.
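One possible way to fill type_a and type_b is a generate_series-based script along these lines (a sketch; the generated values are arbitrary, only the volumes matter, and hashtext is an internal PostgreSQL hashing function, so any integer expression would do instead):

```sql
-- ~10 million rows for type_a: sequential ids, arbitrary payload.
INSERT INTO public.type_a (id, type_id, name_1, date_ins, name_hash_1)
SELECT g, (g % 100)::smallint, 'name_' || g, current_date, hashtext('name_' || g)
FROM generate_series(1, 10000000) AS g;

-- 10 thousand rows for type_b.
INSERT INTO public.type_b (id, name, date_ins, name_hash)
SELECT g, 'name_' || g, current_date, hashtext('name_' || g)
FROM generate_series(1, 10000) AS g;
```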
Now we import the data from the tables type_a, type_b, and accounts to the local server, for which we execute the command:
SELECT partial_import('host=<external_db_address> user=<external_db_user> dbname=<external_db_name>', ARRAY['type_a', 'type_b', 'accounts']::text[]);
The first parameter is the connection string for the external database. The user
specified in it must have sufficient privileges to read the system catalogs
and to execute the BASE_BACKUP command.
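On the external server, a role along these lines might be sufficient (a sketch; the role name and password are placeholders, and the REPLICATION attribute is assumed here because BASE_BACKUP is a replication-protocol command):

```sql
-- Hypothetical role for running the import against the external server.
-- pg_hba.conf must also allow this role to connect.
CREATE ROLE import_user LOGIN REPLICATION PASSWORD 'secret';
```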
After the function finishes, the contents of the tables type_a, type_b, and accounts in the local database must match those in the external database. You should also make sure that the queries:
EXPLAIN ANALYZE SELECT * FROM type_a WHERE id = 7;
EXPLAIN ANALYZE SELECT * FROM type_b WHERE id = 7;
EXPLAIN ANALYZE SELECT * FROM accounts WHERE id = 2;
use an index scan to retrieve the data, for example:
Index Scan using pk_type_a on type_a (cost=0.42..8.44 rows=1 width=55) (actual time=0.019..0.020 rows=1 loops=1)
Index Cond: (id = 7)
Planning Time: 0.057 ms
Execution Time: 0.032 ms
and queries:
SELECT * FROM type_a WHERE id = 7;
SELECT * FROM type_b WHERE id = 7;
SELECT * FROM accounts WHERE id = 2;
actually retrieve data (assuming that rows with the specified IDs actually exist).
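To compare the contents between the servers, one quick check is to compute a row count and a deterministic digest on both sides and compare the outputs (a sketch, shown for type_a):

```sql
-- Run on both servers and compare the results.
SELECT count(*) AS n_rows,
       md5(string_agg(id::text, ',' ORDER BY id)) AS id_digest
FROM public.type_a;
```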
The described scenario can be expanded as follows:
- Fill the tables type_a, type_b, and accounts as described above;
- Record the row count of each table;
- Run a script that keeps adding data to the type_a or type_b table, and run the import in parallel;
- After the import completes, make sure that the row count of the table being loaded in parallel is greater than it was at the start of the import (in theory, it should equal the count as of the end-of-snapshot LSN, but an exact count is difficult to obtain).
Let's create a many2many_ab table in the external and local databases using the following script:
CREATE TABLE public.many2many_ab (
ida integer NOT NULL,
idb integer NOT NULL
);
ALTER TABLE ONLY public.many2many_ab
ADD CONSTRAINT pk_many2many_ab PRIMARY KEY (ida, idb);
CREATE INDEX many2many_ab_idb ON public.many2many_ab USING btree (idb);
ALTER TABLE ONLY public.many2many_ab
ADD CONSTRAINT fk_net_ab_type_a_id FOREIGN KEY (ida) REFERENCES public.type_a(id);
ALTER TABLE ONLY public.many2many_ab
ADD CONSTRAINT fk_net_ab_type_b_id FOREIGN KEY (idb) REFERENCES public.type_b(id);
When trying to execute the previous partial import command, namely:
SELECT partial_import('host=<external_db_address> user=<external_db_user> dbname=<external_db_name>', ARRAY['type_a', 'type_b', 'accounts']::text[]);
we should get an error message:
ERROR: Input tables list refer to 2 foreign keys, which not import from remote server.
For guarantee reference integrity you must drop this foreign keys before running
partial import and recreate it when done in an "not valid" state, and than try validate.
The scripts below will help you perform the described actions.
BEGIN;
-- This code will remove all interfering foreign keys --
ALTER TABLE IF EXISTS ONLY public.many2many_ab DROP CONSTRAINT IF EXISTS fk_net_ab_type_a_id;
ALTER TABLE IF EXISTS ONLY public.many2many_ab DROP CONSTRAINT IF EXISTS fk_net_ab_type_b_id;
END;
------------------------------------------------------------------------------------
---- Here you should run the "partial_import" procedure with the same parameters, ----
---- as before. ----
------------------------------------------------------------------------------------
BEGIN;
-- This code will recreate previously deleted foreign keys in the "invalid" state. --
ALTER TABLE ONLY public.many2many_ab
ADD CONSTRAINT fk_net_ab_type_a_id FOREIGN KEY (ida) REFERENCES type_a(id) NOT VALID;
ALTER TABLE ONLY public.many2many_ab
ADD CONSTRAINT fk_net_ab_type_b_id FOREIGN KEY (idb) REFERENCES type_b(id) NOT VALID;
END;
BEGIN;
-- This code will attempt to validate foreign keys that were deleted and then recreated. --
ALTER TABLE IF EXISTS ONLY public.many2many_ab VALIDATE CONSTRAINT fk_net_ab_type_a_id;
ALTER TABLE IF EXISTS ONLY public.many2many_ab VALIDATE CONSTRAINT fk_net_ab_type_b_id;
END;
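Whether the recreated keys have been successfully validated can be checked in the system catalog (a sketch):

```sql
-- convalidated becomes true once VALIDATE CONSTRAINT has succeeded.
SELECT conname, convalidated
FROM pg_constraint
WHERE conrelid = 'public.many2many_ab'::regclass
  AND contype = 'f';
```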
The error message tells us that there are foreign keys whose integrity cannot be guaranteed after the import completes, and it suggests a sequence of actions that allows the import to proceed, at the cost of temporarily losing foreign-key integrity enforcement. However, if we add the table that caused the error to the import list, the import completes without errors:
SELECT partial_import('host=<external_db_address> user=<external_db_user> dbname=<external_db_name>', ARRAY['type_a', 'type_b', 'many2many_ab', 'accounts']::text[]);
The import is resilient to inconsistencies in the physical storage of attributes and in their number or order: rather than producing corrupt data, it fails at the table verification stage. To demonstrate, let's run the following script against the accounts table on the local database:
ALTER TABLE public.accounts ADD amount2 numeric;
ALTER TABLE public.accounts DROP COLUMN amount;
ALTER TABLE public.accounts RENAME COLUMN amount2 TO amount;
ALTER TABLE public.accounts ADD CONSTRAINT amount_is_positive CHECK ((amount >= 0.0));
After executing it, the declared table definitions on the external and local servers look identical, but an attempt to execute the import command:
SELECT partial_import('host=<external_db_address> user=<external_db_user> dbname=<external_db_name>', ARRAY['accounts']::text[]);
will result in an error:
ERROR: Remote table public.accounts has 4 columns, but local one has 5. It is possible that some attributes in one or both of the tables have been deleted earlier.
SQL state: XX000
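The dropped-column residue that causes this mismatch can be inspected in pg_attribute (a sketch):

```sql
-- Dropped columns remain recorded with attisdropped = true,
-- which is why the local table reports 5 columns.
SELECT attnum, attname, attisdropped
FROM pg_attribute
WHERE attrelid = 'public.accounts'::regclass
  AND attnum > 0
ORDER BY attnum;
```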