I have written a python program to parallel restore greenplum backup in dissimilar number of segment instances. This script has all required steps mentioned in this pivotal documentation.

Problem

  • Restoring backup in DCA with different configuration
    • Old method
      • Restoring pg_dump backup (Single Thread)
      • ~350 Gb/Hour (Max)
    • Thoughput
      • Backup size: 29TB
      • Restore time: ~84 Hours

Solution

  • Taking parallel backup using gpcrondump
  • Restoring multiple backup files parallel
    • 1.8 TB / Hour
    • Backup Size: 29 TB
    • Restore time: ~16 Hours

Point to note

  • We should have all the uncompressed backup files in single directory. (Data Domain is best option for this)

Testing

For testing this script, I created a sample function. This function takes two arguments. One is string and another is integer. As you might have understood this function internally runs pg_sleep function on second(integer) argument.

Funtion

1
2
3
4
5
6
7
8
CREATE FUNCTION test_pg_sleep(TEXT, INTEGER) RETURNS VOID AS $$
DECLARE
    name ALIAS for $1;
    delay ALIAS for $2;
BEGIN
    PERFORM pg_sleep(delay);
END;
$$ LANGUAGE 'plpgsql' STRICT;

Backup files

Below are the backup files which contains statements for above function statements. I have given various values in function to sleep.

[~/sample_dumps] [0]
$ ls
gp_dump_-1_1_1234567890           gp_dump_2_10_1234567890           
gp_dump_2_2_1234567890            gp_dump_2_4_1234567890            
gp_dump_2_6_1234567890            gp_dump_2_8_1234567890
gp_dump_-1_1_1234567890_post_data gp_dump_2_1_1234567890            
gp_dump_2_3_1234567890            gp_dump_2_5_1234567890            
gp_dump_2_7_1234567890            gp_dump_2_9_1234567890
[~/sample_dumps] [0]
$ cat *
select test_pg_sleep('master_file',20)

select test_pg_sleep('post_data',10)

select test_pg_sleep('segment_10_file',50)

select test_pg_sleep('segment_1_file',5)

select test_pg_sleep('segment_2_file',10)

select test_pg_sleep('segment_3_file',15)

select test_pg_sleep('segment_4_file',20)

select test_pg_sleep('segment_5_file',25)

select test_pg_sleep('segment_6_file',30)

select test_pg_sleep('segment_7_file',35)

select test_pg_sleep('segment_8_file',40)

select test_pg_sleep('segment_9_file',45)

Running Script

Below are the two sessions, In first session I’m running restore script and in second session I’m monitoring pg_stat_activity table.

Session 1:

python2.7 parallel_restore.py -d icprod -t 1234567890 -u sample_dumps/ -p 6

Session 2:

$ while True
> do
> sleep 1
> psql icprod -c "SELECT pid,query,now() - query_start as Query_Duration from pg_stat_activity where query not ilike '%pg_stat%';"
> done

Demo

Below is live recording of above two sessions. You may want to click on fullscreen for better view.

Help

[gpadmin@mdw ~]$ ./parallel_restore.py --help
Usage:
python2 paralled_restore.py -d <database_name> -u <backup_files_directory> -t <dump_key> -p <number of parallel processes

Options:
  -h, --help            show this help message and exit
  -d DATABASE, --database=DATABASE
                        Specify target database to restore backup
  --host=HOST           Specify the target host
  -t TIMESTAMP, --timestamp=TIMESTAMP
                        Specify the timestamp of backup
  -p PARALLEL, --parallel-processes=PARALLEL
                        Specify number of parallel-processes
  -u DIRECTORY, --directory=DIRECTORY
                        Specify Backup directory

-d, --database

This option is to specify the target database. If target database doesn’t exist in the environment, Script exits immediately.

--host

This option is to specify the target host. The default value is localhost

-t, --timestamp

This option is to specify the backup key generated by gpcrondump utility. This helps to fetch the backup file list.

-p, --parallel

This option is to specify the number of parallel processes to run. The default value is 1.

-u, --directory

This is to specify the backup files directory.

Logging

This script stores its logs in /home/gpadmin/gpAdminLogs and it generates multiple log files.

parallel_restore_<date>.log

This is main log file which stores script progress

parallel_restore_master_<dumpkey>_<date>.log

This is standard log files of master backup restore

parallel_restore_master_<dumpkey>_<date>.error

This is error log file of master backup file restore. It is always recommended to check this log after restore.

parallel_restore_<dump_key>_<date>.log

This is standard log file for segment backup file restore.

parallel_restore_<dump_key>_<date>.error

This is error log file for segment backup file restore. It is always recommended to check this log after restore.

parallel_restore_post_data.log

This is standard log file for post data backup file

Note:

I’m not sharing this script here and I have my reasons for it. I’ll try to share it here ASAP.