WAL restore

Christof25 · 10/07/2019 09:47:42

Hello everyone.

I need to restore a PostgreSQL 10.5 database running under Docker, using Ansible playbooks.

I can already restore from a FULL backup; here are the steps:

== Backup

1.) Hot backup of the database

command: su postgres -c 'pg_basebackup -h postgres -p 5432 -U {{ONL_POSTGRES_USER}} -D /var/lib/postgresql/last_save -Ft -z -P -v -Xf'

2.) Output file
20190709T200010_FULL_postgres_save.tar.gz

== Restore

1.) Stop the database

2.) Extract the archive

tar -xvC {{UNXRESTORE}}/20190709T200010_FULL_postgres_save -f {{UNXRESTORE}}/20190709T200010_FULL_postgres_save.tar.gz ; chmod -R 777 {{UNXRESTORE}}/*

3.) Restore

command: sh -c 'rm -rf /var/lib/postgresql/data/* ; mv {{UNXRESTORE}}/20190709T200010_FULL_postgres_save/* /var/lib/postgresql/data ; chown -R postgres:postgres /var/lib/postgresql/data ; rm -f /var/lib/postgresql/data/recovery.conf'

4.) Startup
Startup is OK and the application is operational.

However, I also have WAL files to restore (they are generated after the FULL backup).

== Backup

command: su postgres -c 'cd {{UNXSAVE}}/{{SAVE_RETENTION}}tmp/{{SAVE_SUBDIR}}/{{environment_type}}/arch ; tar -czvf {{UNXBACKUP}}/{{time_save}}_{{backup_file_prefix}}_postgres_save.tar.gz ./* ; chmod 777 {{UNXBACKUP}}/{{time_save}}_{{backup_file_prefix}}_postgres_save.tar.gz'

Here is the content of a WAL archive:

bash-4.4# mkdir 20190710T000011_WAL_postgres_save
bash-4.4# tar -xvC ./20190710T000011_WAL_postgres_save -f ./20190710T000011_WAL_postgres_save.tar.gz
./0000000100000006000000D7
./0000000100000006000000D8
./0000000100000006000000D9

bash-4.4# ls -altr ./20190710T000011_WAL_postgres_save
total 49376
-rw-------    1 root     root      16777216 Jul  9 19:00 0000000100000006000000D7
-rw-------    1 root     root      16777216 Jul  9 20:00 0000000100000006000000D8
-rwx------    1 root     root      16777216 Jul  9 21:00 0000000100000006000000D9
drwxrwxrwx    4 nobody   nobody        4096 Jul 10 07:38 ..
drwxr-xr-x    2 root     root          4096 Jul 10 07:39 .

Here is my question:
How do I replay these WAL files after restoring a FULL backup?
Do I need to move them into a specific directory and run a command such as 'pg_standby'?

Here are the directories of my database:

bash-4.4# cd /var/lib/postgresql/data
bash-4.4# ls -altr
total 148
-rwxrwxrwx    1 postgres postgres        88 Mar  8 17:07 postgresql.auto.conf
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_twophase
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_tblspc
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_snapshots
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_serial
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_replslot
-rwxrwxrwx    1 postgres postgres      1636 Mar  8 17:07 pg_ident.conf
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_dynshmem
drwxrwxrwx    2 postgres postgres      4096 Mar  8 17:07 pg_commit_ts
-rwxrwxrwx    1 postgres postgres         3 Mar  8 17:07 PG_VERSION
-rwxrwxrwx    1 postgres postgres       178 Apr  3 19:00 recovery.conf.NOK
-rwxrwxrwx    1 postgres postgres     30591 Jun  6 12:21 postgresql.conf
-rwxrwxrwx    1 postgres postgres      5000 Jun  6 12:21 pg_hba.conf
drwxrwxrwx    2 postgres postgres      4096 Jul  2 13:56 pg_stat
-rwxrwxrwx    1 postgres postgres         0 Jul  8 18:00 tablespace_map.old
-rwxrwxrwx    1 postgres postgres       209 Jul  8 18:00 backup_label.old
drwxrwxrwx    2 postgres postgres      4096 Jul  9 13:58 pg_xact
drwxrwxrwx    7 postgres postgres      4096 Jul  9 13:59 base
drwxrwxrwx    4 postgres postgres      4096 Jul  9 13:59 pg_multixact
-rw-------    1 postgres postgres        24 Jul  9 14:00 postmaster.opts
drwxrwxrwx    2 postgres postgres      4096 Jul  9 14:00 pg_notify
-rw-------    1 postgres postgres        94 Jul  9 14:00 postmaster.pid
drwxrwxrwx    2 postgres postgres      4096 Jul  9 14:01 global
drwxrwxrwx    2 postgres postgres      4096 Jul  9 14:10 pg_subtrans
-rw-r-----    1 postgres postgres        64 Jul  9 22:00 current_logfiles
drwx------   19 postgres postgres      4096 Jul  9 22:00 .
drwxrwxrwx    3 postgres postgres      4096 Jul 10 04:20 pg_wal
drwxrwxrwx    4 postgres postgres      4096 Jul 10 04:20 pg_logical
drwxr-xr-x    1 postgres postgres      4096 Jul 10 06:36 ..
drwxrwxrwx    2 postgres postgres      4096 Jul 10 07:42 pg_stat_tmp

Thanks for your help.

rjuju · 10/07/2019 09:55:04

Hello,

pg_standby is a command that exists mainly for replication on PostgreSQL versions that do not have a standby_mode parameter. If you need to replay all the WAL files present in the directory where you extract your archive, you need to use the restore_command parameter; see https://www.postgresql.org/docs/10/cont … R-RECOVERY.

Christof25 · 10/07/2019 11:08:43

Thanks for the tip.
So, if I understand correctly:
1.) I delete all the files in /var/lib/postgresql/data/pg_wal
2.) I put all the WAL files to replay in that same directory.
3.) I create a recovery.conf file, which will be read when the database starts

However, I am having some trouble filling in that file.
In my case, will this be enough?

restore_command = 'cp /var/lib/postgresql/data/pg_wal/%f "%p"'

or do I need to specify another, temporary directory into which I will have extracted the compressed file containing my WAL files?

rjuju · 10/07/2019 11:37:06

1.) I delete all the files in /var/lib/postgresql/data/pg_wal

Yes, the directory should be empty anyway if you use pg_basebackup.

2.) I put all the WAL files to replay in that same directory.

That is exactly the job restore_command is there to do.

3.) I create a recovery.conf file, which will be read when the database starts

Yes

or do I need to specify another, temporary directory into which I will have extracted the compressed file containing my WAL files?

Yes, you must specify the directory where the saved WAL files are located. If all the WAL files are copied manually into the pg_wal directory before starting the instance, that should work too, but it is not the recommended approach.
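
As a rough illustration of the recommended setup (segments kept in a staging directory outside pg_wal, with restore_command pointing at it), here is a minimal sketch. The helper name write_recovery_conf is hypothetical, and the staging path in the usage note is only an example; the sketch targets PostgreSQL 10, which reads recovery.conf from the data directory.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal recovery.conf whose restore_command
# copies WAL segments from a staging directory outside pg_wal.
#   $1 = data directory, $2 = directory holding the extracted WAL segments
write_recovery_conf() {
    pgdata=$1
    wal_dir=$2
    # In restore_command, %f is the file name the server asks for and
    # %p is the path the server wants it copied to.
    printf "restore_command = 'cp %s/%%f \"%%p\"'\n" "$wal_dir" > "$pgdata/recovery.conf"
}
```

For example, `write_recovery_conf /var/lib/postgresql/data /some/staging/dir` run as the postgres user before starting the instance; the file must be readable by that user.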

Christof25 · 24/07/2019 08:37:44

Hello

I was (finally) able to run more tests, but something is still getting stuck....

Restored WAL files

bash-4.4# ls -altr /users/onl00/restore/WAL
total 131656
-rwxrwxrwx    1 root     root      16777216 Jul 23 19:00 000000010000000700000067
-rwxrwxrwx    1 root     root      16777216 Jul 23 20:00 000000010000000700000068
-rwxrwxrwx    1 root     root      16777216 Jul 23 21:00 000000010000000700000069
-rwxrwxrwx    1 root     root      16777216 Jul 23 22:00 00000001000000070000006A
-rwxrwxrwx    1 root     root      16777216 Jul 24 02:00 00000001000000070000006B
-rwxrwxrwx    1 root     root      16777216 Jul 24 03:00 00000001000000070000006C
-rwxrwxrwx    1 root     root      16777216 Jul 24 04:00 00000001000000070000006D
-rwxrwxrwx    1 root     root      16777216 Jul 24 05:00 00000001000000070000006E

The recovery.conf file is in place:

bash-4.4# cat /var/lib/postgresql/data/recovery.conf
restore_command = 'cp /users/onl00/restore/WAL/%f "%p"'

First restore attempt, but startup failed....

bash-4.4# cat postgresql-2019-07-24_081032.log
2019-07-24 08:10:33 CEST [19-1] LOG:  database system was interrupted; last known up at 2019-07-23 20:00:13 CEST
2019-07-24 08:10:33 CEST [19-2] LOG:  starting archive recovery
cp: can't stat '/users/onl00/restore/WAL/000000010000000700000066': No such file or directory
2019-07-24 08:10:33 CEST [19-3] LOG:  invalid checkpoint record
2019-07-24 08:10:33 CEST [19-4] FATAL:  could not locate required checkpoint record
2019-07-24 08:10:33 CEST [19-5] HINT:  If you are not restoring from a backup, try removing the file "/var/lib/postgresql/data/backup_label".
2019-07-24 08:10:33 CEST [1-6] LOG:  startup process (PID 19) exited with exit code 1
2019-07-24 08:10:33 CEST [1-7] LOG:  aborting startup due to startup process failure
2019-07-24 08:10:33 CEST [1-8] LOG:  database system is shut down

bash-4.4# ls /users/onl00/restore/WAL/000000010000000700000066
ls: /users/onl00/restore/WAL/000000010000000700000066: No such file or directory

I check where this WAL file is referenced:

bash-4.4# grep 000000010000000700000066 *
backup_label:START WAL LOCATION: 7/66000028 (file 000000010000000700000066)

bash-4.4# cat backup_label
START WAL LOCATION: 7/66000028 (file 000000010000000700000066)
CHECKPOINT LOCATION: 7/66000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2019-07-23 20:00:13 CEST
LABEL: pg_basebackup base backup

bash-4.4# ls -altr backup_label
-rwxrwxrwx    1 postgres postgres       209 Jul 23 18:00 backup_label

So I delete backup_label and try to start my database:

bash-4.4# cat postgresql-2019-07-24_081548.log
2019-07-24 08:15:48 CEST [20-1] LOG:  database system was interrupted; last known up at 2019-07-23 20:00:13 CEST
2019-07-24 08:15:48 CEST [20-2] LOG:  starting archive recovery
2019-07-24 08:15:48 CEST [20-3] LOG:  ignoring file "tablespace_map" because no file "backup_label" exists
2019-07-24 08:15:48 CEST [20-4] DETAIL:  File "tablespace_map" was renamed to "tablespace_map.old".
2019-07-24 08:15:48 CEST [20-5] LOG:  invalid primary checkpoint record
2019-07-24 08:15:48 CEST [20-6] LOG:  invalid secondary checkpoint record
2019-07-24 08:15:48 CEST [20-7] PANIC:  could not locate a valid checkpoint record
2019-07-24 08:15:48 CEST [1-6] LOG:  startup process (PID 20) was terminated by signal 6
2019-07-24 08:15:48 CEST [1-7] LOG:  aborting startup due to startup process failure
2019-07-24 08:15:48 CEST [1-8] LOG:  database system is shut down

I therefore decide to run a resetwal to make the TP available while waiting for your analysis.

bash-4.4# su postgres
~/log $ pg_resetwal -f /var/lib/postgresql/data
Write-ahead log reset

But there is a problem at startup....

bash-4.4# cat postgresql-2019-07-24_081721.log
2019-07-24 08:17:21 CEST [20-1] LOG:  database system was shut down at 2019-07-24 08:17:05 CEST
2019-07-24 08:17:21 CEST [20-2] LOG:  starting archive recovery
cp: can't stat '/users/onl00/restore/WAL/000000010000000700000067': No such file or directory
2019-07-24 08:17:21 CEST [20-3] WARNING:  WAL was generated with wal_level=minimal, data may be missing
2019-07-24 08:17:21 CEST [20-4] HINT:  This happens if you temporarily set wal_level=minimal without taking a new base backup.
2019-07-24 08:17:21 CEST [20-5] FATAL:  hot standby is not possible because wal_level was not set to "replica" or higher on the master server
2019-07-24 08:17:21 CEST [20-6] HINT:  Either set wal_level to "replica" on the master, or turn off hot_standby here.
2019-07-24 08:17:21 CEST [1-6] LOG:  startup process (PID 20) exited with exit code 1
2019-07-24 08:17:21 CEST [1-7] LOG:  aborting startup due to startup process failure
2019-07-24 08:17:21 CEST [1-8] LOG:  database system is shut down

As a last resort, I then deleted the recovery.conf file and the database was able to start, that is, from the FULL backup and without the WAL files.

Thanks for your help.

Last edited by Christof25 (24/07/2019 08:38:59)

rjuju · 24/07/2019 09:44:55

I don't know how you came to perform these various steps, but you made the worst possible decisions and made sure your restored instance is permanently corrupted.

I advise you to read the documentation on physical backups carefully: https://www.postgresql.org/docs/current … iving.html or to choose one of the backup tools provided by the community.

Christof25 · 24/07/2019 09:50:18

Simply by following your advice..... lol
Anyway, I will read your document about WAL.

Marc Cousin · 24/07/2019 09:56:39

I doubt rjuju told you to run a resetwal, whose documentation explicitly says "It should be used only as a last resort, when the server will not start due to such corruption."

Christof25 · 24/07/2019 10:04:49

Yes, we agree on that, but it was the only command I could find on my own to start over from just a FULL backup and get my database running.
Beyond that, I can retry a WAL restore as many times as needed, if you spot something I might have missed.

In any case, I really appreciate the time you are spending to help me ;-)

rjuju · 24/07/2019 10:19:27

Among the problems I can see:

- WAL files are missing, so their archiving is apparently where the problem lies
- "WARNING: WAL was generated with wal_level=minimal, data may be missing" makes no sense: you should not be able to run pg_basebackup with wal_level set to minimal. Either you are not using pg_basebackup, or the restored WAL files are not the right ones

The documentation covers these points.
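
The first failure above is cp not finding 000000010000000700000066, which is the segment named on the START WAL LOCATION line of backup_label. A pre-flight check along the following lines could catch that before the instance is started. This is only a sketch: the function names first_required_wal and check_first_wal are hypothetical, and it assumes the backup_label layout shown in this thread.

```shell
#!/bin/sh
# Print the segment file name found on the "START WAL LOCATION" line
# of a backup_label file ($1).
first_required_wal() {
    sed -n 's/^START WAL LOCATION: .*(file \([0-9A-F]\{24\}\))$/\1/p' "$1"
}

# Hypothetical pre-flight check: fail if that segment is not present in
# the directory the restore_command will read from.
#   $1 = path to backup_label, $2 = directory holding the restored WAL
check_first_wal() {
    seg=$(first_required_wal "$1")
    if [ -z "$seg" ]; then
        echo "no START WAL LOCATION line found in $1" >&2
        return 2
    fi
    if [ ! -f "$2/$seg" ]; then
        echo "missing $seg in $2" >&2
        return 1
    fi
    echo "found $seg"
}
```

With the paths from this thread, the call would be `check_first_wal /var/lib/postgresql/data/backup_label /users/onl00/restore/WAL`, run in the same environment (same container, same user) as the server.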

Christof25 · 31/07/2019 09:54:10

Hello

I tried to restore once more. Here is the startup error from my database:

bash-4.4# cat postgresql-2019-07-31_094204.log
2019-07-31 09:42:04 CEST [20-1] LOG:  database system was interrupted; last known up at 2019-07-30 20:00:23 CEST
2019-07-31 09:42:04 CEST [21-1] [unknown]@[unknown] LOG:  connection received: host=opencell-00.onl00_opencell port=54880
2019-07-31 09:42:04 CEST [21-2] [unknown]@[unknown] LOG:  incomplete startup packet
2019-07-31 09:42:04 CEST [22-1] [unknown]@[unknown] LOG:  connection received: host=opencell-00.onl00_opencell port=54882
2019-07-31 09:42:04 CEST [20-2] LOG:  starting archive recovery
cp: can't stat '/users/onl00/restore/WAL/0000000100000007000000A3': No such file or directory
2019-07-31 09:42:04 CEST [20-3] LOG:  invalid checkpoint record
2019-07-31 09:42:04 CEST [20-4] FATAL:  could not locate required checkpoint record
2019-07-31 09:42:04 CEST [20-5] HINT:  If you are not restoring from a backup, try removing the file "/var/lib/postgresql/data/backup_label".
2019-07-31 09:42:04 CEST [1-6] LOG:  startup process (PID 20) exited with exit code 1
2019-07-31 09:42:04 CEST [1-7] LOG:  aborting startup due to startup process failure
2019-07-31 09:42:04 CEST [1-8] LOG:  database system is shut down

Yet the WAL file it is looking for is there!

bash-4.4# ls -altr /users/onl00/restore/WAL/0000000100000007000000A3
-rwxrwxrwx    1 postgres postgres  16777216 Jul 30 18:00 /users/onl00/restore/WAL/0000000100000007000000A3

To help, here are the contents of recovery.conf and backup_label:

bash-4.4# cat recovery.conf
restore_command = 'cp /users/onl00/restore/WAL/%f "%p"'


bash-4.4# cat backup_label
START WAL LOCATION: 7/A3000028 (file 0000000100000007000000A3)
CHECKPOINT LOCATION: 7/A3000060
BACKUP METHOD: streamed
BACKUP FROM: master
START TIME: 2019-07-30 20:00:23 CEST
LABEL: pg_basebackup base backup

Thanks for your help

duple · 31/07/2019 10:50:49

Out of curiosity, how many megabytes is the WAL file? Is it compressed?

Christof25 · 31/07/2019 10:54:42

No, not compressed.

Here is the list of WAL files available for my restore:

bash-4.3# ls -altr
total 148112
-rwxrwxrwx    1 postgres postgres  16777216 Jul 30 20:00 0000000100000007000000A3
-rwxrwxrwx    1 postgres postgres  16777216 Jul 30 22:00 0000000100000007000000A5
-rwxrwxrwx    1 postgres postgres  16777216 Jul 30 23:00 0000000100000007000000A6
-rwxrwxrwx    1 postgres postgres  16777216 Jul 31 00:00 0000000100000007000000A7
-rwxrwxrwx    1 postgres postgres  16777216 Jul 31 04:00 0000000100000007000000A8
-rwxrwxrwx    1 postgres postgres  16777216 Jul 31 05:00 0000000100000007000000A9
-rwxrwxrwx    1 postgres postgres  16777216 Jul 31 06:00 0000000100000007000000AA
-rwxrwxrwx    1 postgres postgres  16777216 Jul 31 07:00 0000000100000007000000AB
-rwxrwxrwx    1 postgres postgres  16777216 Jul 31 08:33 0000000100000007000000A4

They are all the same size, uncompressed, and the permissions are OK.
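
Checking such a listing for holes by eye is error-prone, so a small helper can list any segment missing between the lowest and highest one present. This is a sketch under stated assumptions: a single timeline, and the default 16 MB segment size (so the last 8 hex digits run from 00000000 to 000000FF before the middle 8 digits increment). The function name missing_wal_segments is hypothetical.

```shell
#!/bin/sh
# List WAL segment names missing between the lowest and highest segment
# found in directory $1. Assumes one timeline and 16 MB segments.
missing_wal_segments() {
    prev=""
    # Keep only 24-hex-digit names; fixed width makes text order = numeric order.
    for f in $(ls "$1" | grep -E '^[0-9A-F]{24}$' | LC_ALL=C sort); do
        tli=$(printf '%s' "$f" | cut -c1-8)
        log=$(printf '%s' "$f" | cut -c9-16)
        seg=$(printf '%s' "$f" | cut -c17-24)
        # Linear segment number: 256 segments per "log" value.
        n=$(( 0x$log * 256 + 0x$seg ))
        if [ -n "$prev" ]; then
            i=$(( prev + 1 ))
            while [ "$i" -lt "$n" ]; do
                printf '%s%08X%08X\n' "$tli" $(( i / 256 )) $(( i % 256 ))
                i=$(( i + 1 ))
            done
        fi
        prev=$n
    done
}
```

An empty output from `missing_wal_segments /users/onl00/restore/WAL` would mean the sequence is contiguous. It does not prove that the first segment is the one backup_label asks for, nor that the files are valid.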

rjuju · 31/07/2019 10:57:35

Apparently, either the WAL file did not exist when the instance was started, or it is a subtlety of your use of Docker / Ansible, and that is where you should look.

duple · 31/07/2019 12:02:04

Otherwise, if you start PG in single-user mode with some level of debug, what does it say? Is the message perhaps a bit clearer, or is it the same as in the log?
postgres --single -D /path_pgdata/data -d 3 la_base

Christof25 · 31/07/2019 13:41:02

You know what?
Thanks to rjuju for his hint:

rjuju wrote:

or it is a subtlety of your use of Docker / Ansible, and that is where you should look.

Because I looked at the startup playbook for my container, which launches the instance.
And I noticed that a volume was missing: the one containing all the restored WAL files!!! No wonder it could not find them....
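
This kind of mismatch (the file exists on the host but not where the server runs) can be checked from inside the container, as the postgres user, before starting the instance. The sketch below reads the source directory out of a simple cp-based restore_command like the one used in this thread and tests that it is visible. The function names are hypothetical, and the parsing only handles that exact form of restore_command.

```shell
#!/bin/sh
# Print the directory a cp-based restore_command reads from, i.e. the
# text between "cp " and "/%f" in the recovery.conf file given as $1.
restore_source_dir() {
    sed -n "s/^restore_command *= *'cp \(.*\)\/%f .*/\1/p" "$1"
}

# Hypothetical check: succeed only if that directory exists and is
# readable in the current environment (container, user).
check_restore_source() {
    dir=$(restore_source_dir "$1")
    if [ -z "$dir" ]; then
        echo "could not parse restore_command in $1" >&2
        return 2
    fi
    if [ ! -d "$dir" ] || [ ! -r "$dir" ]; then
        echo "$dir is not a readable directory here (volume not mounted?)" >&2
        return 1
    fi
    echo "$dir is visible"
}
```

Running `check_restore_source /var/lib/postgresql/data/recovery.conf` in the container that starts PostgreSQL, rather than on the host, is the point of the check.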

Anyway, after a few changes, I reran a complete FULL + WAL restore, and here is what the startup log contains:

bash-4.4# vi postgresql-2019-07-31_132103.log
2019-07-31 13:21:03 CEST [20-1] LOG:  database system was interrupted; last known up at 2019-07-30 20:00:23 CEST
2019-07-31 13:21:03 CEST [20-2] LOG:  starting archive recovery
2019-07-31 13:21:03 CEST [20-3] LOG:  restored log file "0000000100000007000000A3" from archive
2019-07-31 13:21:04 CEST [20-4] LOG:  redo starts at 7/A3000028
2019-07-31 13:21:04 CEST [20-5] LOG:  consistent recovery state reached at 7/A3000130
2019-07-31 13:21:04 CEST [1-6] LOG:  database system is ready to accept read only connections
2019-07-31 13:21:04 CEST [20-6] LOG:  restored log file "0000000100000007000000A4" from archive
2019-07-31 13:21:04 CEST [20-7] LOG:  record with incorrect prev-link 0/0 at 7/A4000028
2019-07-31 13:21:04 CEST [20-8] LOG:  redo done at 7/A3000130
2019-07-31 13:21:04 CEST [20-9] LOG:  restored log file "0000000100000007000000A3" from archive
cp: can't stat '/users/onl00/restore/WAL/00000002.history': No such file or directory
2019-07-31 13:21:05 CEST [20-10] LOG:  selected new timeline ID: 2
2019-07-31 13:21:05 CEST [20-11] LOG:  archive recovery complete
cp: can't stat '/users/onl00/restore/WAL/00000001.history': No such file or directory
2019-07-31 13:21:05 CEST [1-7] LOG:  database system is ready to accept connections
2019-07-31 13:21:06 CEST [34-1] [unknown]@[unknown] LOG:  connection received: host=opencell-00.onl00_opencell port=36402
2019-07-31 13:21:06 CEST [34-2] [unknown]@[unknown] LOG:  incomplete startup packet
2019-07-31 13:21:06 CEST [35-1] [unknown]@[unknown] LOG:  connection received: host=opencell-00.onl00_opencell port=36404
2019-07-31 13:21:36 CEST [36-1] [unknown]@[unknown] LOG:  connection received: host=opencell-00.onl00_opencell port=36484
2019-07-31 13:21:36 CEST [36-2] mdponlxxx@onl01 LOG:  connection authorized: user=mdponlxxx database=onl01
2019-07-31 13:21:36 CEST [36-3] mdponlxxx@onl01 LOG:  duration: 0.388 ms
2019-07-31 13:21:36 CEST [36-4] mdponlxxx@onl01 LOG:  duration: 0.021 ms
2019-07-31 13:21:36 CEST [36-5] mdponlxxx@onl01 LOG:  duration: 0.097 ms
......

Are the messages about the .history files important?

My application started without any problem, with no manual intervention this time.

Thanks again for your help, in any case.

Forums PostgreSQL.fr
