Fix return code when backup to remote rgw fails

In the database backup framework (_backup_main.sh.tpl), the backup_databases function exits with code 1 if the store_backup_remotely function fails to send the backup to the remote RGW. This causes the pod to fail and be restarted by the cronjob, over and over, until the backoff retries limit (6 by default) is reached. Each retry creates another copy of the same backup on the file system.

In addition, the default k8s behavior is to delete the job/pods once the backoff limit has been exceeded, which makes the failure more difficult to troubleshoot (although we may have logs in Elasticsearch).

This patch changes the return code to 0 so that the pod will not fail in that scenario. The error logs generated should be enough to flag the failure (via Nagios or whatever alerting system is being used).

Change-Id: Ie1c3a7aef290bf6de4752798821d96451c1f2fa5
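The exit-code behavior described in the commit message can be sketched roughly as follows. This is a simplified illustration, not the actual helm-toolkit code: `store_remote`, `backup_exit_code`, and the `REMOTE_UP` variable are hypothetical stand-ins, and the real script calls `exit` rather than returning from a function.

```shell
#!/bin/bash
# Hypothetical stand-in for store_backup_remotely: succeeds only when
# REMOTE_UP is set to 1, so a remote failure is easy to simulate.
store_remote() {
  [ "${REMOTE_UP:-0}" = "1" ]
}

# Hypothetical helper returning the status the backup job would exit with.
# $1 is 1 if the local backup succeeded, anything else if it failed.
backup_exit_code() {
  if [ "$1" != "1" ]; then
    # No usable local backup exists: a non-zero status lets the job retry.
    echo "ERROR: local backup failed" >&2
    return 1
  fi
  if ! store_remote; then
    # The local archive is intact, so log the error but return 0. A retry
    # would only write another copy of the same backup to disk.
    echo "ERROR: local backup succeeded but remote upload failed" >&2
    return 0
  fi
  return 0
}
```

With `REMOTE_UP=0`, calling `backup_exit_code 1` prints the ERROR line to stderr and returns 0, so under these assumptions only the log output signals the remote failure.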
2020-06-30 16:22:08 +00:00 · 2020-06-30 16:22:08 +00:00 · 1508324ce7
parent b1e66fd308
commit 1508324ce7
1 changed file with 4 additions and 1 deletion
--- a/helm-toolkit/templates/scripts/db-backup-restore/_backup_main.sh.tpl
+++ b/helm-toolkit/templates/scripts/db-backup-restore/_backup_main.sh.tpl
@@ -346,7 +346,10 @@ backup_databases() {
      echo "Backup archive size: $ARCHIVE_SIZE"
      echo "=================================================================="
      set -x
-      exit 1
+      # Because the local backup was successful, exit with 0 so the pod will not
+      # continue to restart and fill the disk with more backups. The ERRORs are
+      # logged and alerting system should catch those errors and flag the operator.
+      exit 0
    fi

    #Only delete the old archive after a successful archive