== Backups ==
We do not operate traditional backup hardware at TRACC. This is mostly because tape backups (or even traditional hard drive backups) would take weeks to restore onto our primary Lustre file system, which has a capacity of 1000 TB and may be expanded in the future as needed. To recover much more efficiently from a major disaster, we operate a secondary standby Lustre file system of the same size (also potentially subject to future expansion). New Lustre features allow us to create a snapshot of the primary file system at regular intervals (e.g. once or twice per week). The data in the snapshot is then synchronized to the standby Lustre file system over the course of a day or two, ensuring that this secondary copy of all user data is consistent as of the specific time when the snapshot was taken.
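
For illustration, the sketch below shows what one such snapshot-and-sync cycle could look like in script form. All file system names, mount points, and snapshot names are hypothetical, and the exact <code>lctl snapshot_*</code> options depend on the Lustre version and backing storage in use, so this is a conceptual outline rather than the actual TRACC procedure.

<syntaxhighlight lang="python">
#!/usr/bin/env python3
"""Illustrative sketch of a snapshot-and-sync cycle (not the actual TRACC scripts).

All names below (Lustre fsname, mount points, snapshot naming) are hypothetical,
and the lctl snapshot commands assume a ZFS-backed Lustre with snapshot support;
the exact options may differ between Lustre versions.
"""
import subprocess
from datetime import date

FSNAME = "primary"                   # hypothetical fsname of the production file system
SNAP_NAME = f"snap-{date.today()}"   # e.g. snap-2023-12-04
SNAP_MOUNT = "/mnt/lustre-snapshot"  # hypothetical client mount point of the snapshot
STANDBY = "/mnt/lustre-standby/"     # hypothetical mount point of the standby file system


def run(cmd):
    """Run one command, echo it, and abort the cycle if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# 1. Create a consistent, read-only snapshot of the primary file system.
run(["lctl", "snapshot_create", "-F", FSNAME, "-n", SNAP_NAME])

# 2. Mount the snapshot targets so the snapshot can then be mounted (read-only)
#    on a client at SNAP_MOUNT; the client-side mount itself is elided here.
run(["lctl", "snapshot_mount", "-F", FSNAME, "-n", SNAP_NAME])

# 3. Replicate the snapshot contents onto the standby file system.
#    --delete keeps the standby an exact mirror of the snapshot's point in time.
run(["rsync", "-a", "--delete", SNAP_MOUNT + "/", STANDBY])

# 4. Release the snapshot targets; old snapshots are pruned separately once
#    their retention period expires.
run(["lctl", "snapshot_umount", "-F", FSNAME, "-n", SNAP_NAME])
</syntaxhighlight>

The key point is that the synchronization reads from a frozen, read-only snapshot, so the standby copy always reflects one self-consistent point in time even though the transfer itself takes a day or two.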

The snapshots on the primary file system remain available for a limited time and can be used to recover recent user data (for example, after a user accidentally deletes important files). The snapshots will not allow recovery if the file system encounters a fatal error. This is highly unlikely due to triple redundancy on the Lustre file systems: up to 3 out of 12 disks in each storage group may fail without causing data loss, and failing disks are a normal operating condition; they are simply replaced and rebuilt on the fly during normal operations (this takes at most a day, and users will not notice it). We may shut down file system operations when 2 disks fail at the same time, to allow an extra margin of safety while the underlying storage is rebuilt. This should be a very rare event and has not been encountered in 15 years of operating similar systems, but data integrity is of utmost importance for our operations.
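
As a small, hypothetical illustration of that recovery path (the snapshot mount point and file paths below are made up), restoring an accidentally deleted file essentially amounts to copying it back out of a mounted read-only snapshot:

<syntaxhighlight lang="python">
# Hypothetical example: copy a deleted file back from a mounted (read-only)
# snapshot into the live file system. Paths are illustrative only; the actual
# snapshot mount locations on TRACC may differ.
import shutil

snapshot_copy = "/mnt/lustre-snapshot/home/alice/results.dat"
live_location = "/lustre/home/alice/results.dat"

shutil.copy2(snapshot_copy, live_location)  # copy2 also preserves timestamps and permissions
</syntaxhighlight>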

Should we encounter a fatal error with the primary Lustre file system despite all of the above-mentioned protective measures, we can simply move the standby file system into production (possibly losing a few days' worth of user data written since the last synchronization). The standby system can then be replicated onto a fresh primary Lustre file system, and the file systems are swapped back after a few weeks once this rebuild synchronization is complete. We do not expect this to ever happen, but we need to be prepared for speedy disaster recovery due to the high value of the data stored on our systems.

Currently, the standby file system operates in the same physical space as the cluster itself. We are considering placing the standby file system in a remote location (very likely the Argonne Enterprise Data Center) to further decrease the potential for catastrophic data loss. The feasibility of providing sufficient network bandwidth will be evaluated first.