BorgBackup is the best. We take VM snapshots, then borg ships them off-site; everything is encrypted client-side, before it goes over the wire or touches the destination filesystem. It’s very configurable, so you can tweak it to your liking.
We also sync all the borg repos to Backblaze every week for extra protection.
borg is love
borg is life
It fits nearly every use case. Duplicati is nice too and also has Windows clients, whereas borgbackup is mostly a *nix CLI tool (abstraction layers like borgmatic are CLI-oriented as well). Web UIs and macOS/Linux frontends do exist for borg too, but I haven’t played with them.
If you aren’t looking for a native Windows tool (i.e. one that runs without cygwin/WSL), borgbackup would be my first choice as well.
beagle already replied, so I’d only add, as an example, an excerpt from a scheduled systemd backup job I run at night on one of my boxes:
feb 20 00:36:21 borg[1312]: Archive name: mount-2020-02-20T00:34:46
feb 20 00:36:21 borg[1312]: Archive fingerprint: daf59bfe21880e9f13467e967bc9f9af151cbbe4f1d9a2f3a40b4666fcfcd810
feb 20 00:36:21 borg[1312]: Time (start): Thu, 2020-02-20 00:34:48
feb 20 00:36:21 borg[1312]: Time (end): Thu, 2020-02-20 00:36:18
feb 20 00:36:21 borg[1312]: Duration: 1 minutes 30.02 seconds
feb 20 00:36:21 borg[1312]: Number of files: 92146
feb 20 00:36:21 borg[1312]: Utilization of max. archive size: 0%
feb 20 00:36:21 borg[1312]: ------------------------------------------------------------------------------
feb 20 00:36:21 borg[1312]: Original size Compressed size Deduplicated size
feb 20 00:36:21 borg[1312]: This archive: 84.64 GB 61.02 GB 193.55 MB
feb 20 00:36:21 borg[1312]: All archives: 4.16 TB 2.72 TB 53.66 GB
feb 20 00:36:21 borg[1312]: Unique chunks Total chunks
feb 20 00:36:21 borg[1312]: Chunk index: 104036 5739196
feb 20 00:36:21 borg[1312]: ------------------------------------------------------------------------------
feb 20 00:36:21 systemd[1]: [email protected]: Succeeded.
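For completeness, the job itself is just a timer-driven oneshot unit along these lines (the paths, repo URL, and schedule here are placeholders, not my actual config):

```ini
# /etc/systemd/system/backup-mount.service -- hypothetical example
[Unit]
Description=borg backup of /mnt

[Service]
Type=oneshot
# Passphrase handling elided; see BORG_PASSCOMMAND in the borg docs
ExecStart=/usr/bin/borg create --stats ssh://backup@example.org/./repo::mount-{now} /mnt

# /etc/systemd/system/backup-mount.timer
[Timer]
OnCalendar=*-*-* 00:30:00

[Install]
WantedBy=timers.target
```

borg expands `{now}` itself, which is where archive names like `mount-2020-02-20T00:34:46` come from.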
As you can see in this example, without dedup and compression I’d need 4.16 TB of space. If we double-check the folder used by borgbackup for this repo on the destination server:
$ du --si
54G ./data/0
54G ./data
54G .
Deduplication is also an optional/experimental feature of some filesystems, XFS and BTRFS for example. There’s also VDO, promoted lately by Red Hat, but it has its caveats and it’s off-topic I guess.
Essentially. If your entire backup set comes to 12GB each day but 10GB of those files haven’t changed since the last backup, then they won’t be backed up again. So your backups would use 14GB of space after 2 days, rather than 24GB.
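That arithmetic can be sketched in a few lines of Python (the file names and sizes are made up, and real tools compare mtimes/checksums rather than a version number):

```python
# Toy model of an incremental backup: only files that are new, or whose
# content changed since the previous run, get stored again.
# Sizes are hypothetical, in GB; the version number stands in for content.
def incremental_size(current, previous):
    return sum(size for name, (size, version) in current.items()
               if previous.get(name) != (size, version))

day1 = {"photos": (6, 1), "docs": (4, 1), "mail": (2, 1)}  # 12 GB total
day2 = {"photos": (6, 1), "docs": (4, 1), "mail": (2, 2)}  # only mail changed

full_two_days = sum(s for s, _ in day1.values()) + sum(s for s, _ in day2.values())
incremental_two_days = sum(s for s, _ in day1.values()) + incremental_size(day2, day1)
print(full_two_days, incremental_two_days)  # 24 vs 14 GB
```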
Sorry, I should’ve given a better explanation, aha. What I just described was incremental backups. Let’s say you have two blocks which both contain the same executable, or text file, or whatever. Despite having that file twice at the origin, only a single copy is physically stored when backed up, while the other is linked to it (in a sense; I’m trying to keep things simple). So rather than storing the same file twice, it’s stored once, which saves on space.
If a backup is incremental, then only differences from the last backup are considered; if a backup, or a storage device, or a filesystem is deduplicated, then duplicated blocks or files are found and we’re storing only one copy, plus multiple references to that one copy, as needed.
Let’s say you have 10 copies of the same photo in a folder, and each photo occupies 10MB. If you’re performing a plain backup of your folder, then you’re storing 10 copies of that photo on some remote storage: you’re going to need at least 100MB on that device.
Assume you’re adding a different photo to that folder, occupying 15 MB.
If we’re still performing plain backups and we want to keep both the previous and the current state, we’ll need 100MB+115MB.
With an incremental backup, instead, we only store what changed since the last backup we performed. So we’re going to need 100MB+15MB.
Now, meet dedup.
If we perform a deduplicated incremental backup, we’re going to need just 10MB+15MB.
Now, dedup implementations differ here and there; anyway, borg splits each file into chunks and compares chunks within the same repo, even if that repo hosts backups from different machines. Filenames and timestamps are not considered.
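A stripped-down sketch of that idea in Python: chunks are hashed, and each unique chunk is stored only once, no matter which file (or filename) it came from. Borg actually uses content-defined chunking with much larger chunks and a keyed hash; the fixed 4-byte chunks and file names here are just for the demo.

```python
import hashlib

def dedup_store(files, chunk_size=4):
    """Split each file's bytes into fixed-size chunks and keep each
    unique chunk once, keyed by its SHA-256 hash."""
    store = {}
    for data in files.values():
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            store[hashlib.sha256(chunk).hexdigest()] = chunk
    return store

# Same content under two different names: filenames don't matter,
# only the chunks do, so nothing extra is stored for the copy.
files = {"a.txt": b"hello world!", "copy-of-a.txt": b"hello world!"}
store = dedup_store(files)
total = sum(len(d) for d in files.values())   # 24 bytes at the origin
stored = sum(len(c) for c in store.values())  # 12 bytes in the repo
print(total, stored)
```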
This sounds pretty cool, I’ll definitely look into it more. I mean, storage is cheap AF, but especially when you think bigger (the likes of Plex) I can imagine this being very interesting. Thank you for taking the time to break it down for me.
I’m using “Linux ISOs” as a stand-in for what Plex servers actually ingest. Deduplication works regardless of the platform, but it won’t have any effect if all files are entirely unique (as most, if not all, media files are).