Backups: Implementing Zstandard compression instead of Gzip?

Hello everyone,

I recently discovered Zstandard (abbreviated to zstd below) compression, which is an extremely fast (almost real-time) compression algorithm with decent compression performance.
I have a proposal concerning Yunohost backup archives.

Proposal

In Yunohost backups, replace (tar.)gzip archives with zstd archives.

Advantages

  • (Several times) Faster backups
  • (Several times) Faster restore
  • Slightly smaller backups
  • Potentially less CPU load (it finishes faster)

Drawbacks

  • This new format might confuse some users (who may need to learn new commands, even if they are very similar to the gzip ones)
  • Would it break some app scripts? Custom user scripts? External tools?

Motivation

Archiving and decompressing backups in Yunohost takes time, especially on low-powered hardware (such as a Raspberry Pi).
Yunohost still uses the gzip algorithm, which is now a lot slower than other standard algorithms and not that good in terms of compression ratio. Using a more modern and efficient algorithm could give our users faster backup & restore while using less disk space.

Technical details - and why choose Zstandard

Compared to gzip, according to this benchmark https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/ (privacy warning: Facebook-related link), it has a similar compression ratio but compresses about 5 times faster and decompresses about 3.5~4 times faster.
This benchmark also compares it with other common algorithms, such as lz4, zlib and xz.
Short summary: lz4 is lightning fast but worse in compression ratio, while zstd is fast enough (around 7 s/GB); zlib is slower; xz compresses more but is almost 100 times slower.

Example: a 291MB backup from my Yunohost gives a 68.6MB tar.gz (after ~10 seconds), a 52.7MB tar.xz (in ~1 min), and a 62.8MB tar.zst (after ~1.5 s, about 8% smaller than gzip).
NB: this test was done on a fast desktop computer, with a fast CPU and fast storage. The times would be a lot longer on most of our users' hardware (Raspberry Pi, …).
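For anyone who wants to reproduce a comparison like the one above, here is a rough sketch (prefix each tar command with `time` to measure speed). It assumes GNU tar plus the gzip, xz and zstd CLIs are installed; `mydata/` is an illustrative directory, not a real Yunohost backup:

```shell
# Create some compressible sample data to archive.
mkdir -p mydata && seq 1 100000 > mydata/sample.txt
tar -czf backup.tar.gz mydata                   # gzip
tar -cJf backup.tar.xz mydata                   # xz: smaller, much slower
tar -I 'zstd -1' -cf backup.tar.zst mydata      # zstd at level 1: fastest
ls -l backup.tar.gz backup.tar.xz backup.tar.zst   # compare sizes
```

The exact sizes and timings depend heavily on the data, so treat this only as a way to test on your own backups.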

Also, Zstandard seems to be supported by all modern distributions, though I'm unsure of its exact support status on Debian.

Which compression level is best?

Here is a summary based on personal benchmarks and some internet research (I can provide details if needed).

  • Compression ratio is very similar across most compression levels (from 1 to 22). A significant bonus is observed for levels higher than 10~15, but the speed cost is very high for little gain. I'd suggest keeping it at level 1, to be as fast as possible.
  • Decompression speed is the same no matter the compression level.
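A quick way to check the level-vs-speed trade-off yourself is a sketch like this (it assumes the `zstd` CLI; the sample file is illustrative). On text-like data, higher levels typically shave off only a little size while taking much longer:

```shell
# A compressible ~3.4MB test file.
seq 1 500000 > sample.txt
# Compress at a low, a middle and a high level, keeping the input.
for level in 1 10 19; do
    zstd -q -k -$level -o sample.$level.zst sample.txt
done
ls -l sample.txt sample.1.zst sample.10.zst sample.19.zst   # compare sizes
```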

Note: in recent versions of yunohost (maybe in testing?), backup archives are not compressed at all.

Is that supposed to be the case in production in the future?
If yes, why? Compression saves us a lot of disk space (which is expensive/limited for most users).
And if it's a matter of speed, I'd think that this algorithm could solve the problem.

The rationale is the following: when we're talking about "big" yunohost backups (say > 1~10GB), the bulk of the data usually corresponds to multimedia files, which are already compressed, so trying to compress these again is a huge waste of time/CPU (the algorithm desperately tries to compress something that won't shrink) for no benefit. We have seen many people reporting that their backup/restore takes a huge amount of time, which is probably related to this.

If disk space really is a concern and users are confident that compression is relevant in their case, it's still possible to compress after a backup and uncompress before a restore (and if that's really relevant, we could add some action to perform this easily from the webadmin or so; but honestly the real answer is just to go for Borg integration in the core)
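For reference, the manual compress-then-uncompress workflow could look like the sketch below, assuming the `zstd` CLI is installed; `my_backup.tar` here is a stand-in created for the example, not a real archive from /home/yunohost.backup/archives/:

```shell
# Create a stand-in for a freshly created, uncompressed backup archive.
echo "dummy payload" > data.txt
tar -cf my_backup.tar data.txt
# After the backup: compress and remove the plain .tar.
zstd -q --rm my_backup.tar            # -> my_backup.tar.zst
# Before a restore: decompress back to a plain .tar.
zstd -dq --rm my_backup.tar.zst
tar -tf my_backup.tar                 # the archive round-trips intact
```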


There are also database files, which (I believe) compress well and can be quite big. But maybe all backup exports are already compressed?
edit: I did a quick test: a basic, non-Yunohost MySQL database export from my server went from 2GB to 600MB, which is decent (multiply this by, for instance, 10 weekly backups, and that's something). It took 3 seconds.
Another test: my Wallabag export (including images for each article) is about 1.8GB, and 1.4GB as a tar.gz (1.3GB as a tar.zst). Proba

Zstandard has a built-in heuristic which avoids compressing data that cannot be compressed (enough). This is supposed to have minimal time and CPU cost (instead of trying hard to compress what cannot really be compressed).
Based on the issue you reported, I suppose that's not the case for gzip.
If that's right, maybe that's the intermediate solution?
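This behavior is easy to observe with a small sketch (it assumes the `zstd` CLI; sizes are illustrative): on incompressible input, zstd essentially stores the data as-is, almost instantly, with only a tiny overhead.

```shell
# ~1MB of incompressible (random) data, the worst case for any compressor.
head -c 1000000 /dev/urandom > random.bin
zstd -q -k -o random.bin.zst random.bin
# Output size is approximately the input size: zstd did not waste effort.
wc -c random.bin random.bin.zst
```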

And in my case (a Raspberry Pi with slow storage), I noticed that backup times are mainly due to slow storage and simultaneous reads and writes on the same storage, rather than CPU time (though it's still an important factor).

In my case it would be a real concern: saving a few hundred MB per backup is really worth it. But I understand that it's indeed better to put effort into Borg integration than into workarounds. Still, instead of having to (un)compress manually, could it be possible to disable compression by default but keep it as an option of the backup command? (I suppose it would require little effort on the development side, but I'll let you judge.)

Apparently yes, this was implemented because we thought somebody would come and ask for this sooner or later …

Note that this is still only .tar.gz though … Thing is, .tar.gz support comes out of the box with the tar python lib, and I doubt (but could be wrong) that more "exotic"/less-known algorithms like zst are supported. Hence, to properly integrate this, we would have to dig deeper into how the library works, and naively that's not trivial (- oooor write the full archive to disk and then compress it, which feels not optimal at all).
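Not how the Python code works, but as a shell-level illustration: tar can stream straight through zstd, so the uncompressed archive never has to hit the disk (a sketch assuming GNU tar and the `zstd` CLI; names are illustrative):

```shell
# Some sample content to archive.
mkdir -p demo && echo "hello" > demo/file.txt
# Stream the tar output through zstd: no intermediate .tar on disk.
tar -cf - demo | zstd -q > demo.tar.zst
# Decompress back through a pipe and list the contents.
zstd -dcq demo.tar.zst | tar -tf -
```

Recent GNU tar versions also accept a `--zstd` flag that does the same thing in one command, though I'm not sure which Debian releases ship a new enough tar for that.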


That’s good news ! :tada:

Ok, I thought it was done with bash scripts (I forgot yunohost used Python); in that case it would have been trivial, as it would just be a matter of one argument in the tar command.

If anyone is willing to implement such a feature, I’d be happy to help beta testing, but I understand it’s unlikely to happen.
(And meanwhile I could do it manually with zstd if needed)


For the record: backups are no longer compressed since Yunohost 4.1 that is out now: YunoHost 4.1 release / Sortie de YunoHost 4.1


btw, what about using borg (GitHub - borgbackup/borg: Deduplicating archiver with compression and authenticated encryption)?
Some of the goals are:

  • Space efficient storage (Deduplication based on content-defined chunking)
  • All data can be protected using 256-bit AES encryption; data integrity and authenticity are verified using HMAC-SHA256
  • Compression, with algorithms like zstd
  • Borg can store data on any remote host accessible over SSH
  • Backups mountable as filesystems via FUSE

There's an app (well, two): borg_ynh and borgserver_ynh

The integration in the core is ~ongoing and should be done in the next 6 months …


I did some tests with uncompressed archives compared to compressed ones (thanks a lot to @Maniack_Crudelis for implementing this in archivist). Here is the detailed result: Allow to choose the compression algorithm by maniackcrudelis · Pull Request #12 · maniackcrudelis/archivist · GitHub
TL;DR: a single weekly backup is ~4GB bigger (compressed backup size: roughly 5GB), almost twice as big. If we extrapolate this, 3 weekly backups + 1 monthly one would result in a ~16GB increase in storage space.
In my own case, I can't afford such a big storage loss; in particular it prevents me from doing any further backup, because I'm out of space with another 9GB backup (I'll have to remove an older one before creating the new one, and I'm not a big fan of that).
The full backup seems to be only a few minutes longer with compression enabled (though I did not compare precisely), for ~10 backups including 4 big ones (Nextcloud, Synapse, Pixelfed, WordPress). Thanks Zstandard :slight_smile:

I won't argue against disabling compression by default; I understand the reason behind this choice, and I am not in a good position to say whether it was worth it or not.
But couldn't we keep an option for that? :pray:

Yes, c.f. YunoHost 4.1 release / Sortie de YunoHost 4.1 - #15 by Maniack_Crudelis

It's not really documented anywhere in the doc for now, though; that could be a small, easy contribution if anybody's interested ¯\_(ツ)_/¯

But (as indicated in this comment) this feature is buggy.

In addition, it doesn't give us the choice on a per-backup basis. That's not a problem for me, but I still don't get why it's not an option of the backup command to keep the old behavior (compression).

Indeed … but then that’s a bug and that should be fixed …

If you're referring to the "old" --no-compress option, that option was somewhat misleading with respect to gzipped/not-gzipped. What it actually did was create an entire "raw" backup directory (so basically a backup that is not even .tar-ed). This behavior should still be reproducible with the --method copy option … but honestly I just don't know how useful that is (and it's not really well tested)

Alternatively, regarding compressing backups, you can still run the gzip command manually after creating the backup:

yunohost backup create my_backup_name [other option]
gzip /home/yunohost.backup/archives/my_backup_name.tar

and that's it … Of course, we could have better integration to offer this choice in the webadmin … But honestly, if you really have issues with backup sizes, you should really consider using borg (and yeah, at some point soon™ we'll have borg in the core, as we've been saying for like 3 years lol … :/)

That's so great. Borg is great, but a bit daunting to set up properly, with pruning and such.


To play devil's advocate here: storage is so cheap now that, for most users, faster is better than smaller.

I know I can buy a 3 TB drive for about $100 US.

When things move to Borg, we can have deduplication and pruning, which helps a lot.

If a user is seriously worried about disk space, I would suggest Borg backup with Borgmatic (GitHub - witten/borgmatic: Simple, configuration-driven backup software for servers and workstations) to help with automating pruning and such. Borgmatic is to Borg what docker-compose is to Docker.

I am not referring to this one (and I never used it).

That's what I'm going to do for manual backups. For archivist ones it's now included, but I'll still need to manually delete the tar backups. And any of my (automated) backups will use around 20 extra GB (for a >5GB compressed backup), which means I'd need to always keep ≥25GB of free space just to be able to back up… I have been running fine with ~15GB of free space on my VPS for almost a year; now I'll need to migrate to another VPS (and thus reinstall everything) or do the backup manually :frowning:

My point was that the old behavior could have been kept as an option (as it was already developed and tested, I suppose it would not have been difficult… but I'm not the dev here), while using no compression by default.
That global setting is at least something (worse than a per-backup choice, but still useful), but it's really hidden :confused:

But for a lot of users, increasing their storage space is not simple, or not really possible.
And this upgrade adds the need for (much) more storage, which Yunohost users possibly had not planned for before buying their servers.