Backups: Implementing Zstandard compression instead of Gzip?

Hello everyone,

I recently discovered Zstandard (abbreviated below as zstd) compression, which is an extremely fast (almost realtime) compression algorithm with a decent compression ratio.
I have a proposal concerning Yunohost backup archives.

Proposal

In YunoHost backups, replace (tar.)gzip archives with zstd archives.

Advantages

  • (Several times) Faster backups
  • (Several times) Faster restore
  • Slightly smaller backups
  • Potentially less CPU load overall (the compression finishes sooner)

Drawbacks

  • This new format might confuse some users (who may need to learn new commands, even though they are very similar to the gzip ones)
  • Would it break some app scripts? Custom user scripts? External tools?

Motivation

Archiving and decompressing backups in YunoHost takes time, especially on low-powered hardware (such as a Raspberry Pi).
YunoHost still uses the gzip algorithm, which is now a lot slower than other standard algorithms, and not that good in terms of compression ratio. Using a more modern and efficient algorithm could give our users faster backup & restore, while using less disk space.

Technical details - and why choose Zstandard

According to this benchmark https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/ (privacy warning: Facebook-related link), compared to gzip it has a similar compression ratio but compresses about 5 times faster and decompresses about 3.5~4 times faster.
This benchmark also compares it with other common algorithms, such as lz4, zlib and xz.
Short summary: lz4 is lightning fast but has a worse compression ratio, while zstd is fast enough (around 7 s/GB); zlib is slower; xz compresses more but is almost 100 times slower.

Example: a 291 MB backup from my YunoHost gives a 68.6 MB tar.gz (after ~10 seconds), a 52.7 MB tar.xz (in ~1 min), and a 62.8 MB tar.zst (after ~1.5 s, about 10% smaller than gzip).
NB: this test was done on a fast desktop computer, with a fast CPU and storage. The times would be a lot longer on most of our users' hardware (Raspberry Pi, …).
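
If anyone wants to reproduce this kind of comparison on their own archive, something like the following should do it (the file name is a placeholder; xz and zstd need to be installed):

time gzip -k my_backup.tar    # -> my_backup.tar.gz
time xz -k my_backup.tar      # -> my_backup.tar.xz
time zstd -k my_backup.tar    # -> my_backup.tar.zst (zstd's default level is 3)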

Also, Zstandard seems to be supported in all modern distributions, yet I'm unsure of its support status on Debian.

Which compression level is best?

Here is a summary based on personal benchmarks and some internet research (I can provide details if needed).

  • Compression ratio is very similar across most compression levels (from 1 to 22). A significant bonus is observed for levels higher than 10~15, but the speed cost is very high for little gain. I'd suggest keeping it at level 1, to be as fast as possible (see the command below).
  • Decompression speed is the same no matter the compression level.
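
For reference, zstd has a built-in benchmark mode that makes this kind of comparison easy to redo on your own data (the file name is a placeholder):

zstd -b1 -e19 my_backup.tar    # benchmark compression levels 1 to 19 on that file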
5 Likes

Note: in recent versions of YunoHost (maybe in testing?), backup archives are not compressed at all.

1 Like

Is that supposed to be the case in production in the future?
If yes, why? Compression saves us a lot of disk space (which is expensive/limited for most users).
And if it's a matter of speed, I'd think that this algorithm could solve the problem.

The rationale is the following: when we're talking about "big" YunoHost backups (say > 1~10 GB), the bulk of the data usually corresponds to multimedia files, which are already compressed - and therefore trying to compress these is a huge waste of time/CPU (because the algorithm desperately tries to compress something) for no benefit. We saw many people reporting that their backup/restore takes a huge amount of time, which is probably related to this.

If disk space really is a concern and users are confident that compression is relevant in their case, it's still possible to compress/uncompress after/before a backup (and if that's really relevant, we could have some action to perform this easily from the webadmin or so - but honestly the real answer is just to go for borg integration in the core).
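
For example, with zstd that manual route would look something like this (just a sketch - the archive name is a placeholder, and the archive has to be uncompressed again before yunohost backup restore can read it):

zstd --rm /home/yunohost.backup/archives/my_backup.tar        # compress, then remove the original .tar
zstd -d --rm /home/yunohost.backup/archives/my_backup.tar.zst # uncompress again before restoring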

2 Likes

There are also database files, which (I believe) compress well and can be quite big. But maybe all backup exports are already compressed?
edit: I did a quick test; a basic, non-YunoHost MySQL database export from my server went from 2 GB to 600 MB, which is decent (multiply this by, for instance, 10 weekly backups, and that's something). It took 3 seconds.
Another test: my Wallabag export (including images for each article) is about 1.8 GB, and 1.4 GB as a tar.gz (1.3 GB with tar.zst). Probably because the images are already compressed.
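
In case anyone wants to run the same kind of test on a database export, something like this does the job (the database name is a placeholder):

mysqldump my_database | zstd > my_database.sql.zst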

Zstandard has a built-in heuristic which avoids compressing data that cannot be compressed (enough). This is supposed to have minimal time and CPU cost (instead of trying to compress what cannot really be compressed).
Based on the issue you reported, I suppose that's not the case for gzip.
If that's right, maybe that's the intermediate solution?
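
An easy way to see that behaviour locally, with any incompressible file (the size here is arbitrary):

head -c 500M /dev/urandom > random.bin
time zstd -k random.bin   # finishes quickly, random.bin.zst is barely larger than the input
time gzip -k random.bin   # noticeably slower on the same incompressible data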

And in my case (Raspberry Pi, with slow storage) I noticed backup times are mainly due to slow storage and simultaneous reads and writes on the same storage, rather than CPU time (yet it's still an important factor).

In my case it would be a real concern: saving a few hundred MB per backup is really worth it, but I understand that it's indeed better to put effort into Borg integration than into a workaround. Yet, instead of having to (un)compress manually, could it be possible to disable compression by default but keep it as a backup command option? (I suppose it would require little effort on the development side, but I'll let you judge.)

1 Like

Apparently yes, this was implemented because we thought somebody would come and ask for this sooner or later …

Note that this is still only .tar.gz though … Thing is, .tar.gz is handled automatically by the Python tar library, and I suspect (but could be wrong) that more "exotic"/less-known algorithms like zstd are not supported. Hence, to properly integrate this, we would have to dig deeper into how the library works, and naively that's not trivial (- oooor write the full archive on disk and then compress it, which feels not optimal at all).
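
For what it's worth, here is a rough sketch of how the "compress afterwards" idea could avoid writing the full uncompressed archive to disk, by streaming tar's output straight into zstd (shell version only; the paths are placeholders and this is not what YunoHost currently does):

tar -C /home/yunohost.backup/tmp/my_backup -cf - . | zstd -1 > my_backup.tar.zst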

2 Likes

That's good news! :tada:

OK, I thought it was done with bash scripts (I forgot YunoHost uses Python), in which case it would have been trivial, as it's just a matter of one argument in the tar command (see below).
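
Namely, something like this (assuming a recent GNU tar with zstd installed; on older tar versions, -I zstd does the same; the paths are placeholders):

tar --zstd -cf my_backup.tar.zst -C /path/to/backup/files .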

If anyone is willing to implement such a feature, I'd be happy to help with beta testing, but I understand it's unlikely to happen.
(And meanwhile I could do it manually with zstd if needed.)

1 Like

For the record: backups are no longer compressed since YunoHost 4.1, which is out now: YunoHost 4.1 release / Sortie de YunoHost 4.1

1 Like

Btw, what about using GitHub - borgbackup/borg: Deduplicating archiver with compression and authenticated encryption?
Some of the goals are:

  • Space efficient storage (Deduplication based on content-defined chunking)
  • All data can be protected using 256-bit AES encryption, data integrity and authenticity is verified using HMAC-SHA256
  • compression like: zstd
  • Borg can store data on any remote host accessible over SSH
  • Backups mountable as filesystems via FUSE

There's an app (well, two): borg_ynh and borgserver_ynh

The integration in the core is ~ongoing and should be done in the next 6 months …

5 Likes

I did some tests comparing uncompressed archives to compressed ones (thanks a lot to @Maniack_Crudelis for implementing this into archivist). Here is the detailed result: Allow to choose the compression algorithm by maniackcrudelis · Pull Request #12 · maniackcrudelis/archivist · GitHub
TL;DR: a single weekly backup is ~4 GB bigger (compressed backup size: roughly 5 GB), almost twice as big. If we extrapolate, 3 weekly backups + 1 monthly one would result in a ~16 GB increase in storage space.
In my own case, I can't afford such a big storage loss - in particular it prevents me from doing any further backup, because I'm out of space with another 9 GB backup (I'll have to remove an older one before creating the new one; I'm not a big fan of that).
The full backup seems to be only a few minutes longer with compression enabled (but I did not compare that precisely), for ~10 backups including 4 big ones (Nextcloud, Synapse, Pixelfed, WordPress). Thanks Zstandard :slight_smile:

I won't argue against disabling compression by default; I understand the reason behind this choice and I'm not in a good position to say whether it was worth it or not.
But couldn't we keep an option for that? :pray:

Yes, cf. YunoHost 4.1 release / Sortie de YunoHost 4.1 - #15 by Maniack_Crudelis

Not really documented anywhere in the doc for now though, but that could be a small easy contribution if anybody's interested ¯\_(ツ)_/¯

But (as indicated in this comment) this feature is buggy.

In addition, it doesn't give us the choice on a per-backup basis. That's not a problem for me, but I still don't get why it's not an option in the backup command to keep the old behaviour (compression).

Indeed … but then that's a bug and that should be fixed …

If you're referring to the "old" --no-compress option, that option was ~misleading w.r.t. gzipped/not-gzipped. What this option actually did was to create an entire "raw" backup directory (so basically a backup not even .tar-ed). This behaviour should still be reproducible with the --method copy option … but honestly I just don't know how useful this is (and it's not really well tested).

Alternatively, regarding compressing backups, you may still run manually the gzip command after creating the backup:

yunohost backup create my_backup_name [other option]
gzip /home/yunohost.backup/archives/my_backup_name.tar

and that's it … Of course, we could have better integration to offer this choice in the webadmin … But honestly, if you really have issues with backup sizes, you should really consider using borg (and yeah, at some point soon™ we'll have borg in the core, as we've been saying for like 3 years lol … :/)

That's so great. Borg is great, but a bit daunting to set up properly with pruning and such.

1 Like

To play devil's advocate here: storage is so cheap now that, for most users, faster is better than smaller.

I know I can buy a 3 TB drive for about $100 US.

When things move to Borg, we can have deduplication and pruning, which help a lot.

If a user is seriously worried about disk space, I would suggest Borg backup with Borgmatic (GitHub - witten/borgmatic: Simple, configuration-driven backup software for servers and workstations) to help with automating pruning and such. Borgmatic is to Borg roughly what docker-compose is to docker.

I am not referring to this one (and I never used it).

That's what I'm going to do for manual backups. And for archivist ones it's now included, but I'll still need to manually delete the tar backups. And any of my (automated) backups will use around 20 extra GB (for a >5 GB compressed backup), which means I'd need to always keep ≥25 GB of free space just to be able to back up… I have been running fine with ~15 GB of free space on my VPS for almost a year; now I'll need to migrate to another VPS (thus reinstall everything) or do the backups manually :frowning:

My point was that the old behaviour could have been kept as an option (as it was already developed and tested, I suppose it would not have been difficult… but I'm not the dev here), while using no compression by default.
That global setting is at least something (worse than a choice for each backup, but still useful), but it's really hidden :confused:

But for a lot of users, increasing their storage space is not simple, or not really possible.
And that upgrade adds the need for (much) more storage, which was possibly not planned for by YunoHost users when buying their servers.

And also: I don't care if my daily/weekly backup takes even 1 more hour to complete (that won't happen), when it runs in the middle of the night. This costs me ~0. Lots of extra storage costs me, especially when rented with a VPS or such.

Hello,

I second what Lapineige said. I have a VPS with only a 40 GB disk, and I upgraded from a 20 GB one two years ago, essentially to avoid having to delete YunoHost backups all the time - I could have only one backup at a time. Now, with 40 GB, I thought I was fine for a while. Not anymore… My backups, compressed, are roughly 4 GB, and I could keep two or three before deleting one. The new backups are 6.6 GB, and for YunoHost to make them it seems I need even more. Right now I have 7.9 GB of available space and the automatic backup failed this night.

Is it planned to add an option to compress (or not) the automatic backups?

(Anyway, it's my first message on this forum; I'm sorry it's for complaining, the YunoHost team does an amazing job!)

1 Like

My workaround for the moment is using the new option in the archivist app (Allow to choose the compression algorithm by maniackcrudelis · Pull Request #12 · maniackcrudelis/archivist · GitHub, thanks a lot @Maniack_Crudelis for that <3).
It's not ideal, especially because there is always a duplicate (one in the YunoHost backup folder, one in archivist's) that has to be removed manually.

Still, being able to compress it (and you can choose the algorithm, what a luxury :tada:) saves a ton of space, at a very small cost (the biggest backups take 2 min of extra time; at 3 am that's not a big deal…).

1 Like

For the record, compressing a Pixelfed backup, a 6 GB tar archive, to tar.zst (simply using zstd my_backup_file.tar) took only 45 s and resulted in a… 200 MB compressed archive, saving ~97% of the storage space! (almost 6 GB saved for a single backup…)

That was on a Hetzner CX11, the most entry-level of their VPSes (1 CPU, …) - not a computing beast.
Decompressing it took 20 s.

With gzip, it took 1 min 43 s for a 336 MB file (1.7 times bigger, 3 to 4 times slower).
Indeed, if you do 10 big backups like this (which is unusual) and have an overhead of 15-20 min (or more) during the backup, that might be an issue (still, for a lot of people, the issue is more about storage space…).
But with Zstandard, that overhead is greatly reduced. And it potentially saves tens of GB!

No longer being able to compress backups - even with an option disabled by default - is a pity :frowning:

2 Likes