I recently discovered Zstandard (abbreviated to zstd below) compression, which is an extremely fast (almost realtime) compression algorithm with a decent compression ratio.
I have a proposal concerning Yunohost backup archives.
Proposal
In Yunohost backups, replace (tar.)gzip archives with zstd archives.
Advantages
(Several times) Faster backups
(Several times) Faster restore
Slightly smaller backups
Potentially less CPU load (it finishes faster)
Drawbacks
This new format might confuse some users (who may need to learn new commands, even if they are very similar to the gzip ones)
Would it break some app scripts? Custom user scripts? External tools?
Motivation
Archiving and decompressing backups in Yunohost takes time, especially on low-powered hardware (such as a Raspberry Pi).
Yunohost still uses the gzip algorithm, which is now a lot slower than other standard algorithms and not that good in terms of compression ratio. Using a more modern and efficient algorithm could give our users faster backup & restore while using less disk space.
Technical details - and why choose Zstandard
Compared to gzip, according to this benchmark https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/ (privacy warning: Facebook-related link), Zstandard has a similar compression ratio but compresses about 5 times faster and decompresses about 3.5~4 times faster.
This benchmark also compares it with other common algorithms, such as lz4, zlib and xz.
Short summary: lz4 is lightning fast but worse in compression ratio, while zstd is fast enough (around 7 s/GB); zlib is slower; xz compresses more but is almost 100 times slower.
Example: a 291MB backup from my Yunohost gives a 68.6MB tar.gz (after ~10 seconds), a 52.7MB tar.xz (in ~1 min), and a 62.8MB tar.zst (after ~1.5 s, about 10% smaller than gzip). NB: this test was done on a fast desktop computer, with a fast CPU and storage. The times would be a lot longer on most of our users' hardware (Raspberry Pi, …).
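For anyone wanting to reproduce this kind of comparison, something along these lines should do (sketch only: the directory and file names are placeholders, and the `-I '…'` form needs a reasonably recent GNU tar):

```
# Rough reproduction of the comparison above ("backup_data/" is a placeholder):
time tar -czf backup.tar.gz backup_data/              # gzip (current behaviour)
time tar -cJf backup.tar.xz backup_data/              # xz
time tar -I 'zstd -1' -cf backup.tar.zst backup_data/ # zstd, level 1
ls -lh backup.tar.gz backup.tar.xz backup.tar.zst
```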
Also, Zstandard seems to be supported in all modern distributions, yet I'm unsure of its support status on Debian.
Which compression level is better?
Here is a summary based on personal benchmarks and some internet research (I can provide details if needed).
The compression ratio is very similar across most compression levels (from 1 to 22). A noticeable gain appears above level 10~15, but the speed cost is very high for little benefit. I'd suggest keeping it at level 1, to be as fast as possible.
Decompression speed is the same no matter the compression level.
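To check this on your own data, a quick sweep over a few levels can look like this (sketch only; `sample.tar` is a placeholder for any existing archive you want to test with):

```
# Compare a few zstd compression levels on an existing archive ("sample.tar" is a placeholder).
# -k keeps the input file, -f overwrites any previous output.
for level in 1 3 9 15 19; do
    echo "== level $level =="
    time zstd -k -f -$level sample.tar -o sample.level$level.tar.zst
done
ls -lh sample.level*.tar.zst
```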
Is that supposed to be the case in production in the future?
If yes, why? Compression saves us a lot of disk space (which is expensive/limited for most users).
And if it's a matter of speed, I'd think that this algorithm could solve the problem.
The rationale is the following: when we're talking about "big" Yunohost backups (say > 1~10GB), the bulk of the data usually corresponds to multimedia files, which are already compressed - and therefore trying to compress these is a huge waste of time/CPU (because the algorithm desperately tries to compress something) for no benefit. We saw many people reporting that their backup/restore takes a huge amount of time, which is probably related to this.
If disk space really is a concern and users are confident that compression is relevant in their case, it's still possible to compress/uncompress after/before a backup (and if that's really relevant, we could have some action to perform this easily from the webadmin or so - but honestly the real answer is just to go for Borg integration in the core).
There are also database files, which (I believe) compress well and can be quite big. But maybe all backup exports are already compressed?
edit: I did a quick test: a basic, non-Yunohost MySQL database export from my server went from 2GB to 600MB, which is decent (multiply this by, for instance, 10 weekly backups and that's something). It took 3 seconds.
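(For reference, something like this is enough to try it at home - the dump and database names below are placeholders:)

```
# Compress an existing SQL dump, keeping the original ("dump.sql" is a placeholder):
zstd -1 -k dump.sql          # produces dump.sql.zst

# Or pipe the export directly, so the uncompressed file is never written to disk:
mysqldump my_database | zstd -1 -o dump.sql.zst
```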
Another test: my Wallabag export (including images for each article) is about 1.8GB, and 1.4GB as a tar.gz (1.3GB with tar.zst). Proba
Zstandard has a built-in heuristic which avoids compressing data that cannot be compressed (enough). This is supposed to have minimal time and CPU cost (instead of wasting effort on what cannot really be compressed).
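A rough way to see that behaviour, using generated test data rather than a real backup:

```
# zstd detects incompressible data and mostly passes it through instead of burning CPU on it.
head -c 200M /dev/urandom > random.bin   # random data: essentially incompressible
time zstd -k -f random.bin               # very fast, output is ~ the same size as the input
time gzip -k -f random.bin               # noticeably slower, for the same (non-)result
ls -lh random.bin random.bin.zst random.bin.gz
```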
Based on the issue you reported, I suppose it's not the case for gzip.
If that's right, maybe that's the intermediate solution?
And in my case (a Raspberry Pi with slow storage) I noticed that backup times are mainly due to slow storage and simultaneous reads and writes on the same storage, rather than CPU time (yet it's still an important factor).
In my case it would be a real concern: saving a few hundred MB per backup is really worth it, but I understand that it's indeed better to put effort into Borg integration than into workarounds. Yet, instead of having to (un)compress manually, would it be possible to disable compression by default but keep it as a backup command option? (I suppose it would require little effort on the development side, but I'll let you judge.)
Apparently yes, this was implemented because we thought somebody would come and ask for this sooner or later…
Note that this is still only .tar.gz though… The thing is that .tar.gz is automatically supported by the Python tar lib, and I doubt (but could be wrong) that more "exotic"/less-known algorithms like zstd are supported. Hence, to properly integrate this, we would have to dig deeper into how the library works, and naively that's not trivial (- oooor write the full archive to disk and then compress it, which feels not optimal at all).
Ok, I thought it was done with bash scripts (I forgot Yunohost uses Python), in which case it would be trivial, as it's just a matter of one argument in the tar command.
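For the bash side, it really is one flag away from the current gzip call (a sketch, assuming a GNU tar recent enough to know about zstd, i.e. >= 1.31, or using `-I` to name the compressor explicitly; directory names are placeholders):

```
# Current gzip behaviour:
tar -czf backup.tar.gz some_backup_dir/

# Same thing with zstd, via the dedicated flag (GNU tar >= 1.31)...
tar --zstd -cf backup.tar.zst some_backup_dir/

# ...or by naming the compressor explicitly, which also lets you pick the level:
tar -I 'zstd -1' -cf backup.tar.zst some_backup_dir/

# Extraction works the same way:
tar --zstd -xf backup.tar.zst
```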
If anyone is willing to implement such a feature, Iād be happy to help beta testing, but I understand itās unlikely to happen.
(And meanwhile I could do it manually with zstd if needed)
I did some tests with uncompressed archives compared to compressed ones (thanks a lot to @Maniack_Crudelis for implementing this in archivist). Here are the detailed results: Allow to choose the compression algorithm by maniackcrudelis · Pull Request #12 · maniackcrudelis/archivist · GitHub
TL;DR: a single weekly backup is ~4GB bigger (compressed backup size: roughly 5GB), almost twice as big. If we extrapolate this, 3 weekly backups + 1 monthly one would result in a ~16GB increase in storage space.
In my own case, I can't afford such a big storage loss - in particular it prevents me from doing any further backups, because I'm out of space with another 9GB backup (I'll have to remove an older one before creating the new one, and I'm not a big fan of that).
The full backup seems to be only a few minutes longer with compression enabled (but I did not compare that precisely), for ~10 backups including 4 big ones (Nextcloud, Synapse, Pixelfed, WordPress). Thanks, Zstandard!
I won't argue against disabling compression by default; I understand the reason behind this choice and I'm not in a good position to say whether it was worth it or not.
But couldn't we keep an option for that?
But (as indicated in this comment) this feature is buggy.
In addition, it doesn't give us the choice on a per-backup basis. That's not a problem for me, but I still don't get why there is no option in the backup command to keep the old behavior (compression).
Indeed… but then that's a bug and it should be fixed…
If you're referring to the "old" --no-compress option, that option was ~misleading w.r.t. gzipped/not-gzipped. What it actually did was create an entire "raw" backup directory (so basically a backup that is not even .tar-ed). This behavior should still be reproducible with the --method copy option… but honestly I just don't know how useful this is (and it's not really well tested).
Alternatively, regarding compressing backups, you may still manually run the gzip command after creating the backup:
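(something along these lines - assuming the default archive location, and the archive name is of course a placeholder:)

```
# Compress an existing archive after "yunohost backup create" ("my_backup" is a placeholder):
gzip /home/yunohost.backup/archives/my_backup.tar

# And decompress it again if/when needed:
gunzip /home/yunohost.backup/archives/my_backup.tar.gz
```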
and that's it… Of course, we could have better integration to offer this choice in the webadmin… But honestly, if you really have issues with backup sizes, you should really consider using Borg (and yeah, at some point soon™ we'll have Borg in the core, as we've been saying for like 3 years lol… :/)
I am not referring to this one (and I never used it).
That's what I'm going to do for manual backups. And for archivist ones it's now included, but I'll still need to manually delete the tar backups. And any of my (automated) backups will use around 20 extra GB (for a >5GB compressed backup), which means I'd need to always keep ≥25GB of free space just to be able to back up… I have been running fine with ~15GB of free space on my VPS for almost a year; now I'll need to migrate to another VPS (and thus reinstall everything) or do the backups manually.
My point was that the old behavior could have been kept as an option (as it was already developed and tested, I suppose it would not have been difficult… but I'm not the dev here), while using no compression by default.
That global setting is at least something (worse than a per-backup choice, but still useful), but it's really hidden.
But for a lot of users, increasing their storage space is not simple or not really possible.
And that upgrade adds the need for (much) more storage, which was possibly not planned by Yunohost users before buying their servers.
And also: I don't care if my daily/weekly backup takes even 1 more hour to complete (that won't happen) when it runs in the middle of the night. This costs me ~0. Lots of extra storage does cost me, especially when rented with a VPS or such.
I second what Lapineige said. I have a VPS with only a 40GB disk; I upgraded from a 20GB one two years ago, essentially to avoid having to delete Yunohost backups all the time - I could keep only one backup at a time. Now, with 40GB, I thought I was fine for a while. Not anymore… My backups, compressed, are roughly 4GB, and I could keep two or three before deleting one. The new backups are 6.6GB, and for Yunohost to create them it seems I need even more. Right now I have 7.9GB of available space and the automatic backup failed this night.
Is it planned to add an option to compress the automatic backups or not?
(Anyway, it's my first message on this forum; I'm sorry it's a complaint - the Yunohost team does an amazing job!)
Still, being able to compress it (and you can choose the algorithm, what a luxury) saves a ton of space at a very small cost (the biggest backups take 2 min of extra time; at 3 am that's not a big deal…).
For the record, compressing a Pixelfed backup, a 6GB tar archive, to tar.zst (simply using zstd my_backup_file.tar) took only 45 s and resulted in a… 200MB compressed archive, saving ~97% of the storage space! (Almost 6GB saved for a single backup…)
That was on a Hetzner CX11, the most entry-level of their VPSes (1 CPU, …) - not a computing beast.
Decompressing it took 20s.
With gzip, it took 1 min 43 s for a 336MB file (1.7 times bigger, 3 to 4 times slower).
Indeed, if you do 10 big backups like this (which is unusual) and have an overhead of 15-20 min (or more) during the backup, that might be an issue (still, for a lot of people, the issue is more about storage space…).
But with Zstandard, that overhead is greatly reduced. And it potentially saves tens of GB!
No longer being able to compress backups - even with an option disabled by default - is a pity.