Backups: Implementing Zstandard compression instead of Gzip?

Yes, but imho the current situation is better in terms of “not ending up with unusable archives because the whole tar/gz checksum somehow got randomly corrupted”, which was the initial point of that change.

This is also related to the fact that the old code was doing both the archiving (tar) and the compression (gz) at the same time, increasing the number of issues that may arise. Doing both things sequentially (so compressing the backup only once it’s done) should be more robust while also being more flexible in terms of which compression algorithm to use. But it’s not that trivial because of other things (e.g. backup_info must be able to access info.json on the fly without uncompressing the entire archive)
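Roughly, the difference is something like this (paths are purely illustrative):

```bash
# Old behaviour: archiving and compression in a single pass, so a
# hiccup in either step can leave one corrupted .tar.gz behind
tar -czf backup.tar.gz /path/to/backup/

# Decoupled: build the plain tar first, then compress it only once
# it is known to be complete; any compressor could be plugged in here
tar -cf backup.tar /path/to/backup/
zstd backup.tar -o backup.tar.zst
```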

Anyway, as pointed out previously, volunteers’ time is limited, there are only a thousand topics to deal with in parallel and this isn’t really high priority, the high priority being Borg… That doesn’t stop anybody from working on this, though.

Last but not least, your example seems highly biased: I’m quite surprised that you’re able to compress 6GB of archive into 200MB… What are these data? It sounds like it may be 6GB of non-multimedia (or redundant) files, such that it’s possible to obtain a nice compression ratio somehow. I don’t have any quantified study for this, but I’m guessing that in most cases people either have a bunch of multimedia files that are not compressible, or not-such-a-large-amount-of-data.


Oh, I wasn’t aware of that!
I understood that change was motivated by backup times being too long for some users. That’s why I thought it wasn’t a good choice, at least not without keeping the previous behaviour as a (deprecated?) option.

Right now the “workflow” with archivist is fine, except that you have to create both a tar file and a compressed backup, and manually delete those .tar backups (and automated backups often fail before the backup list is complete, because you run out of space).
But maybe that’s something we should investigate with @Maniack_Crudelis.

I realized a few minutes ago that there was a log issue that made the log file grow up to 5GB. Indeed, my example is really bad.
Yet most small backups (<1GB), such as the YunoHost core for instance, shrink by a factor of 3-5, which can add up to a few GB in the end. And the bigger ones can be roughly halved: on my biggest backups, Wallabag goes from 2.5GB down to 1.5GB, Nextcloud from 1GB to 300MB, … A complete backup being reduced by 5-6GB: when that’s 10 or 20% of your server storage, it counts :confused:

Regarding that point: doesn’t the external .info.json file serve that purpose?

Yup, that speeds things up, but the thing is that it’s also included in the archive itself, for the case where you copy the archive to another server (e.g. for a migration) and will probably forget to copy the info.json… Or maybe we should change the design, idk, just highlighting why it’s not 100% straightforward


Running into this issue again, with non-compressed backups taking a huge amount of space on very limited storage…
Some apps can in fact no longer be backed up, because the backup takes around 25GB (around 5GB compressed) and I only have 15GB left (and can’t expand the storage, which would be costly anyway).

Is there any update on this? YunoHost 4.1 release / Sortie de YunoHost 4.1 - #15 by Maniack_Crudelis

Is the bug with the compression setting fixed? :thinking:

How could we help implement an option to produce a single, compressed backup?
What should we look for/adapt/be aware of?

Did you try the option in the general settings (in the webadmin) to compress your backups?

No, because as reported in the linked message it was buggy at that time, and I don’t want to break my production server. That’s why I’m asking first whether there was any change, either in the code or in the design decisions around such features.

So far I’m doing a normal backup and then a Zstandard compression, but this means I need [tar file size] + [compressed archive size] of storage to complete one single backup.
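Concretely, my manual workflow is roughly the following (the archive name is just an example, and the backup folder may differ on other setups):

```bash
# Create a (now uncompressed) backup with YunoHost
yunohost backup create --apps myapp --name myapp_backup

# Then compress the resulting tar separately with Zstandard
# (this keeps the original .tar, hence the doubled storage need)
cd /home/yunohost.backup/archives/
zstd myapp_backup.tar
```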

Trying to :up: this topic again :slight_smile:

This compression issue is really troublesome for me: I have many automated backups which fail because of the lack of space for the uncompressed tar (roughly 50GB and growing¹), while there is just enough space for compressed backups.
On some servers I can increase the storage space, but with a significant price increase for space that sits unused 95% of the time; on others I can’t without buying a new SSD (which is costly and wasted resources).

¹ To be precise, it’s not one whole 50GB backup but a set of them, and archivist creates both the compressed and the uncompressed .tar, and I can’t clean up the mess (manually…) before everything ends.

I’m really willing to help develop this (but I will need guidance, and can’t afford to set up a dev environment to test it), or at least to test it if someone is able to tackle it.

But first of all I need to know this:

Is the bug with the compression setting fixed? :thinking:

I know nothing of the YunoHost core backup feature, so I might be very wrong, but from an external point of view implementing this seems quite simple on the compression side: you have a list of files to put in an archive, so you just create a tar.zst with the appropriate command using that file list.
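Something along these lines, I imagine, assuming a recent GNU tar with built-in zstd support (file names are placeholders):

```bash
# files.list contains one path per line, as gathered by the backup logic
tar --zstd -cf backup.tar.zst -T files.list

# Or, with an older tar, by piping through zstd explicitly:
tar -cf - -T files.list | zstd -o backup.tar.zst
```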

About the point you raised, @Aleks:

(e.g. backup_info must be able to access info.json on the fly without uncompressing the entire archive)

Isn’t that provided by the tar(.something) command, which allows you to extract a single file (or its content) from the archive?
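For instance, I believe something like this would print just that one file without unpacking the rest to disk (the exact member path inside the archive is a guess on my part):

```bash
# -O sends the extracted member to stdout instead of creating a file
tar --zstd -xOf backup.tar.zst ./info.json
```

Though if I understand correctly, tar still has to decompress the stream sequentially until it reaches that member, so this avoids writing everything to disk but not all of the decompression work.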

How could we help implement an option to produce a single, compressed backup?
What should we look for/adapt/be aware of?

:up: too.


Maybe I should open another topic and migrate this discussion? I still think it would be better to implement Zstandard instead of Gzip for the sake of speed and CPU load, so I think it’s related.

As a workaround: @Maniack_Crudelis, in archivist you implemented a “symbolic link” option for “no compression”.
This, if I understand it correctly, is a way to create the backup on the drive (?) and create a symlink to it in the YunoHost folder, so there is no extra use of space.
Could something similar be implemented for compressed backups?

That would mean these backups could not be restored right away (and, unless we include some config panel option, couldn’t be decompressed without the command line), but they could still be listed (with their json?) in the default backup folder.

I suppose your code makes a classical backup and then compresses it? It’s not optimal, but in such a situation it could remove the uncompressed .tar after compressing it, right?
This would still require [compressed + uncompressed] storage space for each archive, but in the end:

  • for the whole backup of all apps, you would only need [all compressed backup archives + the last uncompressed backup] of free storage space to perform all backups.
  • there would no longer be duplicates between compressed and uncompressed backups.

It was previously a tar.gz produced directly by the YunoHost backup command, but since the backup doesn’t compress at all anymore, the compressed archive is now created afterwards from the tar file.

So yes, it somewhat duplicates the file to create a compressed version of it. It could remove the backup once the compression is done, but you would still need enough storage to hold the uncompressed backup and the compressed one at the same time.
Removing the uncompressed backup after the compression could indeed help you free some space, but it would then be impossible to restore that backup from YunoHost. You would have to use Archivist_restorer.sh to do it. It could be implemented though.
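If that gets implemented, I suppose zstd’s --rm option could handle the cleanup by itself (file name is just an example):

```bash
# --rm deletes the source .tar only once the compressed file has been
# written successfully, so you still need room for both at peak
zstd --rm myapp_backup.tar
```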

That’s it: it’s a way to move your backup elsewhere without duplicating the backup file.
The best would be to be able to do that with any compressed backup, so you wouldn’t have a duplicate of each backup and would still be able to list and restore backups from YunoHost.

As a reminder though, Archivist was never intended to replace any YunoHost commands, and even less to work around regressions in the backup system. I can manage to do some tricks, because I’m impacted like everyone else. But the best would be to fix that backup system and listen to the users.

Archivist’s first purpose was, and still is, to create regular backups and to duplicate them into different places to ensure redundancy.


Thank you for your input.

Impossible from the web admin UI, but doable if one knows enough CLI to extract the archive, right?
Could an admin panel UI be implemented to list all archivist (compressed) backups and offer an action to decompress them into the YunoHost backup archive folder?
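To make the question concrete, the manual flow I have in mind would be something like this, though the archive name, the path and the exact naming YunoHost expects are assumptions on my side:

```bash
# Decompress the archivist backup back into YunoHost's archive folder
zstd -d myapp_backup.tar.zst -o /home/yunohost.backup/archives/myapp_backup.tar

# Then restore it as usual
yunohost backup restore myapp_backup
```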

I’m not 100% sure I understand what this script does.
Am I right in saying it’s a kind of wrapper for 1) a decompress command, and then either 2) extracting to the original location or 3) the yunohost backup restore command?

What would be your position on pull requests implementing this kind of feature?

Not easily, but yes, you could do that; Archivist_restorer.sh does that there.

Not easily, as far as I know. But it may have changed since I last looked into it.

It’s all in the code:

If it works, I’m not against any improvement.


I didn’t search the forum to check whether there’s a more recent conversation about this subject, sorry for raising the dead :wink:

Instead of implementing compression in code, could it be implemented in the filesystem?

There is a relatively large number of FUSE compression options; apt search fuse compres gives a non-optimized list of related packages (non-optimized as in: it shows packages that are not useful and probably misses a few that are), and a cursory search on the ’Net gives many more.

The different modules offer different options, and not all of them support writing as well as reading, so not all of them would be a match.

The benefit would be that, for all intents and purposes, the compressed archives would be transparent to other applications. Accessing them would incur some penalty in RAM/CPU, but in many cases CPU speed beats disk speed, so it could even be faster than regular file access. In the end it is mostly data at rest, so some penalty would probably not be a major hurdle.

Well in that case I would install the whole system on a Btrfs filesystem, with Zstandard compression enabled.
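For the record, enabling that boils down to a mount option, something like this (mount point and device are placeholders):

```bash
# Enable transparent Zstandard compression on an existing Btrfs mount
# (only affects files written after the remount)
mount -o remount,compress=zstd /

# Or persistently, via an /etc/fstab entry such as:
# /dev/sdX1  /  btrfs  defaults,compress=zstd  0  0
```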
This doesn’t solve the issue of downloading a very big archive instead of a tiny one when offloading those backups, which is critical on slow internet connections.

I must have skipped over that requirement in the rest of the thread :wink:

Even so, my suggestion was not meant as a one-size-fits-all solution, but only thrown in in case it had been overlooked.

When downloading the backups it would be beneficial to have minimal network traffic, I see the disadvantage there. Wouldn’t the backups be available as regular archives when circumventing the FUSE mount point?

It wasn’t really expressed :slight_smile:

But yes, your proposal is interesting, I just feel like it’s two steps backwards compared to what we had previously (one because there is no compression out of the box, two because you need a complicated extra step to partly get it back).

I’m not sure I understand your point, but if you download the uncompressed file, the source filesystem doesn’t matter.