Apps getting deleted on app upgrade crash

shine · May 22, 2020, 6:50am

My YunoHost server

Hardware: dedicated bare-metal server
YunoHost version: 3.8.4
I have access to my server : Through SSH
Are you in a special context or did you perform some particular tweaking on your YunoHost instance ? : no

Description of the issue

I got ( too ) excited about the latest release because of the diagnosis feature and jumped to upgrade my server to the latest release. I upgraded from 3.7.1.1 to 3.8.4.3

context

after the yunohost upgrade ( and playing around with the diagnostic tool ), I naturally proceeded to upgrade the yunohost apps in the system.
I had the following apps lined up for upgrade ( in that order ) :

hubzilla
netdata
synapse
riot
wordpress
diaspora
spip

The first upgrade of hubzilla failed because Job for nginx.service failed. I went digging…
I don’t know what happened, but since the yunohost system upgrade, my nginx takes exactly 1m41s to do any action - be it reload, restart, start, stop, whatever; even nginx -t takes the exact same time to execute.
I also tried running it manually from the command-line and it took the exact same time. I couldn’t figure out what was the source of the problem, but I know it is isolated to nginx. All other actions on the server run fine and smooth. Due to this issue, systemd times out nginx by interrupting a reload command ( one that yunohost app upgrade runs too ).
I spent too much time ( a whole night ) trying to get to the bottom of this issue; and then finally gave up and increased TimeoutStartSec for nginx.service.
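For reference, one common way to raise that timeout is a systemd drop-in file. This is a minimal sketch and not necessarily what was done here: the 300-second value, the file name timeout.conf, and the helper write_nginx_timeout_dropin are all assumptions for illustration.

```shell
# Hypothetical helper: write a systemd drop-in that sets TimeoutStartSec
# for nginx.service.
#   $1 = timeout in seconds (assumed value, pick your own)
#   $2 = systemd unit directory (defaults to /etc/systemd/system)
write_nginx_timeout_dropin() {
    local timeout_sec="$1"
    local unit_dir="${2:-/etc/systemd/system}"
    local dropin_dir="$unit_dir/nginx.service.d"

    mkdir -p "$dropin_dir" || return 1
    # A drop-in only overrides the keys it lists; the packaged unit stays untouched.
    printf '[Service]\nTimeoutStartSec=%s\n' "$timeout_sec" > "$dropin_dir/timeout.conf"
}

# Usage (as root), then make systemd re-read its unit files:
#   write_nginx_timeout_dropin 300
#   systemctl daemon-reload
```

Removing the drop-in file and running systemctl daemon-reload again restores the default timeout.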

the actual problem

I spent so much time digging into the nginx issue that I forgot that I had pending app upgrades to do as well.
So, once I had increased the timeout for nginx, I came back to running the same upgrade command. This time I get thrown another error : The app hubzilla could not be found in the applications list. Indeed, it wasn’t there. That’s when I looked at the previous update logs again.

shine@yunohost:~$ sudo yunohost app upgrade hubzilla netdata synapse riot wordpress diaspora spip
Info: The following apps will be upgraded: hubzilla, netdata, synapse, riot, wordpress, diaspora, spip
Info: Now upgrading hubzilla…
Info: [+...................] > Loading installation settings... [00h00m,00s]
Info: [#++.................] > Backing up the app before upgrading (may take a while)... [00h00m,02s]
Info: Upgrading source files...
Info: [###++...............] > Upgrading source files... [00h01m,12s]
Info: [#####++.............] > Upgrading nginx web server configuration... [00h00m,19s]
Warning: Job for nginx.service failed.
Warning: See "systemctl status nginx.service" and "journalctl -xe" for details.
Warning: Invalid argument: --
Warning: [ERR] Upgrade failed.
Warning: dpkg: warning: while removing apache2-bin, directory '/var/lib/apache2' not empty so not removed
Warning: dpkg: warning: while removing dconf-gsettings-backend:amd64, directory '/usr/lib/x86_64-linux-gnu/gio/modules' not empty so not removed
Warning: Job for nginx.service failed.
Warning: See "systemctl status nginx.service" and "journalctl -xe" for details.
Warning: Invalid argument: --
Warning: hubzilla has not been properly removed
Warning: 105993 /!\ Packagers! This app is still using the skipped/protected/unprotected_uris/regex settings which are now obsolete and deprecated... Instead, you should use the new helpers 'ynh_permission_{create,urls,update,delete}' and the 'visitors' group to initialize the public/private access. Check out the documentation at the bottom of yunohost.org/groups_and_permissions to learn how to use the new permission mechanism.
Warning: 106926 Group 'visitors' already has permission 'hubzilla.main' enabled
Warning: 106927 This permission is currently granted to all users in addition to other groups. You probably want to either remove the 'all_users' permission or remove the other groups it is currently granted to.
Warning: 106929 The permission was not updated because the addition/removal requests already match the current state.
Warning: 197503 Job for nginx.service failed.
Warning: 197504 See "systemctl status nginx.service" and "journalctl -xe" for details.
Warning: 197506 Invalid argument: --
Warning: 198056 Could not restore the app 'hubzilla'
Warning: Traceback (most recent call last):
Warning:   File "/usr/lib/moulinette/yunohost/backup.py", line 1407, in _restore_app
Warning:     env=env_dict)[0]
Warning:   File "/usr/lib/moulinette/yunohost/hook.py", line 347, in hook_exec
Warning:     raise YunohostError('hook_exec_failed', path=path)
Warning: YunohostError: Could not run script: /tmp/restore38mzEL/restore
Warning: 198123 Here's an extract of the logs before the crash. It might help debugging the error:
Warning: 298921 Job for nginx.service failed.
Warning: 298922 See "systemctl status nginx.service" and "journalctl -xe" for details.
Warning: 299024 Invalid argument: --
Warning: 299191 hubzilla has not been properly removed
Warning: 303451 Nothing was restored
Warning: The app was restored to the way it was before the failed upgrade.
Error: Could not upgrade hubzilla: An error occurred inside the app upgrade script
Info: The operation 'Upgrade the 'hubzilla' app' could not be completed. Please share the full log of this operation using the command 'yunohost log display 20200522-022111-app_upgrade-hubzilla --share' to get help
Warning: Here's an extract of the logs before the crash. It might help debugging the error:
Info: DEBUG - 6694 + '[' '!' -e /etc/fail2ban/filter.d/hubzilla.conf ']'
Info: DEBUG - 6694 ++ realpath /etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6695 + src_path=/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6695 + [[ -z '' ]]
Info: DEBUG - 6696 + dest_path=etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6696 + [[ -e etc/fail2ban/filter.d/hubzilla.conf ]]
Info: DEBUG - 6697 + local rel_dir=/apps/hubzilla/backup
Info: DEBUG - 6697 + rel_dir=/apps/hubzilla/backup/
Info: DEBUG - 6698 + dest_path=/apps/hubzilla/backup/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6698 + dest_path=apps/hubzilla/backup/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6698 ++ echo /etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6699 ++ sed --regexp-extended 's/"/\"\"/g'
Info: DEBUG - 6699 + local src=/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6700 ++ echo apps/hubzilla/backup/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6700 ++ sed --regexp-extended 's/"/\"\"/g'
Info: DEBUG - 6701 + local dest=apps/hubzilla/backup/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6701 + echo '"/etc/fail2ban/filter.d/hubzilla.conf","apps/hubzilla/backup/etc/fail2ban/filter.d/hubzilla.conf"'
Info: DEBUG - 6702 ++ dirname /home/yunohost.backup/tmp/hubzilla-pre-upgrade1/apps/hubzilla/backup/etc/fail2ban/filter.d/hubzilla.conf
Info: DEBUG - 6703 + mkdir --parents /home/yunohost.backup/tmp/hubzilla-pre-upgrade1/apps/hubzilla/backup/etc/fail2ban/filter.d
Info: DEBUG - 6706 + echo '[####################] > Backup script completed for hubzilla. (YunoHost will then actually copy those files to the archive). [00h00m,01s]'
Info: DEBUG - 6707 + ynh_exit_properly
Error: The app 'hubzilla' failed to upgrade, and as a consequence the following apps' upgrades have been cancelled: hubzilla, netdata, synapse, riot, wordpress, diaspora, spip
Error: The operation 'Upgrade the 'hubzilla' app' could not be completed. Please share the full log of this operation using the command 'yunohost log display 20200522-022111-app_upgrade-hubzilla --share' to get help

I could see so many red flags there

Invalid argument: -- ( second line after the Job for nginx.service failed )
hubzilla has not been properly removed - why? why is an upgrade attempting to remove an app?
Could not restore the app 'hubzilla'

So, let me go through all of that together :

An app upgrade may crash due to many reasons ( even systemd service reloads )
deleting an app assuming that it can be installed later ( during a supposed app “upgrade” )
because of the upgrade process failing, the process is terminated

where does that leave me? with one app fewer than I had before I started the upgrade.

that’s not what I was expecting when I ran that command. I thought it was an isolated incident and let it slide.

and then nicely removed hubzilla from the list of apps and ran the command again ( hoping nothing would break this time ). This time PHP failed to reload because it couldn’t find the hubzilla path - /var/www/hubzilla ( remember we’re in partial installation state twice now ). and there I lost another app - netdata ( I didn’t care much about hubzilla because it didn’t work for me ; but I loved netdata ). And this happened a third time with synapse ( yes, synapse too ) and this time it was fail2ban complaining about hubzilla's jail. Thankfully, that was the last of hubzilla and the remainder of the apps upgraded in peace ( but what good is riot if I don’t have synapse running? I could even run riot locally from my machine; but I need to connect to my homeserver; which I don’t have anymore )

so, here’s a question : why are apps being deleted during a supposed “upgrade” process? Is this a YunoHost thing or under the control of the app developer? I think it is on the systems side, rather than the app.

also, there’s something fundamentally wrong about the way the process status checks are handled during the upgrade process. Otherwise, there shouldn’t be a situation where I even get to this state.

PS : I was able to debug all of this because I’m savvy with the terminal. That wouldn’t be the case with a novice user.

PPS : I’m more than 50% sure that this is how I lost my gitlab app after I had upgraded another minor version earlier; but then I wasn’t using that gitlab instance at all. I had installed it many years ago; but didn’t bother to use it. The only thing it was doing was eating a lot of the available RAM by simply running. Sometimes I’d go and kill it when the system was choking due to lack of memory. Other times, I just didn’t care. So, when it got “accidentally deleted”, I thought “good riddance”; but that’s definitely not how I felt when I lost my netdata and synapse.

is there some way to triage this and get it fixed? we don’t want existing users hitting these issues, do we?

Aleks · May 22, 2020, 10:44am

Yes, that kind of thing shouldn’t happen

Because the upgrade of the app failed, and if Yunohost or the app didn’t do anything, it would leave it in a broken state. So instead, what’s attempted is to restore the app as it was before. But to do so, it first gotta be removed… The real questions are :

why did the upgrade fail ?
why did the restore fail ?

and it seems related to nginx not being able to restart/reload …? Which pretty much explains it, because any step like upgrade/remove/restore will eventually lead to the app wanting to restart/reload nginx, and failing to do so will look like there’s a big issue in the app.

Which in turn seems related to your issue about it taking 1m41s to do stuff and timing out … But supposedly you fixed the issue by increasing the timeout … But apparently it still ends up failing anyway …

Note that the automatic pre-upgrade safety backups of each of your apps are still there but first the issue about nginx should be fixed…

shine · May 22, 2020, 2:23pm

Oh, that makes sense now. At least, that explains why there are 3 attempts to reload nginx.

The thing that worries me is the line Invalid argument: --. I don’t think I’ve seen that in previous upgrades.
Where should I look in order to find the command that triggers the nginx reload? I think something has gone wrong there.

Also, the following log is also kind of misleading :

it’s not a fix, it’s a band-aid. now all of my nginx commands take that long to execute. if the upgrade commands are using nginx reload, and if I have more than 3 apps to upgrade, it could take forever to get done. I could go out, get coffee and come back and the upgrade would still be in the first app.

that’s re-assuring to hear; but I guess I already knew that too. like you said, the nginx problem is more grave. however, I don’t care about the band-aid though. what bothers me is the fact that my apps are getting deleted and I don’t want that happening.

Aleks · May 22, 2020, 2:39pm

It’s a small issue from the latest release and it’s fixed, we’ll release a hotfix soon™. It shouldn’t cause any issue in itself because that’s just yunohost trying to display the logs of the failure to help debugging.

Anyway, back to nginx taking too long: if I understand correctly, nginx -t also takes an eternity to run.

We can try to debug why but that’s pretty technical … I would install strace with 'apt install strace' and run 'strace nginx -t'

That will display a shitload of info but maybe we can see where/why it hangs so much…
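To make the hang easier to spot in all that output, strace can record how long each system call took. A sketch, assuming the trace is written to /tmp/nginx-t.trace (an arbitrary path) and that anything slower than one second is worth a look; slow_syscalls is a hypothetical helper, not part of strace.

```shell
# Record a trace first (as root):
#   strace -f -tt -T -o /tmp/nginx-t.trace nginx -t
# -f follows child processes, -tt adds timestamps, -T appends the time
# spent in each syscall as <seconds> at the end of the line, -o writes
# the trace to a file instead of the terminal.

# Hypothetical helper: print trace lines whose syscall took at least
# $2 seconds (default 1).
slow_syscalls() {
    local trace_file="$1" min_seconds="${2:-1}"
    awk -v min="$min_seconds" '
        # With -T, the duration is the last field, e.g. <5.005116>
        $NF ~ /^<[0-9.]+>$/ {
            d = substr($NF, 2, length($NF) - 2)
            if (d + 0 >= min + 0) print
        }
    ' "$trace_file"
}

# Usage:
#   slow_syscalls /tmp/nginx-t.trace 1
```

The calls that show up here, and the ones just before them in the full trace, usually indicate what the process was waiting on.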

shine · May 24, 2020, 12:15am

oh, that’s a relief to hear. I was thinking that it had something to do with the error itself.

I can handle technical. I’m a “sysadmin” by profession.

I did think of strace, but I just didn’t have the time to go through the heap of logs that I’d get thrown at me before posting here. I put it off for later, but then never went back to it ( I’m going back to it right now though. what else am I supposed to do on a Sunday AND during a lockdown? )

shine · May 24, 2020, 1:21am

that was easy. I could’ve done this the same day I was debugging the issue; but alas, I was assuming that it was Yunohost’s fault.

what happened was the first issue on my server diagnosis list :

        details: The file /etc/resolv.conf should be a symlink to /etc/resolvconf/run/resolv.conf itself pointing to 127.0.0.1 (dnsmasq). If you want to manually configure DNS resolvers, please edit /etc/resolv.dnsmasq.conf.
        status: WARNING
        summary: DNS resolution seems to be working, but it looks like you're using a custom /etc/resolv.conf.

I had my /etc/resolv.conf pointing to /etc/resolv.dnsmasq.conf. I thought that was OK too.

but then I noticed that /etc/resolv.dnsmasq.conf had 25 lines with both IPv4 and IPv6 addresses that I didn’t recognize ( or don’t recall ) putting there. now, I was wondering who even changed the symlink at all. I don’t recall doing that either; but then again, that change happened more than a year ago ( judging from the timestamp on the symlink ); so, I could be wrong too.

is ( or rather was ) Yunohost populating /etc/resolv.dnsmasq.conf at any point? I don’t see why it would have to; but I just want to clarify.

nginx was trying to resolve my hostname from my /etc/hosts using the nameservers and was timing out. this went on every 5 seconds for 1m40s, until the DNS resolution eventually gave up and moved on.

just changing the symlink back to /etc/resolvconf/run/resolv.conf was sufficient to fix my problem.
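The check and the fix can be sketched in shell. The expected target path is the one named in the diagnosis message quoted above; resolv_conf_ok is a hypothetical helper name, and the path arguments exist only so the check can be tried against a scratch directory first.

```shell
# Hypothetical helper: succeed only if $1 is a symlink whose immediate
# target is $2 (the defaults follow the diagnosis message).
resolv_conf_ok() {
    local link="${1:-/etc/resolv.conf}"
    local expected="${2:-/etc/resolvconf/run/resolv.conf}"
    [ -L "$link" ] && [ "$(readlink "$link")" = "$expected" ]
}

# Usage (as root): repoint the symlink only if it is currently wrong.
#   resolv_conf_ok || ln -sfn /etc/resolvconf/run/resolv.conf /etc/resolv.conf
```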

my nginx problem is resolved now and now I can remove that timeout that I increased; but the underlying issue that I raised still remains :
if a process reload fails, shouldn’t there be some way where the app is not getting deleted?
I understand that the current flow is not incorrect. and that there is an app data backup that can be used to restore the app too.

I can’t think of a better way to handle this scenario either but I guess we should somehow be trying to handle this scenario though. I’m sure it’s not an edge case. This can happen to anyone.

Aleks · May 24, 2020, 5:26am

Hmmm yeah that’s a tricky issue … But if you know system administration you may know how system administration is hell …

In the past we regularly had issues like this where an app would fuck up some conf (for example nginx) and that would trigger an epically stupid reaction (as in “not stopping to attempt to upgrade other apps”, so imagine what would happen if yunohost kept trying to upgrade your N other apps even if the first broke something)

Nowadays there are some checks and safeguards to prevent that kind of stuff (e.g. checking integrity of nginx conf with nginx -t, and other things …) but ultimately it’s difficult to cover everything, as your example about timeout shows … Or you can imagine that if an app wants to fuck up something on the system, it can perfectly break the nginx conf, and decide that even during the uninstall script it won’t undo its fuckery, and bam, the app messed up your system.

So I would say this is really a limitation of the current paradigm of yunohost packaging, and in fact a limitation of the current paradigm of classical system administration. There is no analog of “transactional operation” in sysadmin such that you could automatically rollback to the same state you were in before starting the transaction, as you would see in databases. Stuff like nix/guix (I don’t really know the names and differences) may kind of solve this, but that raises many questions (e.g. if you rollback the system but there were changes in the data on the system, can you rollback only the system and not the data ?). And it’s not realistic to move yunohost to that system anyway: it would basically mean rewriting the entire project.

Edit: oooor there’s also LVM but ugh I’m not sure it can be easily deployed on any hardware/VPS so that would only be a partial solution

shine · May 24, 2020, 7:42pm

That’s why I said I couldn’t think of a way to solve this issue either. I think what YunoHost is already doing is already more than enough for the general use-cases.

Yes, I remember that time. I guess that switch happened with the major version upgrade in 3.0 right?

I didn’t mean to imply that at all. I was only trying to raise the concern to this issue. I would’ve suggested a solution if I could think of a way ( trust me I did think really hard before I gave up and said I couldn’t think of anything ) to solve the issue; but I couldn’t think of anything. I’ll keep a neuron out to process it if I do come up with something.

aside question : where would I go to give praise to some system? I can’t find a place on the forum to say good things about something.
I can understand if people only came here for support or otherwise just rant, but I think we should also have at least a sub-category to give praise and appreciation as well.

tmb · May 25, 2020, 8:17pm

Just from reading that, I wonder why not check this before triggering the update?

Actually, as the app state is completely unknown, I think it does make some sense for an automated admin system to stay safe, remove such an app, and kill all remaining processes. Then wait for a skilled admin to load the backup and manually debug and fix the not automatically fixable situation.

shine · May 26, 2020, 7:35pm

do you mean that the user ( as in, me ) should’ve done it or do you mean YunoHost should’ve done it?
if you mean YunoHost, I don’t think it makes sense to do that before each app upgrade, because these cases are not that usual, though I don’t think they are edge-cases either. However, if YunoHost is maintaining the /etc/resolv.dnsmasq.conf file and using that file as the target of the /etc/resolv.conf symlink, then they should probably check whether the nameservers in /etc/resolv.dnsmasq.conf are actually working or not ( which is unlikely, because the diagnostic tool recommends otherwise ).

that’s exactly what happened in my case. I was saying that the system should not have deleted the app because it wasn’t the app’s fault. Then again, there is no way for an automated system to figure out whose fault it was unless it is intelligent ( but no thanks, I don’t want AI on my servers ).

tmb · May 26, 2020, 10:22pm

Yep, as the diagnostics only issue a warning, they seem to allow for some other (working) custom configurations. But a quick check for working name resolution (local and remote) before starting any actual upgrade attempt seems a good idea to me: it is essential, it can avoid the package removal that happens when something only fails later, and most importantly it allows pointing the user to a specific problem. (Instead of just leaving behind a general mess )
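Such a pre-flight check could be sketched as a time-limited lookup. This only illustrates the idea and is not something YunoHost is known to do: within_time_limit is a hypothetical helper, the 5-second limit is an arbitrary assumption, and whether a getent lookup reproduces the exact delay nginx hit is not established in this thread.

```shell
# Hypothetical helper: run a command, but fail if it takes longer than
# $1 seconds or exits non-zero. Relies on timeout(1) from GNU coreutils.
within_time_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
}

# Usage: refuse to start an upgrade when the local hostname is slow or
# impossible to resolve (getent goes through the system resolver).
#   if ! within_time_limit 5 getent hosts "$(hostname)"; then
#       echo "Name resolution is slow or broken; fix DNS before upgrading." >&2
#       exit 1
#   fi
```

Failing early like this leaves the apps untouched and gives the user one specific thing to look at.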

Aleks · May 26, 2020, 10:48pm

Yes indeed, we can probably add something like this … On the other hand here my understanding is that the setup was not “nominal” and once you start drifting away from the nominal setup and tweak things manually, there’s just too many implicit and hidden assertions about what should work to check all of them …

Like, we can check that the hostname is correctly resolved locally … but that doesn’t prevent somebody from tweaking the dnsmasq conf to change the return value … so maybe the hostname is resolved, but to an incorrect value that will also trigger a similar issue to what happened … You’ll say “then just force people to not tweak things” but then somebody will come and complain “Why does Yunohost prevent me from tweaking how DNS resolution works !!?” (which already happened anyway)

But maybe that’s just me becoming so pessimistic about automated system administration ¯\_(ツ)_/¯

shine · May 26, 2020, 11:48pm

No, that’s a totally valid emotion to have. I mean, if I was in your situation, I’d feel the same way. I can actually feel your pain too. I understand how demanding system administration can be.

tmb · May 27, 2020, 12:03am

Maybe, just adding some actual name resolution tests to the resolv.conf symlink diagnostics (instead of to the updating process), and issuing an actual error if these diagnostics fail, can avoid that rabbit hole feeling.

tmb · May 27, 2020, 12:06am

But the diagnostics page won’t prevent the uninstalling if there is for example just some temporary external DNS server error. (Even with a plain default configuration.)

EDIT:
So, a small set of basic connectivity checks seems better, to prevent trivial but not uncommon errors from having disproportionately large, hard-to-diagnose impacts.
