RPi3 server freezes irregularly - help me find the cause

Jaxom99 · June 5, 2018, 12:42pm

(French version available on request.)

My YunoHost configuration

Hardware: Raspberry Pi 3B v2 (1 GB RAM), WDLabs HDD as main storage, SD card used for boot only
Internet access: ethernet at home (Freebox, 16Mbps down / 1Mbps up)
YunoHost version: stable - 2.7.12 for all parts.
Diagnosis Result: https://paste.swordarmor.fr/jaxom/KvnO
System: Raspbian 8.0 Jessie, installed from BootBerry

Description of my problem

My server has been running smoothly with YunoHost for more than a year, with few users (family, non-techie) and various apps:
- Dokuwiki
- Framagames
- Hextris
- I Hate Money
- Jirafeau
- Keeweb
- Mailman
- MiniDLNA
- Multi custom webapp
- Multi custom webapp
- Custom Webapp
- Custom Webapp
- Nextcloud
- OpenSondage
- Rainloop
- Redirect
- Redirect
- Tiny Tiny RSS
- Wallabag

For the past week or two, the server has been freezing completely at various times and becoming unreachable: no web pages displayed, no SSH login, but it still answers ping on the LAN. The only recovery option, as I have no external monitor/keyboard to plug into it, is a power off/on cycle.

I cannot work out what is going wrong, as my monitoring (see charts below) shows no noticeable spike in CPU or memory consumption before the event. The server seems to keep some kind of activity, because logging continues (syslog, but no authlog).

I did not perform any operation just before this problem appeared. The configuration has been unchanged for a few months and this instability is recent.

My question is : where can I look to find the cause of this behaviour ?
The last documented event was last night: the server went offline at 9:01pm and I power-cycled it at 0:42am. Here are the full syslog and authlog as a starting point. I have multiple subdomains (almost one per app), so I can upload their logs as needed. SSH auth is only allowed via public key, on port 22.
Syslog : https://paste.swordarmor.fr/jaxom/6SZC&ln
Authlog : https://paste.swordarmor.fr/jaxom/jiJv&ln
(they are quite verbose: 26k and 7k lines)

Thanks in advance to the community for the help, and please don’t hesitate to redirect me to relevant resources elsewhere.

Aleks · June 5, 2018, 12:49pm

Hm those are always tricky to understand and solve …

The fact that it answers ping is kinda interesting though. I wonder if that could be something similar to what we discussed for an internet cube a few days ago. So one thing to try would be to add swap to the system, in case you don’t already have any.

One other thing we can look at is maybe “dmesg”, (I guess it’s kern.log ?) and see if it says something useful right when the event happened…

Aleks · June 5, 2018, 12:56pm

Okay, actually I hadn’t even looked at the log, but syslog is very informative.

Starting at 22:45:24, the kernel started to kill a lot (a lot ?) of processes because of lack of memory. Then for some reason it looks like the system got rebooted ? (not sure if “properly rebooted” or if it violently crashed and rebooted)

Up to Jun 4 22:52:12 where it starts saying : Warning: /home/yunohost.backup/tmp/20170704-024337/data/home/<user> is no longer mounted. See http://wiki2.dovecot.org/Mountpoints

That’s weird actually, it’s talking about yunohost.backup/tmp but this is only used when creating a backup.

Do you happen to have a script that does automatic backups ?

Jaxom99 · June 5, 2018, 12:59pm

Yeah, I tried to add swap, but turns out either I can’t read a tutorial or my system is weirder than I thought.
On Raspbian, they (http://raspberrypimaker.com/adding-swap-to-the-raspberrypi/) say to use dphys-swapfile, but the system answers swapon: /var/swap.file: swapon failed: Input/output error. Another thread (quite old, though) seemed to find this normal : https://www.raspberrypi.org/forums/viewtopic.php?p=500323

Will look into dmesg ASAP.

Jaxom99 · June 5, 2018, 1:04pm

Indeed, it seems to reboot in some way, but without returning to a stable state. And the free memory of the system has been lower in the past without those troubles. There also seem to have been other episodes of “lowmemorykiller” that didn’t end up in a reboot. But maybe I’m asking too much of my RPi
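
To put numbers on those episodes, here is a minimal sketch that counts memory-pressure kill messages per day in a syslog-style file. The function name oom_events_per_day is made up for illustration, and the three patterns (lowmemorykiller, "out of memory", oom-killer) are assumptions about how the kernel words these events on this system, so adjust them to whatever the log really contains.

```shell
#!/bin/sh
# Hypothetical helper: count memory-pressure kill messages per day.
# Assumes the classic syslog layout where the first two fields are
# the month and the day, e.g. "Jun  4 22:45:24 host kernel: ...".
oom_events_per_day() {
    logfile="${1:-/var/log/syslog}"
    # The patterns below are assumptions; adapt them to the real log lines.
    grep -iE 'lowmemorykiller|out of memory|oom-killer' "$logfile" \
        | awk '{ count[$1 " " $2]++ } END { for (d in count) print d, count[d] }' \
        | sort
}
```

Running oom_events_per_day /var/log/syslog prints one line per day with the number of matching messages (sorted alphabetically, not chronologically), which makes it easier to compare the days that ended in a freeze with the days that did not.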

I do not use yunohost backups in a automatic way, I only trigger it (via CLI or web) from time to time. I have a cron rsync distant backup, but it triggers at 2:40am so it seems unrelated to this.

Jaxom99 · June 5, 2018, 1:17pm

The kernel logs from the last two events are :

Last reboot : https://paste.swordarmor.fr/jaxom/WXFs&ln
The previous event (May 28 7:05pm to May 29 10:01am) : https://paste.swordarmor.fr/jaxom/nW38&ln

There are some “tasks blocked for more than 120 seconds” in both cases… php-fpm I get it (it’s both Nextcloud and TTRSS), but task sh:12211 is more cryptic…
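
For anyone wanting to see which tasks get stuck most often, a small sketch along these lines could summarise the hung-task warnings. It assumes the usual kernel wording "INFO: task NAME:PID blocked for more than N seconds" and a log at /var/log/kern.log; the function name blocked_tasks is only illustrative.

```shell
#!/bin/sh
# Hypothetical helper: list tasks reported as blocked, most frequent first.
# Assumes kernel lines of the form
#   "INFO: task php-fpm:1234 blocked for more than 120 seconds."
blocked_tasks() {
    logfile="${1:-/var/log/kern.log}"
    # Keep only the NAME:PID part of each matching line, then count duplicates.
    sed -n 's/.*INFO: task \([^ ]*\) blocked for more than [0-9]* seconds.*/\1/p' "$logfile" \
        | awk '{ n[$0]++ } END { for (t in n) print n[t], t }' \
        | sort -rn
}
```

Each output line is a count followed by the task name and PID, so an entry like sh:12211 can then be looked up around its timestamp in the full log to see what launched it.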

Aleks · June 5, 2018, 1:45pm

Well you certainly seem to have a lot of apps installed for a RPi I’d say yes ;D But maybe that’s fine (at least the graphs you showed do not worry me, but I’m not a monitoring expert).

Imho you should definitely look into the swap thing, e.g. if you run free -h and have 0 swap, that’s kinda problematic.
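
If you prefer a scriptable check over reading the free -h output by eye, something like the sketch below reads the SwapTotal line from /proc/meminfo. The helper names swap_total_kb and has_swap are made up for this example.

```shell
#!/bin/sh
# Hypothetical helpers: report configured swap from a meminfo-style file.
# Defaults to /proc/meminfo; a path can be passed for testing.
swap_total_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ { print $2 }' "$meminfo"
}

# Succeeds (exit status 0) only when SwapTotal is greater than zero.
has_swap() {
    total=$(swap_total_kb "$1")
    [ "${total:-0}" -gt 0 ]
}
```

For instance, has_swap || echo "no swap configured" prints the warning on a system where swap is missing or disabled.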

Have you tried to set it up with something like this ? (adapted from here)

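dd if=/dev/zero of=/swapfile bs=1024 count=1048576   # create a 1 GiB file
chmod 600 /swapfile                                  # swap must not be readable by other users
mkswap /swapfile
swapon /swapfile
echo "/swapfile swap swap defaults 0 0" >> /etc/fstab   # append (>>), do not overwrite fstab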

Not sure why you get the Input/Output error … I usually see that kind of error when disks are dying, but maybe here it’s just due to the file not being properly set up :s

Jaxom99 · June 5, 2018, 2:22pm

Thanks for looking into it. It kind of reassures me that we reach the same conclusions (so far).
The info I gather (here) is that Raspbian Jessie uses dphys-swapfile to manage swap. The /etc/fstab specifically mentions :

# a swapfile is not a swap partition, no line here
# use dphys-swapfile swap[on|off] for that
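
Since the swap file is managed by dphys-swapfile here, one thing worth checking is which file and size it is actually configured to use. The sketch below assumes the settings live in /etc/dphys-swapfile as shell-style CONF_SWAPFILE and CONF_SWAPSIZE variables, and that /var/swap is the fallback when CONF_SWAPFILE is unset; those are assumptions about the stock package, not something confirmed in this thread, and dphys_settings is just an illustrative name.

```shell
#!/bin/sh
# Hypothetical helper: print the swap file and size dphys-swapfile would use.
# Assumes the config is a shell-sourceable file setting CONF_SWAPFILE
# and CONF_SWAPSIZE (in MB); the fallback values below are assumptions.
dphys_settings() {
    conf="${1:-/etc/dphys-swapfile}"
    (
        # Run in a subshell so the sourced variables do not leak out.
        CONF_SWAPFILE=/var/swap
        CONF_SWAPSIZE=
        [ -f "$conf" ] && . "$conf"
        echo "swapfile=${CONF_SWAPFILE}"
        echo "size_mb=${CONF_SWAPSIZE:-auto}"
    )
}
```

After adjusting the config, the swap file would normally be recreated and enabled as root with dphys-swapfile setup followed by dphys-swapfile swapon. If swapon still reports an Input/output error on the resulting file, the problem is probably not in the configuration itself.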

Will keep searching…

DerekCaelin · June 6, 2018, 2:28am

Raspberry Pi B+ user here. I think I’m running into the same issue. My understanding of how all this works is pretty slim but I thought I’d mention that your issue isn’t isolated.

Jaxom99 · June 7, 2018, 2:09pm

Thanks for joining the thread.
Maybe you could post your config and some logs as well (via a paste service); that would bring more data to the table to help find the answer.