My server has been running fine for a few months, except that every few days or weeks some services stop working, for instance Nextcloud, WordPress, or login over SSH… A reboot always fixes the issue. The weird thing is that not all services are affected every time, and I cannot find any error message in the log files. I have been maintaining Linux servers for 20 years and I have never seen anything like it. The hardware is an Intel NUC. I ran a complete memtest, which passed. I will try to find a replacement NUC to see if that helps, but I was wondering whether someone has other ideas.
It could be the OOM killer. I'm not sure where the OOM logs end up, but I think it is /var/log/syslog. Search for "Killed process" or something similar.
If you find OOM events, you can try expanding the swap or adding RAM (or, if there is already enough RAM, investigate whether something is leaking memory).
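To make the search concrete, here is a minimal sketch. The helper name `find_oom_events` is made up for illustration, and the log location is an assumption that depends on the distribution. The patterns are the usual kernel wording ("invoked oom-killer", "Out of memory: Killed process ...").

```shell
# Print kernel OOM-killer messages found in the given log file(s),
# or in standard input when no file is given.
# find_oom_events is a hypothetical helper name, not a standard tool.
find_oom_events() {
    grep -iE 'oom-killer|out of memory|killed process' "$@"
}

# Possible usage (paths are assumptions, adjust to your system):
#   find_oom_events /var/log/syslog
#   journalctl -k | find_oom_events
```

If this prints nothing, no OOM kill was logged in that file, although older entries may already have been rotated away.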
That is a good hint, thanks. I checked the logs and could not find anything related to OOM. The theory does not fit some of the symptoms well anyway. For instance, when I have issues with SSH, the server is still running, but it refuses my SSH key. The issue is fixed after a reboot.
There is something weird in the kernel logs, though I do not know if it is related:
Jun 21 11:27:57 kernel: [ 48.973113] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
Jun 22 00:55:37 kernel: [48508.986757] perf: interrupt took too long (2539 > 2500), lowering kernel.perf_event_max_sample_rate to 78750
Jun 22 08:49:48 kernel: [76960.944229] perf: interrupt took too long (3182 > 3173), lowering kernel.perf_event_max_sample_rate to 62750
Jun 22 22:35:26 kernel: [126498.882131] perf: interrupt took too long (3993 > 3977), lowering kernel.perf_event_max_sample_rate to 50000
Jun 24 13:26:26 kernel: [266360.796129] perf: interrupt took too long (4992 > 4991), lowering kernel.perf_event_max_sample_rate to 40000
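In case it helps anyone comparing notes, here is a small sketch that pulls the successive sample-rate values out of lines like the ones above. The helper name `perf_rate_history` is made up for illustration, and it assumes the exact message format shown in these logs. As far as I know, the message only means the kernel is throttling perf sampling, so it may well be unrelated to the outages.

```shell
# Print the new kernel.perf_event_max_sample_rate value from each
# "perf: interrupt took too long" line, one value per line.
# perf_rate_history is a hypothetical helper name.
perf_rate_history() {
    sed -n 's/.*interrupt took too long.*lowering kernel\.perf_event_max_sample_rate to \([0-9][0-9]*\).*/\1/p' "$@"
}

# Possible usage (log path is an assumption):
#   perf_rate_history /var/log/kern.log
# The value currently in effect can be read with:
#   sysctl kernel.perf_event_max_sample_rate
```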
In the past, my YunoHost was installed on an RPi 3 B+ and I had the same symptoms (SSH lost while the server was still running). I resolved it by increasing the swap and setting some sysctl options in /etc/sysctl.d/99-sysctl.conf. These options are:
These are good tips, thanks. I have installed NetData, it provides a lot of information…
I have also checked the memory setup:
free -h
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.1Gi       4.6Gi        58Mi       1.9Gi       6.2Gi
Swap:          979Mi          0B       979Mi
cat /proc/swaps
Filename        Type         Size       Used    Priority
/dev/dm-2       partition    1003516    0       -2
lvdisplay
--- Logical volume ---
LV Path /dev/myhost-vg/swap_1
LV Name swap_1
VG Name myhost-vg
LV UUID ...
LV Write Access read/write
LV Creation host, time myhost, 2023-02-19 18:21:48 +0100
LV Status available
# open 2
LV Size 980.00 MiB
Current LE 245
Segments 1
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 254:2
cat /proc/sys/net/core/somaxconn
4096
I do not know why I have such a small swap. However, theoretically, with 8GB of RAM, the system should never have to swap, because it handles a very light load. But still, I will keep an eye on this.
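To keep an eye on it, something like the sketch below could be run from cron and appended to a log, so that after the next incident it is possible to see whether swap had filled up. The helper name `swap_usage` is made up for illustration. It assumes the usual /proc/meminfo layout with `SwapTotal:` and `SwapFree:` lines in kB.

```shell
# Print "used_kib total_kib" for swap, read from a meminfo-style file.
# swap_usage is a hypothetical helper; it defaults to /proc/meminfo.
swap_usage() {
    awk '/^SwapTotal:/ {total = $2}
         /^SwapFree:/  {unused = $2}
         END {print total - unused, total}' "${1:-/proc/meminfo}"
}

# Possible usage (the log file name is an assumption):
#   echo "$(date '+%F %T') $(swap_usage)" >> /var/log/swap-usage.log
```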
I have also experienced random unresponsiveness episodes, and a reboot would fix my issues too.
Thanks for the tip about NetData, @arkadi! I installed it, there is so much data, so many insights! I’m not sure entirely what I’m looking at or for. But my inbox has been overwhelmed with emails since I installed it, mainly related to memory, so that’s a starting point at least!
With NetData (CPU load) I found out that a process called “converter” was constantly started by the system, causing unnecessary load. It belongs to OnlyOffice. There was an error message in the log files:
Error: Configuration property "server.isAnonymousSupport" is not defined
Since I do not use OnlyOffice right now, I uninstalled it. I am not sure if it had something to do with my random issues, but at least now the CPU load is back to normal.
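For anyone without NetData, a rough way to spot a process that keeps getting started is to sample the process list a few times and count the command names. This is only a sketch: `count_by_name` is a made-up helper name, and very short-lived processes can still slip between samples.

```shell
# Read one command name per line on standard input and print
# "count name" lines, most frequent first.
# count_by_name is a hypothetical helper name.
count_by_name() {
    sort | uniq -c | sort -rn | sed 's/^ *//'
}

# Possible usage: take 30 snapshots, 2 seconds apart, and show the
# ten most frequently seen command names.
#   for i in $(seq 1 30); do ps -eo comm=; sleep 2; done | count_by_name | head
```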