Random system issues

Hi everyone,

My server has been running fine for a few months, except that every few days or weeks some services stop working, for instance Nextcloud, WordPress, or SSH login. A reboot always fixes the issue. The weird thing is that not all services are affected each time, and I cannot find any error messages in the log files. I have been maintaining Linux servers for 20 years and have never seen anything like this. The hardware is an Intel NUC. I ran a complete memtest, which passed. I will try to find a replacement NUC to see if it helps, but I was wondering if anyone has other ideas.



It could be the OOM killer. I’m not sure where the OOM logs end up, but I think they are in /var/log/syslog. Search for "Killed process" or something similar.
If you find OOM events, you can try to expand the swap or add RAM (or, if there is already enough RAM, investigate whether there is a memory leak).
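
To make the log search above concrete, here is a minimal sketch. It assumes the kernel messages are in a plain-text file such as /var/log/syslog; the function name `find_oom_events` is made up for illustration, and the search patterns are the usual kernel wording for OOM kills, which may differ slightly between kernel versions.

```shell
#!/bin/sh
# find_oom_events is a hypothetical helper name, not a standard tool.
# It prints every line of the given log file that looks like a kernel
# OOM-killer message. Default file: /var/log/syslog (an assumption).
find_oom_events() {
    log_file="${1:-/var/log/syslog}"
    # -i: case-insensitive, -E: extended regex with alternatives
    grep -iE 'invoked oom-killer|out of memory|killed process' "$log_file"
}
```

Usage would be something like `find_oom_events /var/log/syslog`. On a systemd machine without that file, piping the kernel journal in should also work: `journalctl -k | find_oom_events -`. No output means no match was found in that particular log, not necessarily that no OOM kill ever happened, since older logs may have been rotated away.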

That is a good hint, thanks. I checked the logs but could not find anything related to OOM. The theory does not fit well with some of the symptoms anyway. For instance, when I have issues with SSH, the server is still running, but it refuses my SSH key. The issue is fixed after a reboot :face_with_raised_eyebrow:

There is something weird in the kernel logs though, but I do not know if it is related:

Jun 21 11:27:57 kernel: [   48.973113] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
Jun 22 00:55:37 kernel: [48508.986757] perf: interrupt took too long (2539 > 2500), lowering kernel.perf_event_max_sample_rate to 78750
Jun 22 08:49:48 kernel: [76960.944229] perf: interrupt took too long (3182 > 3173), lowering kernel.perf_event_max_sample_rate to 62750
Jun 22 22:35:26 kernel: [126498.882131] perf: interrupt took too long (3993 > 3977), lowering kernel.perf_event_max_sample_rate to 50000
Jun 24 13:26:26 kernel: [266360.796129] perf: interrupt took too long (4992 > 4991), lowering kernel.perf_event_max_sample_rate to 40000

These kinds of problems are the worst. Have you thought of installing something like NetData? It might help you gather more data.

In the past, my YunoHost was installed on an RPi 3 B+ and I had the same symptoms (SSH lost while the server was still running). I resolved it by increasing the swap and setting some sysctl options in /etc/sysctl.d/99-sysctl.conf. These options are:

vm.overcommit_memory = 1

Don’t forget to load the new settings.
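
For loading and checking the setting, something along these lines should work. This is only a sketch: `check_overcommit` is a made-up helper name, and the reload command is left as a comment because it needs root. For reference, vm.overcommit_memory = 1 tells the kernel to always allow memory overcommit.

```shell
#!/bin/sh
# Reload all sysctl configuration files, including /etc/sysctl.d/*.conf.
# Needs root, so it is only shown as a comment here:
#   sudo sysctl --system

# check_overcommit is a hypothetical helper: it compares the value the
# running kernel reports with the value you expect.
# $1: expected value, $2: file to read (defaults to the real /proc entry).
check_overcommit() {
    expected="$1"
    proc_file="${2:-/proc/sys/vm/overcommit_memory}"
    actual=$(cat "$proc_file") || return 2
    if [ "$actual" = "$expected" ]; then
        echo "vm.overcommit_memory = $actual (as expected)"
    else
        echo "vm.overcommit_memory = $actual (expected $expected)"
        return 1
    fi
}
```

After the reload, `check_overcommit 1` should confirm that the value from the config file is really active.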

These are good tips, thanks. I have installed NetData, it provides a lot of information…

I have also checked the memory setup:

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.1Gi       4.6Gi        58Mi       1.9Gi       6.2Gi
Swap:          979Mi          0B       979Mi

cat /proc/swaps
Filename   Type       Size     Used  Priority
/dev/dm-2  partition  1003516  0     -2

  --- Logical volume ---
  LV Path                /dev/myhost-vg/swap_1
  LV Name                swap_1
  VG Name                myhost-vg
  LV UUID                ...
  LV Write Access        read/write
  LV Creation host, time myhost, 2023-02-19 18:21:48 +0100
  LV Status              available
  # open                 2
  LV Size                980.00 MiB
  Current LE             245
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:2

cat /proc/sys/net/core/somaxconn

I do not know why I have such a small swap. However, theoretically, with 8GB of RAM, the system should never have to swap, because it handles a very light load. But still, I will keep an eye on this.
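
For keeping an eye on swap without opening a dashboard, a small sketch like the following could be run from time to time or from cron. It assumes the standard SwapTotal and SwapFree fields of /proc/meminfo; the function name `swap_usage` is made up for illustration.

```shell
#!/bin/sh
# swap_usage is a hypothetical helper: it prints how much swap is in use,
# based on the SwapTotal and SwapFree lines of a meminfo-style file.
# $1: file to read (defaults to /proc/meminfo).
swap_usage() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "SwapTotal:" { total = $2 }
        $1 == "SwapFree:"  { free = $2 }
        END {
            if (total == 0) { print "no swap configured"; exit }
            used = total - free
            printf "swap used: %d kB of %d kB (%.1f%%)\n", used, total, 100 * used / total
        }
    ' "$meminfo"
}
```

Calling `swap_usage` with no argument reads the live values. A percentage that keeps climbing between reboots would support the memory theory; a value that stays at 0 would argue against it.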

After running NetData for a bit, I was getting errors about my swap too, so I created a bigger swap file. It’s been much better since then.

It’s good to have data!

I have also experienced random unresponsiveness episodes, and a reboot would fix my issues too.

Thanks for the tip about NetData, @arkadi! I installed it, there is so much data, so many insights! I’m not sure entirely what I’m looking at or for. But my inbox has been overwhelmed with emails since I installed it, mainly related to memory, so that’s a starting point at least! :smile:

With NetData (CPU load) I found out that a process called “converter” was constantly started by the system, causing unnecessary load. It belongs to OnlyOffice. There was an error message in the log files:

Error: Configuration property "server.isAnonymousSupport" is not defined

Since I do not use OnlyOffice right now, I uninstalled it. I am not sure if it had something to do with my random issues, but at least now the CPU load is back to normal.

