Random system issues

oliv85559 · June 25, 2023, 11:09am

Hi everyone,

My server has been running fine for a few months, except that every few days or weeks some services stop working: for instance Nextcloud, WordPress, login over SSH… A reboot always fixes the issues. The weird thing is that not all services are affected each time, and I cannot find any error messages in the log files. I have been maintaining Linux servers for 20 years and I have never seen anything like this. The hardware is an Intel NUC. I ran a complete memtest, which passed. I will try to find a replacement NUC to see if it helps, but I was wondering if someone has other ideas.

Cheers
Olivier

metyun · June 25, 2023, 12:18pm

Hi,

It could be the OOM killer. I’m not sure where the OOM logs end up, but I think it is /var/log/syslog. Search for "killed process" or something similar.
If you find OOM events, you can try to expand the swap or add RAM (or, if there should be enough RAM, investigate whether something is leaking memory).
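
A minimal sketch of that search, assuming the kernel messages land in /var/log/syslog as on a Debian-style system. The helper name find_oom_events is made up for illustration, and the patterns are the phrases the kernel usually prints when the OOM killer fires:

```shell
# Hypothetical helper: print OOM-killer lines found in a log file.
# Typical use (as root):  find_oom_events /var/log/syslog
# The same patterns also work on a pipe, e.g.
#   journalctl -k | grep -iE 'out of memory|oom-kill|killed process'
find_oom_events() {
    # Matches "Out of memory: Killed process ...", "... invoked oom-killer"
    # and "oom-kill:constraint=..." lines, case-insensitively.
    grep -iE 'out of memory|oom-kill|killed process' -- "$1"
}
```

If this prints nothing, the OOM killer most likely did not run during the period the log covers (rotated logs would have to be checked separately).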

oliv85559 · June 25, 2023, 12:35pm

That is a good hint, thanks. I checked the logs and could not find anything related to OOM. The theory does not fit well with some of the symptoms anyway. For instance, when I have issues with SSH, the server is still running, but it refuses my SSH key. The issue is fixed after a reboot.

oliv85559 · June 25, 2023, 12:47pm

There is something weird in the kernel logs, though I do not know if it is related:

Jun 21 11:27:57 kernel: [   48.973113] IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
Jun 22 00:55:37 kernel: [48508.986757] perf: interrupt took too long (2539 > 2500), lowering kernel.perf_event_max_sample_rate to 78750
Jun 22 08:49:48 kernel: [76960.944229] perf: interrupt took too long (3182 > 3173), lowering kernel.perf_event_max_sample_rate to 62750
Jun 22 22:35:26 kernel: [126498.882131] perf: interrupt took too long (3993 > 3977), lowering kernel.perf_event_max_sample_rate to 50000
Jun 24 13:26:26 kernel: [266360.796129] perf: interrupt took too long (4992 > 4991), lowering kernel.perf_event_max_sample_rate to 40000

arkadi · June 25, 2023, 1:44pm

These kinds of problems are the worst. Have you thought of installing something like NetData? It might help you gather more data.

metyun · June 25, 2023, 3:30pm

In the past, my YunoHost was installed on a Raspberry Pi 3 B+ and I had the same symptoms (SSH was lost while the server was still running). I resolved it by increasing the swap and setting some sysctl options in /etc/sysctl.d/99-sysctl.conf. These options are:

vm.swappiness=1
vm.overcommit_memory = 1
net.core.somaxconn=1024

Don’t forget to load these new settings.
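
As a rough sketch of that last step: `sudo sysctl --system` reloads every file under /etc/sysctl.d, and the helper below then compares the file against what the kernel reports. The name check_sysctl_file and its second argument are assumptions made for illustration, not part of any tool:

```shell
# Load the settings first (as root):
#   sudo sysctl --system
#
# Hypothetical helper: compare a sysctl config file with the live values.
# $1 = config file, $2 = root of the sysctl tree (defaults to /proc/sys)
check_sysctl_file() {
    conf=$1
    root=${2:-/proc/sys}
    # Drop comments and blank lines, then split each line on the first "="
    grep -Ev '^[[:space:]]*([#;]|$)' "$conf" | while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        want=$(printf '%s' "$want" | sed 's/^[[:space:]]*//; s/[[:space:]]*$//')
        # vm.swappiness lives at <root>/vm/swappiness
        file="$root/$(printf '%s' "$key" | tr '.' '/')"
        have=$(cat "$file" 2>/dev/null || echo '?')
        if [ "$have" = "$want" ]; then
            echo "ok       $key = $have"
        else
            echo "MISMATCH $key: file wants $want, kernel has $have"
        fi
    done
}
```

For example, `check_sysctl_file /etc/sysctl.d/99-sysctl.conf` prints one line per option. It only handles simple single-value keys like the three above.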

oliv85559 · June 25, 2023, 7:55pm

These are good tips, thanks. I have installed NetData, it provides a lot of information…

I have also checked the memory setup:

free -h
               total        used        free      shared  buff/cache   available
Mem:           7.6Gi       1.1Gi       4.6Gi        58Mi       1.9Gi       6.2Gi
Swap:          979Mi          0B       979Mi

cat /proc/swaps
Filename  Type      Size   Used Priority
/dev/dm-2 partition 1003516 0     -2

lvdisplay
  --- Logical volume ---
  LV Path                /dev/myhost-vg/swap_1
  LV Name                swap_1
  VG Name                myhost-vg
  LV UUID                ...
  LV Write Access        read/write
  LV Creation host, time myhost, 2023-02-19 18:21:48 +0100
  LV Status              available
  # open                 2
  LV Size                980.00 MiB
  Current LE             245
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:2

cat /proc/sys/net/core/somaxconn
4096

I do not know why I have such a small swap. However, theoretically, with 8GB of RAM, the system should never have to swap, because it handles a very light load. But still, I will keep an eye on this.
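
To keep an eye on it, a small sketch like the following puts the swap size in proportion to RAM. It reads the MemTotal and SwapTotal fields of /proc/meminfo; the function name swap_summary is made up for illustration:

```shell
# Hypothetical helper: summarize RAM and swap from a meminfo-style file.
# $1 = file to read (defaults to /proc/meminfo, values are in kB)
swap_summary() {
    awk '/^MemTotal:/  { mem = $2 }
         /^SwapTotal:/ { swap = $2 }
         END {
             if (mem == 0) { print "MemTotal not found"; exit 1 }
             printf "RAM %d MiB, swap %d MiB (%d%% of RAM)\n",
                    mem / 1024, swap / 1024, 100 * swap / mem
         }' "${1:-/proc/meminfo}"
}
```

Calling `swap_summary` with no argument reports on the running system.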

arkadi · June 26, 2023, 12:24am

After running NetData for a bit I was getting errors about my swap too, so I created a bigger swap file. It’s been much better since then.

It’s good to have data!
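
Roughly, the usual steps for a swap file look like the sketch below. The path /swapfile, the 2048 MiB size and the helper name plan_swapfile are assumptions, not what was actually used here. By default it only prints the commands; setting APPLY=1 and running it as root would execute them:

```shell
# Hypothetical helper: print (or, with APPLY=1, run) the swap file steps.
# $1 = swap file path, $2 = size in MiB
plan_swapfile() {
    file=${1:-/swapfile}
    size_mib=${2:-2048}
    # Dry run unless APPLY=1 is set in the environment
    run() { if [ "${APPLY:-0}" = 1 ]; then "$@"; else echo "$*"; fi; }
    run dd if=/dev/zero of="$file" bs=1M count="$size_mib"  # allocate the file
    run chmod 600 "$file"                                   # readable by root only
    run mkswap "$file"                                      # write the swap signature
    run swapon "$file"                                      # enable it right away
}
# To keep the swap file after a reboot, an /etc/fstab line such as
#   /swapfile none swap sw 0 0
# is normally added as well.
```

Running `plan_swapfile /swapfile 2048` first makes it easy to review the commands before applying anything.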

pqc · July 2, 2023, 2:48pm

I have also experienced random unresponsiveness episodes, and a reboot would fix my issues too.

Thanks for the tip about NetData, @arkadi! I installed it, and there is so much data, so many insights! I’m not entirely sure what I’m looking at, or what to look for. But my inbox has been overwhelmed with emails since I installed it, mainly related to memory, so that’s a starting point at least!

oliv85559 · July 2, 2023, 5:44pm

With NetData (CPU load) I found out that a process called “converter” was constantly being started by the system, causing unnecessary load. It belongs to OnlyOffice. There was an error message in the log files:

Error: Configuration property "server.isAnonymousSupport" is not defined

Since I do not use OnlyOffice right now, I uninstalled it. I am not sure if it had something to do with my random issues, but at least now the CPU load is back to normal.
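
For anyone without NetData, a process that keeps getting restarted can also be spotted from the shell. This is only a sketch with made-up helper names: it takes repeated `ps` snapshots and counts how many distinct PIDs each command name had, since a respawning process collects many PIDs while a long-running one keeps a single PID:

```shell
# Hypothetical helper: read "PID COMMAND" lines on stdin and print the
# number of distinct PIDs seen per command, highest first.
# Only the first word of the command name is used.
count_distinct_pids() {
    awk '{ print $1, $2 }' | sort -u |
        awk '{ n[$2]++ } END { for (c in n) print n[c], c }' | sort -rn
}

# Take one snapshot per second for $1 seconds (default 30) and tally them.
sample_pids() {
    n=${1:-30}
    i=0
    while [ "$i" -lt "$n" ]; do
        ps -eo pid=,comm=
        sleep 1
        i=$((i + 1))
    done | count_distinct_pids
}
```

`sample_pids 60 | head` lists the commands with the most PIDs over a minute. Very short-lived processes can still slip between two snapshots.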
