Server crashes because the system partition switches to read-only mode

croulibri · December 18, 2023, 8:43am

I have been using Yunohost for more than 6 years on an OVH kimsufi server.
6 months ago, taking advantage of high-speed internet at home, I moved the server home onto a Mini-PC (Intel N5100, 8GB DDR4 RAM) with a Crucial BX500 1TB SSD for /home and a 128GB M.2 SATA SSD for /

My Yunohost server is up to date (11.2.8.2).
On Friday night I tried to install ddclient from the Debian repository, following the guide “🚀 Configurer ddclient avec DynDNS Infomaniak - Infomaniak”, so that my IP would be updated automatically in case my almost-fixed IP changes.

On Saturday morning my server was unreachable:
whenever I tried to connect to any service, I got “500 Internal Server Error - nginx”.
So neither the webUI nor the SSH connection works anymore.

When I reboot, I can get in through the webUI or SSH for 1 to 5 minutes, and then the error comes back.
While connected, I ran the diagnosis from the webUI, and it didn’t show any error.

I removed ddclient (sudo apt-get remove ddclient) but my server still doesn’t work: I now get “404 Not Found - nginx”.

I connected a screen and keyboard to my home server and saw this on the screen:

systemd-journald[242] failed to rotate /var/log/journal/....
systemd-journald [242] failed to write entry (9 items, 245bytes)..

Both /home and / are less than 40% full, so I have plenty of space available.

When I restart the server and check the logs with journalctl --verify, I find:

8bff08: Invalid entry item (12/24 offset: 000000
8bff08: Invalid object contents: Bad message
File corruption detected at /var/log/journal/1adefe4644714958b95a2bcdfc0a6bfd/system@00060ca1ce98826f-3f42f18a37d54f8f.journal~:8bff08 (of 16777216 bytes, 54%).
FAIL: /var/log/journal/1adefe4644714958b95a2bcdfc0a6bfd/system@00060ca1ce98826f-3f42f18a37d54f8f.journal~ (Bad message)

I removed the corrupted log file,
then ran sudo systemctl restart systemd-journald

… but the server crashed again.

I restarted and, following @Aleks’s advice (in the Matrix support room), tried a relatively small backup, which failed with:

Error: "500"
Action: "POST" /yunohost/api/backups
Traceback
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/yunohost/backup.py", line 1968, in backup
tar.add(path["source"], arcname=path["dest"])
File "/usr/lib/python3.9/tarfile.py", line 1985, in add
File "/usr/lib/python3.9/tarfile.py", line 1985, in add
File "/usr/lib/python3.9/tarfile.py", line 1985, in add
[Previous line repeated 5 more times]
File "/usr/lib/python3.9/tarfile.py", line 1979, in add
File "/usr/lib/python3.9/tarfile.py", line 2007, in addfile
File "/usr/lib/python3.9/tarfile.py", line 247, in copyfileobj
OSError: [Errno 5] Input/output error
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/yunohost/log.py", line 410, in func_wrapper
File "/usr/lib/python3/dist-packages/yunohost/backup.py", line 2283, in backup_create
backup_manager.backup()
File "/usr/lib/python3/dist-packages/yunohost/backup.py", line 772, in backup
method.mount_and_backup()
File "/usr/lib/python3/dist-packages/yunohost/backup.py", line 1705, in mount_and_backup
self.backup()
File "/usr/lib/python3/dist-packages/yunohost/backup.py", line 1979, in backup
raise YunohostError("backup_creation_failed")
yunohost.utils.error.YunohostError: Could not create the backup archive
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.9/logging/__init__.py", line 1153, in close
File "/usr/lib/python3.9/logging/__init__.py", line 1063, in flush
OSError: [Errno 30] Read-only file system
During handling of the above exception, another exception occurred:
OSError: [Errno 30] Read-only file system
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/moulinette/interfaces/api.py", line 453, in process
File "/usr/lib/python3/dist-packages/moulinette/actionsmap.py", line 580, in process
File "/usr/lib/python3/dist-packages/yunohost/log.py", line 412, in func_wrapper
File "/usr/lib/python3/dist-packages/yunohost/log.py", line 678, in error
File "/usr/lib/python3/dist-packages/yunohost/log.py", line 707, in close
File "/usr/lib/python3.9/logging/__init__.py", line 1158, in close
OSError: [Errno 30] Read-only file system

So Aleks suspected a hardware problem and advised me to run:

cat /proc/mounts, which gives:

sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime,hidepid=invisible 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=3935792k,nr_inodes=983948,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=790872k,mode=755 0 0
/dev/sda2 / ext4 rw,relatime,errors=remount-ro 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate,memory_recursiveprot 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
none /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=10633 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
/dev/sdb1 /home ext4 rw,relatime 0 0
/dev/sda1 /boot/efi vfat rw,relatime,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=mixed,utf8,errors=remount-ro 0 0
tmpfs /run/user/73306 tmpfs rw,nosuid,nodev,relatime,size=790868k,nr_inodes=197717,mode=700,uid=73306,gid=73306 0 0

The problem seems to be that my system partition switches to read-only mode, which crashes the server. “The issue is understanding what error is happening exactly triggering the ro mode, though these are usually hardware issues,” Aleks said.
So he advised me to look at the tips and further info in “12.10 - Ubuntu goes into read-only mode randomly - Ask Ubuntu” and “permissions - Filesystem suddenly read-only? - Unix & Linux Stack Exchange”.
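
To confirm which filesystem has actually flipped to read-only, one option is to look at the options column of /proc/mounts once the errors start. In the output above, / is still rw and carries errors=remount-ro, the ext4 option that makes the kernel remount the filesystem read-only when it hits an error. The following is only a minimal sketch; the function name list_readonly_mounts is made up for this example.

```shell
# List filesystems currently mounted read-only.
# Reads a mounts table in /proc/mounts format (default: /proc/mounts itself).
# Field 4 holds the comma-separated mount options; a filesystem counts as
# read-only only when "ro" is one of them as a whole option, so
# "errors=remount-ro" on its own does not match.
list_readonly_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            if (opts[i] == "ro") { print $1, $2; break }
        }
    }' "${1:-/proc/mounts}"
}
```

Called with no argument it reads the live table; if "/dev/sda2 /" shows up in its output after the journald errors begin, the remount has happened on the system partition.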

I noticed that the server goes into read-only mode and becomes inaccessible not only during a backup, but sometimes around 10-15min after a reboot.
Surprisingly, if I start a file download through Filezilla (SFTP/SSH connection), then even if the server crashes and is inaccessible through console/SSH or webUI, the download through Filezilla can continue and complete.

I tested both SSDs with smartctl and they seem to be in good condition, with no errors reported.

I also ran sudo fsck -Cy <your partition> on both partitions (system and /home), and no issue was reported.

In addition, when the read-only mode kicks in, the screen fills with journald errors (failed to rotate / failed to write), so I can’t type any command or run any check.

How can I diagnose what is crashing my small server?

I could replace one of the SSDs (even though they are only one year old), but I have found no evidence of disk failure so far…
(I hope this is not off topic…)

Mamie · December 18, 2023, 9:43am

The first thing I suggest is to make a full backup on a new disk (using YunoHost backup tools so you can restore a server later).

From what I’ve seen, if the partition with the logs is read-only, you won’t have any logs on disk about what happened, BUT you’ll have them on screen. The problem is that there will be a lot of logs, and the root cause may not be visible unless you are literally filming the screen.

If it is the other partition that switches to read-only, you may have logs in /var/log/syslog

My last point: disks can fail even if they are new, and everybody should have backups (ideally 2 backups, in 2 different locations, and you should regularly validate the restore process)

Oh, and there is this app that can help with diagnostics on your disks: YunoHost app store | Scrutiny
(fsck gives me no errors, but this app has more info and one of my disks has a lot of errors, and it is the disk with the backups, so always have backups in 2 separate places)

cocoyuno · December 18, 2023, 2:16pm

Hi, I would agree it really looks like a hardware issue, especially with your M.2 SSD (cheap one??). A few comments:

“My /home and my / are used less than 40% so I have plenty of space available.”

These days there is log rotation and a maximum size allocated to the journal files (as nicely explained here for example), so you will rarely end up with the /var/log filesystem clogged up as could happen in the past. But file corruption can happen, which you can check with journalctl --verify. This can take a little while, and if all is good you should see PASS for all your files.
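
To pick out only the damaged files from a long journalctl --verify run, the FAIL lines can be filtered. This is a small sketch that assumes the "FAIL: <path> (<reason>)" line format quoted earlier in this thread; the function name list_failed_journals is hypothetical.

```shell
# Print the path of every journal file reported as FAIL.
# Reads `journalctl --verify` output on stdin and assumes lines of the form
#   FAIL: /path/to/file.journal (Bad message)
# PASS lines and anything else are ignored.
list_failed_journals() {
    sed -n 's/^FAIL: \(.*\) (.*)$/\1/p'
}
```

A possible invocation is sudo journalctl --verify 2>&1 | list_failed_journals; the 2>&1 is there on the assumption that the PASS/FAIL lines may be written to stderr rather than stdout.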

Also, your mount options could be improved… 1) I notice that you don’t do any filesystem check (last flag set to 0 means no check) and 2) for SSDs, adding noatime is a bit better than relatime, see for example the Debian wiki.

You can also check whether SSD trim is being done without errors. Sorry, I don’t have a live Yunohost system at the moment to double check, but you should have a systemd service called fstrim or something like that. systemctl status fstrim should give you some clues. Not sure whether a failed trim could cause the filesystem to switch to read-only, though.

Finally, if you can get your logs to work (??) then check for recent hardware driver errors from the kernel with journalctl -k --since yesterday -p 3. If you want to follow kernel messages “live”, use journalctl -k -p 3 -f (for example when you expect your filesystem to go read-only after boot).
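
Once some kernel messages have been captured, it can help to tally the disk errors per device to see which drive they point at. The sketch below is an illustration only: the function name count_disk_errors is invented, and the patterns cover just two kinds of lines ("I/O error" and "EXT4-fs error") followed by "device <name>" or "dev <name>", so other message formats would be missed.

```shell
# Count kernel I/O and ext4 error lines per block device.
# Reads kernel log text on stdin and prints "<device> <count>" lines,
# sorted by device name. Only lines containing "I/O error" or
# "EXT4-fs error" are considered, and the device is taken from the
# first "device <name>" or "dev <name>" found on the line.
count_disk_errors() {
    awk '
        /I\/O error|EXT4-fs error/ {
            if (match($0, /(device|dev) [a-z0-9]+/)) {
                hit = substr($0, RSTART, RLENGTH)
                sub(/^[a-z]+ /, "", hit)   # drop the leading "device " or "dev "
                count[hit]++
            }
        }
        END { for (d in count) print d, count[d] }
    ' | sort
}
```

For example: journalctl -k -p 3 --since yesterday | count_disk_errors. If the journal cannot be written because / is already read-only, the same filter could be fed from a saved copy of whatever appeared on screen.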

cocoyuno · December 18, 2023, 2:18pm

Cheap SSDs on servers can be troublesome, although I am sure dozens of Yunohost users would say that they have been running Yunohost successfully for x years without problems.

croulibri · December 20, 2023, 9:48am

Thank you so much @Mamie for your advice.
I have full backups and I have started restoring them on a temporary server to ensure continuity of service for my family.
I am currently focusing on that, to be sure everything works well over the Christmas period.
I will install Scrutiny on the repaired server to keep a closer eye on disk health.
And yes, I know that any equipment can fail, even when it is new. I just hoped it would not happen to me

Many thanks @cocoyuno for your suggestions.
Could you tell me what I should put in my mount options to improve the filesystem check? The current setup is the standard one from a minimal Debian install + the Yunohost script. I didn’t modify anything.
And I will switch to noatime.
I will also check fstrim and look for recent hardware driver errors.

I think I got some information by filming the screen of this home server during a crash. I have to do it again, but at the very beginning of the crash I noticed
Buffer I/O error on device sdb2
And then a EXT4-fs error (device sdb2)
So it could mean that it is not my cheap M.2 SSD (sda) that is failing but rather the Crucial BX500 SSD (sdb) I added to this server for the /home partition. Am I wrong?
This still has to be confirmed by filming another crash.
If this is the case, the BX500 is still under warranty, so I hope to get a refund for it.
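
One thing worth checking before blaming a drive: in the /proc/mounts output posted earlier, / is on /dev/sda2 and /home is on /dev/sdb1, and there is no sdb2 at all. A possible explanation (an assumption, not something confirmed in this thread) is that the kernel assigned the sda/sdb names in a different order on the boot that was filmed. The sketch below just looks a kernel device name up in a mounts table; the function name mountpoints_of is made up for illustration.

```shell
# Print the mount point(s) of a device, given its kernel name (e.g. sdb2).
# Reads a mounts table in /proc/mounts format (default: /proc/mounts).
# Prints nothing if no filesystem is mounted from /dev/<name>.
mountpoints_of() {
    awk -v dev="/dev/$1" '$1 == dev { print $2 }' "${2:-/proc/mounts}"
}
```

Running mountpoints_of sdb2 right after an error mentioning sdb2 shows which filesystem is affected on that boot, and lsblk -o NAME,MODEL,SERIAL,MOUNTPOINT shows which physical drive currently sits behind each name.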

cocoyuno · December 20, 2023, 11:12am

Mount options: in /etc/fstab, set the last field to 1 for the “/” filesystem and to 2 for the others you want checked.
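
Note that the check order has to be read from /etc/fstab: the last two columns of /proc/mounts are always the placeholders 0 0, so that output cannot tell whether a check is configured. The sketch below prints the sixth field (fs_passno) for each fstab entry; the function name show_fsck_pass and the example entries in the comments are hypothetical, not taken from the actual server.

```shell
# Print "<mount point> <fs_passno>" for every entry of an fstab-format file
# (default: /etc/fstab). Blank lines and comment lines are skipped, and a
# missing sixth field is reported as 0 (no check).
#
# Hypothetical entries following the advice above:
#   /dev/sda2  /      ext4  noatime,errors=remount-ro  0  1
#   /dev/sdb1  /home  ext4  noatime                    0  2
show_fsck_pass() {
    awk '
        NF == 0 || $1 ~ /^#/ { next }
        { print $2, (NF >= 6 ? $6 : 0) }
    ' "${1:-/etc/fstab}"
}
```

With the two example entries, the output would list / with 1 and /home with 2, meaning the root filesystem is checked first and the others afterwards.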
Your log files are on your system drive. Even if your sdb disk has problems, your sda disk should not go read-only. There is still something we don’t understand here I think…

croulibri · December 20, 2023, 5:14pm

Thank you @cocoyuno
This afternoon, my server stopped booting. From a live CD I find the sdb2 BX500 SSD in good health (I checked again with smartctl), but my sda1 M.2 SSD is not visible anymore. It might have died. After only four months of regular use…

I think I will finally have to buy a better M.2 SATA SSD…

cocoyuno · December 20, 2023, 6:05pm

Oh
I hope it’s “only” the SSD and not the system board doing weird things. You will see quickly if you change the system drive… good luck.

croulibri · December 21, 2023, 9:13pm

The company that sells these Trigkey Mini PCs is sending me a new M.2 SSD under warranty, so I will see whether only the M.2 SSD was defective.

Thanks again for your advice, and congrats again to the Yunohost devs, who allowed me (again) to reinstall my server from my last backup without any problem.
