Nextcloud upgrade: how to shrink oc_filecache.ibd (441GB at the moment)

Hi all,

TL;DR
The upgrade of Nextcloud from 18 to 20 is taking much longer than I anticipated. I think the cause is the size of the filecache, combined with HDD media. Can that cache be cleared prior to the upgrade?

My YunoHost server

Hardware: Container on Proxmox, NAS-like box
YunoHost version: just upgraded from 3.8 to 4(.1?)
I have access to my server : Through SSH | through the webadmin : webadmin is ‘busy’ atm | direct access via keyboard / screen if necessary
Are you in a special context or did you perform some particular tweaking on your YunoHost instance ? : no

Description of my issue

The Proxmox box has an SSD and some HDDs; I moved YunoHost from the SSD to an HDD because the SSD was not large enough to hold all the Nextcloud files.

That was before upgrading: the SSD was full enough that I worried about backup space while performing the upgrade from YNH3.8 to YNH4+. That upgrade went, as far as I can see, without trouble (matrix-synapse needed some help, and a downgrade of an SSL package, both documented by others on the forum).

After upgrading YNH, most apps upgraded fine. Nextcloud was one that ended in an error: mysqldump: Got errno 28 on write (errno 28 being “no space left on device”). YNH had some 20G of free space when starting the upgrade, so not very little, but also not a lot.

I extended the disk space by 150G to have some headroom before restarting the upgrade last night. This morning there was an nginx gateway timeout on the web interface, and I could not reconnect. YunoHost commands via SSH just seemed to hang, while the system was not very busy (low CPU and low RAM usage).

The YunoHost API log ends with:

# tail -f yunohost-api.log
2021-01-29 00:05:22,345 DEBUG    yunohost.hook <lambda> - [331.45] 24006 + ynh_mysql_dump_db --database=nextcloud
2021-01-29 00:05:22,345 DEBUG    yunohost.hook <lambda> - [331.45] 24006 + local legacy_args=d
2021-01-29 00:05:22,345 DEBUG    yunohost.hook <lambda> - [331.45] 24006 + args_array=([d]=database=)
2021-01-29 00:05:22,345 DEBUG    yunohost.hook <lambda> - [331.45] 24006 + local -A args_array
2021-01-29 00:05:22,346 DEBUG    yunohost.hook <lambda> - [331.45] 24007 + local database
2021-01-29 00:05:22,346 DEBUG    yunohost.hook <lambda> - [331.45] 24007 + ynh_handle_getopts_args --database=nextcloud
2021-01-29 00:05:22,346 DEBUG    yunohost.hook <lambda> - [331.45] 24007 + set +o xtrace
2021-01-29 00:05:22,346 DEBUG    yunohost.hook <lambda> - [331.45] 24007 ++ cat /etc/yunohost/mysql
2021-01-29 00:05:22,547 DEBUG    yunohost.hook <lambda> - [331.45] 24208 + mysqldump --user=root --password=Ex7eziT2uW --single-transaction --skip-dump-date nextcloud
2021-01-29 00:05:22,547 DEBUG    yunohost.hook <lambda> - [331.45] 24208 [########++++........] > Backing up the MySQL database...

At first I thought the process had crashed, but comparing disk usage over time showed that something was still running:

# # before upgrade 
# df -hTt ext4
Filesystem                            Type  Size  Used Avail Use% Mounted on
/dev/mapper/yunostmp-vm--104--disk--0 ext4  1.1T  837G  179G  83% /
# # this morning 
# df -hTt ext4
Filesystem                            Type  Size  Used Avail Use% Mounted on
/dev/mapper/yunostmp-vm--104--disk--0 ext4  1.1T  910G  107G  90% /
# # a while later 
# df -hTt ext4
Filesystem                            Type  Size  Used Avail Use% Mounted on
/dev/mapper/yunostmp-vm--104--disk--0 ext4  1.1T  922G   94G  91% /

iotop showed quite high disk activity, writing a few MB/s.

That led me to check the MySQL files:

# du -hs /var/lib/mysql/nextcloud
438G    /var/lib/mysql/nextcloud
# ls -hs /var/lib/mysql/nextcloud| grep G
total 438G
438G oc_filecache.ibd

That explains a lot! It also explains where all the disk space went in the first place :smiley: (Nextcloud is mostly filled with snapshots and movies from our phones, plus syncs of some laptop directories; all in all <0.5TB of the 1TB SSD)
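
By the way, the same picture can be confirmed from inside MariaDB; something like this should list the biggest tables (the sizes come from information_schema, so no full table scan is needed):

MariaDB [nextcloud]> SELECT table_name,
    ->        ROUND(data_length/1024/1024/1024)  AS data_gb,
    ->        ROUND(index_length/1024/1024/1024) AS index_gb
    ->     FROM information_schema.tables
    ->     WHERE table_schema = 'nextcloud'
    ->     ORDER BY data_length + index_length DESC
    ->     LIMIT 5;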

Looking at the size of this file (400G+), the original free disk space (<200G) and the write speed of the spinning HDD (~5MB/s), it will take the rest of the day before the backup ends in “out of free space” again.

What would be the best way forward now? I see these options:

  1. Do nothing, Nextcloud 18 is quite OK!
  2. Mount the (now almost empty…) SSD at the right place for backups (/home/yunohost.backup/tmp/nextcloud-pre-upgrade1/apps/nextcloud/backup ?) so that only reading MySQL hits the slow disk, while writing goes to another, faster disk
  3. Extend the space for YNH so that it has enough room for the backup
  4. Kill the upgrade and clear the cache before upgrading

The first option is the easiest but the least interesting. The last option has my preference. I am not sure about the impact of killing the upgrade (is it handled gracefully? My experience with small upgrades: yes, it is), nor how to clear the Nextcloud filecache (the first results I found suggest the cache can only be cleared for deleted files, if at all).
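
Related to option 4: before clearing anything, it would be good to know which storage the cache rows belong to. Something like the query below should give a per-storage breakdown (assuming oc_storages.numeric_id is the column that oc_filecache.storage refers to; on a table this size the query itself will take a while):

MariaDB [nextcloud]> SELECT fc.storage, st.id AS storage_id, COUNT(*) AS cache_rows
    ->     FROM oc_filecache fc
    ->     LEFT JOIN oc_storages st ON st.numeric_id = fc.storage
    ->     GROUP BY fc.storage, st.id
    ->     ORDER BY cache_rows DESC
    ->     LIMIT 20;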

Thanks for any pointers!

Update :slight_smile:

I went with option 3, extending the disk space to just under 1.5T, and let the backup continue. This morning the backup was done and the upgrade continued.
What I had not anticipated:

2021-01-30 10:02:14,736 DEBUG    yunohost.hook <lambda> - [331.45] 122236349 + echo '[####################] > Backup script completed for nextcloud. (YunoHost will then actually copy those files to the archive).'

I was not at home, and only read it afterwards:

2021-01-30 10:04:08,501 DEBUG    yunohost.hook <lambda> - [331.45] 122350117 Creating a backup archive from the collected files...
2021-01-30 10:04:08,502 DEBUG    yunohost.hook <lambda> - [331.45] 122350117 Creating the backup TAR archive...
2021-01-30 10:04:08,502 DEBUG    yunohost.hook <lambda> - [331.45] 122350117 Not enough space at /home/yunohost.backup/archives (free: 225387286528 / needed: 318437877361)
2021-01-30 10:04:08,803 DEBUG    yunohost.hook <lambda> - [331.45] 122350407 action [66655.1] executed in 122350.054s
2021-01-30 10:04:08,804 WARNING  yunohost.hook <lambda> - [331.45] 122350408 Not enough free space on '/home/yunohost.backup/archives'
2021-01-30 10:04:08,905 DEBUG    yunohost.hook <lambda> - [331.45] + '[' 1 -eq 0 ']'
2021-01-30 10:04:08,905 WARNING  yunohost.hook <lambda> - [331.45] Backup failed, the upgrade process was aborted.

So, instead of option 3, at the moment I have effectively ended up with a combination of options 4 and 1.
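
In hindsight, option 2 could also have saved the day here: the step that failed only needed room under /home/yunohost.backup, so putting that directory on the (now almost empty) SSD before retrying should work. A rough sketch, with the SSD mount point as an assumption:

mkdir -p /mnt/ssd/yunohost.backup
cp -a /home/yunohost.backup/. /mnt/ssd/yunohost.backup/          # keep any existing archives
mount --bind /mnt/ssd/yunohost.backup /home/yunohost.backup      # lasts until the next reboot, or add it to fstab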

In the meantime I read some more stories about Nextcloud’s cache. The size can be attributed to external storage.

I’ll try shrinking the cache one way or another (to be continued…) before upgrading again.

Some observations for other YunoHost/Nextcloud users:

  • While the upgrade was running, we could continue using Nextcloud without a problem. Super! :slight_smile:
  • Now that the upgrade has been cancelled, we’re back at / still on Nextcloud 18, with no (negative) impact. Also good! :slight_smile:
  • The upgrade was running in the background. I started it from the web admin page, but that got a timeout somewhere during the first night. The upgrade is robust enough to continue while disconnected from the web interface. Bonus again :slight_smile:

Any hints for clearing the cache are very welcome. I’ll update the story once I get any further!

This nut is a bit harder to crack than I had thought :wink:

I ran select count(*) from oc_filecache while reading up on ways to shrink the cache. After an hour it still had not returned an answer (count(*) on InnoDB apparently scans an entire index, which on a 400GB+ table living on a spinning disk takes ages).

Now I use phpMyAdmin, which gives a record count in a fraction of a second. It reports ~321 million rows.
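
(As far as I understand, that instant number is the InnoDB row estimate that MariaDB keeps for the table, not an exact count; the same figure can be fetched directly:)

MariaDB [nextcloud]> SELECT table_rows
    ->     FROM information_schema.tables
    ->     WHERE table_schema = 'nextcloud' AND table_name = 'oc_filecache';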

Now I’ve been running sudo -u nextcloud php /var/www/nextcloud/occ files:cleanup for 10 minutes. I am not quite sure it has the effect I was hoping for. During that same time, the number of rows has grown from 321860098 to 321865269. It is only a small change, but I had hoped for fewer rows, not 5000 more!

I think I will disconnect the remote storage, disable the thumbnail/preview generator app (I’m not sure I even have it, I can’t find it at the moment…), reboot the server (to aggressively free the memory used while making the backup and running count(*)), and then run occ files:cleanup again to see what it can do.
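
To check whether the preview generator is actually installed (the app id previewgenerator is a guess on my part), something like:

sudo -u nextcloud php /var/www/nextcloud/occ app:list | grep -i preview
sudo -u nextcloud php /var/www/nextcloud/occ app:disable previewgenerator    # only if it shows up above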

Unfortunately there is no way to disable the storage configurations, so I set them to read-only, with no sharing and no previews.

One way or another the (local storage) ‘Multimedia’ and ‘Shared multimedia’ mount points got multiplied; I had four of each and deleted three. One other external storage (SFTP) was duplicated as well; I deleted the bottom one.
The /home/$user mount point did not get duplicated, and the same goes for the WebDAV mount points.

In the meantime the number of rows has risen to 321885972; something is very busy creating new rows (occ files:cleanup is still running, which I suppose suppresses the growth a bit).
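
One suspect for the growing count is Nextcloud’s background job (the cron.php entries for the nextcloud user that show up in syslog); pausing it while cleaning up should at least rule that out. A sketch; where exactly the entry lives is something to check first:

crontab -u nextcloud -l                         # is the cron.php entry in the user crontab...
grep -r cron.php /etc/cron.d/ /etc/crontab      # ...or in a system cron file?
# comment the entry out temporarily, and remember to restore it afterwards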

Weeeelll…

I started the reboot cautiously by first running service mysql stop, but that has been running for quite a while now.

I figured shutting down nginx might prevent running web apps from hammering MySQL, so I shut that down too, and only then realised there is probably a way to tell Nextcloud to shut down gracefully.

I think I should have started by running sudo -u nextcloud php occ maintenance:mode --on, then stopped nginx, then stopped MySQL, but now I’ll just wait it out.
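
For next time, roughly the order I intend to use (the php7.3-fpm service name is an assumption based on the PHP version in the logs):

sudo -u nextcloud php /var/www/nextcloud/occ maintenance:mode --on   # stop Nextcloud writing to the database
systemctl stop nginx                                                 # no new web requests
systemctl stop php7.3-fpm                                            # stop the PHP workers
systemctl stop mysql                                                 # let MariaDB shut down last, undisturbed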

At the moment:

root@online:/var/www/nextcloud# service mysql status
● mariadb.service - MariaDB 10.3.27 database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
   Active: deactivating (final-sigterm) (Result: timeout) since Sat 2021-01-30 14:51:08 UTC; 26min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
 Main PID: 20846 (mysqld)
   Status: "Shutdown in progress"
    Tasks: 23 (limit: 38288)
   Memory: 2.5G
   CGroup: /system.slice/mariadb.service
           └─20846 /usr/sbin/mysqld

Jan 28 21:35:45 online mysqld[20846]: 2021-01-28 21:35:45 1 [ERROR] InnoDB: preallocating 393216 bytes for file ./ffsync/batch_upload_items.
Jan 28 21:35:45 online mysqld[20846]: 2021-01-28 21:35:45 3 [ERROR] InnoDB: preallocating 393216 bytes for file ./ffsync/batch_upload_items.
Jan 30 12:57:08 online mysqld[20846]: 2021-01-30 12:57:08 42947 [Warning] Access denied for user 'mysql'@'localhost' (using password: NO)
Jan 30 14:51:08 online systemd[1]: Stopping MariaDB 10.3.27 database server...
Jan 30 14:51:08 online mysqld[20846]: 2021-01-30 14:51:08 0 [Note] /usr/sbin/mysqld (initiated by: unknown): Normal shutdown
Jan 30 14:51:09 online mysqld[20846]: 2021-01-30 14:51:09 0 [Note] Event Scheduler: Purging the queue. 0 events
Jan 30 14:51:09 online mysqld[20846]: 2021-01-30 14:51:09 0 [Note] InnoDB: FTS optimize thread exiting.
Jan 30 14:51:29 online mysqld[20846]: 2021-01-30 14:51:29 0 [Warning] /usr/sbin/mysqld: Forcing close of thread 37956  user: 'nextcloud'
Jan 30 14:51:29 online mysqld[20846]: 2021-01-30 14:51:29 0 [Warning] /usr/sbin/mysqld: Forcing close of thread 34799  user: 'nextcloud'
Jan 30 15:06:08 online systemd[1]: mariadb.service: State 'stop-sigterm' timed out. Skipping SIGKILL.
root@online:/var/www/nextcloud# 

‘Skipping SIGKILL’ sounds good for a database, but ‘timed out’ is not so good for getting to the next step :-/

Ok, it is getting somewhere, but … not really where I want to be, I think:

Jan 30 15:09:00 online systemd[1]: Starting Clean php session files...
Jan 30 15:09:01 online CRON[174911]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jan 30 15:09:09 online systemd[1]: phpsessionclean.service: Succeeded.
Jan 30 15:09:09 online systemd[1]: Started Clean php session files.
Jan 30 15:10:43 online systemd[1]: Stopping A high performance web server and a reverse proxy server...
Jan 30 15:10:43 online systemd[1]: nginx.service: Succeeded.
Jan 30 15:10:43 online systemd[1]: Stopped A high performance web server and a reverse proxy server.
Jan 30 15:12:54 online postfix/smtpd[175082]: connect from unknown[141.98.10.136]
Jan 30 15:12:55 online postfix/smtpd[175082]: disconnect from unknown[141.98.10.136] ehlo=1 auth=0/1 quit=1 commands=2/3
Jan 30 15:14:47 online dovecot: imap-login: Disconnected (no auth attempts in 10 secs): user=<>, rip=192.241.224.157, lip=80.127.182.180, TLS handshaking: Connection closed, session=<6oXalB+6wKHA8eCd>
Jan 30 15:15:01 online CRON[175105]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jan 30 15:16:15 online postfix/anvil[175088]: statistics: max connection rate 1/60s for (smtp:141.98.10.136) at Jan 30 15:12:54
Jan 30 15:16:15 online postfix/anvil[175088]: statistics: max connection count 1 for (smtp:141.98.10.136) at Jan 30 15:12:54
Jan 30 15:16:15 online postfix/anvil[175088]: statistics: max cache size 1 at Jan 30 15:12:54
Jan 30 15:17:56 online postfix/smtpd[175133]: connect from unknown[141.98.10.143]
Jan 30 15:17:56 online postfix/smtpd[175133]: disconnect from unknown[141.98.10.143] ehlo=1 auth=0/1 quit=1 commands=2/3
Jan 30 15:20:01 online CRON[175150]: (nextcloud) CMD (/usr/bin/php7.3 -f /var/www/nextcloud/cron.php)
Jan 30 15:20:41 online slapd[687]: slap_global_control: unrecognized control: 1.3.6.1.4.1.42.2.27.8.5.1
Jan 30 15:20:41 online nslcd[21787]: [e21002] <authc="admin"> cn=admin,dc=yunohost,dc=org: Invalid credentials
Jan 30 15:20:53 online slapd[687]: slap_global_control: unrecognized control: 1.3.6.1.4.1.42.2.27.8.5.1
Jan 30 15:20:53 online nslcd[21787]: [7d63f4] <authc="admin"> cn=admin,dc=yunohost,dc=org: Invalid credentials
Jan 30 15:21:04 online slapd[687]: slap_global_control: unrecognized control: 1.3.6.1.4.1.42.2.27.8.5.1
Jan 30 15:21:04 online nslcd[21787]: [ec11b2] <authc="admin"> cn=admin,dc=yunohost,dc=org: Invalid credentials
Jan 30 15:21:08 online systemd[1]: mariadb.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Jan 30 15:21:08 online systemd[1]: mariadb.service: Failed with result 'timeout'.
Jan 30 15:21:08 online systemd[1]: Stopped MariaDB 10.3.27 database server.
Jan 30 15:21:16 online postfix/anvil[175136]: statistics: max connection rate 1/60s for (smtp:141.98.10.143) at Jan 30 15:17:56
Jan 30 15:21:16 online postfix/anvil[175136]: statistics: max connection count 1 for (smtp:141.98.10.143) at Jan 30 15:17:56
Jan 30 15:21:16 online postfix/anvil[175136]: statistics: max cache size 1 at Jan 30 15:17:56
Jan 30 15:21:16 online slapd[687]: slap_global_control: unrecognized control: 1.3.6.1.4.1.42.2.27.8.5.1
Jan 30 15:21:16 online nslcd[21787]: [da00ed] <authc="admin"> cn=admin,dc=yunohost,dc=org: Invalid credentials
Jan 30 15:21:27 online slapd[687]: slap_global_control: unrecognized control: 1.3.6.1.4.1.42.2.27.8.5.1
Jan 30 15:21:27 online nslcd[21787]: [209072] <authc="admin"> cn=admin,dc=yunohost,dc=org: Invalid credentials
Jan 30 15:22:36 online mysqld[20846]: 2021-01-30 15:22:36 0 [Note] InnoDB: Starting shutdown...
Jan 30 15:22:36 online mysqld[20846]: 2021-01-30 15:22:36 0 [Note] InnoDB: Dumping buffer pool(s) to /var/lib/mysql/ib_buffer_pool
Jan 30 15:22:36 online mysqld[20846]: 2021-01-30 15:22:36 0 [Note] InnoDB: Instance 0, restricted to 2048 pages due to innodb_buf_pool_dump_pct=25
Jan 30 15:22:36 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:36 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:36 online mysqld[20846]: 2021-01-30 15:22:36 0 [Note] InnoDB: Buffer pool(s) dump completed at 210130 15:22:36
Jan 30 15:22:36 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:36 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:37 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:37 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:37 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:22:37 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:07 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:07 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:07 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:07 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:07 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:10 online mysqld[20846]: 2021-01-30 15:23:10 0 [Note] InnoDB: Shutdown completed; log sequence number 893008838103; transaction id 586035935
Jan 30 15:23:10 online mysqld[20846]: 2021-01-30 15:23:10 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
Jan 30 15:23:10 online mysqld[20846]: 2021-01-30 15:23:10 0 [Note] /usr/sbin/mysqld: Shutdown complete
Jan 30 15:23:10 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known
Jan 30 15:23:10 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known

Especially ‘failed mode’ does not sound OK:

Jan 30 15:21:08 online systemd[1]: mariadb.service: State 'stop-final-sigterm' timed out. Skipping SIGKILL. Entering failed mode.
Jan 30 15:21:08 online systemd[1]: mariadb.service: Failed with result 'timeout'.

and I don’t know what to think of

Jan 30 15:22:36 online systemd[1]: mariadb.service: Got notification message from PID 20846, but reception only permitted for main PID which is currently not known

Any suggestions?
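
(If I end up here again, I’ll watch what MariaDB is actually doing during the slow shutdown instead of guessing, e.g.:)

journalctl -fu mariadb       # follow the unit’s log during the shutdown
iotop -o                     # confirm mysqld is still flushing to disk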

Update on the story:

  • The symptoms above turned out to be ‘not a problem’, i.e., probably just symptoms of a regular (if slow) shutdown.
  • After rebooting, everything was running normally.
  • I’ll leave the post as-is, in case someone searches for similar symptoms

After rebooting I started
sudo -u nextcloud php /var/www/nextcloud/occ files:cleanup
It has been chugging away at removing rows.

The total number of rows in the table was 321112967 just now. The highest fileid is 413696392:

MariaDB [nextcloud]> select max(fileid) from oc_filecache;
+-------------+
| max(fileid) |
+-------------+
|   413696392 |
+-------------+

Just now, the row with fileid 24692024 was being deleted:

MariaDB [nextcloud]> show processlist;
....
DELETE FROM `oc_filecache` WHERE `fileid` = '24692024'
....

The deletes have been in ascending order of fileid, so very roughly the current delete translates to having scanned up to fileid 25M out of ~414M in a day and a half. Extrapolating, the cleanup will run for about 24 days.

While running, the total number of rows dropped from 321733570 to 321112967, so 620603 rows have been deleted. Of those, 100000 were ones I deleted manually (though maybe not correctly), so the cleanup took care of some 0.5M out of 300M while scanning a guesstimated 7% of the rows. Extrapolating, that would result in fewer than 10M rows deleted in total.

The rows I deleted manually were oc_filecache records where the ‘size’ field said ‘0’. That seems like cache of little importance, yet each one still takes up a row. Those 100000 deleted rows had lower IDs than the ones deleted by the cleanup, so the cleanup does not regard filesize=0 as a reason to delete a cache entry.

Anyway, that’s the story up to now :slight_smile:

In the meantime, the cleanup is still running. A minute ago it was at about record 81040333.

Now and again I drop in a manual delete, like

MariaDB [nextcloud]> delete from oc_filecache  where storage not in (select id from oc_storages) and fileid > 100000000  limit 20000 ;
Query OK, 20000 rows affected, 65535 warnings (4 min 58.144 sec)

That way I got down to 318983075 records by now.

Recreating the cache is probably faster than removing rows one by one, but I have not yet found out what I would break by just truncating the table.
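
For the record, the rebuild approach that keeps coming up elsewhere looks roughly like the sketch below. I have not tried it: truncating gives every file a new fileid, so anything keyed on fileid (shares, tags, comments, favourites) would presumably break, and a full files:scan over this much data will not be quick either. The truncate itself should at least reclaim the disk space, since the table has its own .ibd file.

sudo -u nextcloud php /var/www/nextcloud/occ maintenance:mode --on
mysql -u root -p"$(cat /etc/yunohost/mysql)" nextcloud -e "TRUNCATE TABLE oc_filecache;"
sudo -u nextcloud php /var/www/nextcloud/occ files:scan --all        # rebuild the cache from the actual files
sudo -u nextcloud php /var/www/nextcloud/occ maintenance:mode --off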

Ah, bummer.

After running for close to a week, the combined forces of occ files:cleanup and manual deletes dropped the number of records to 317 million.

Instead of manually running the delete with a subselect, I tried running it with a fixed value, i.e.,

MariaDB [nextcloud]> delete from oc_filecache  where storage = 22 and fileid > 100000000  limit 100000 ;
Query OK, 100000 rows affected (8 min 21.590 sec)

That ran a lot faster.

So I cancelled the occ files:cleanup to continue with just manual deletes. Unfortunately, the cleanup seems to run as a single transaction, so everything it had removed was rolled back: I’m back at 321 million records now. Didn’t lose too many superfluous records, gained some experience :wink:

Edit: I got the storage ID to start with by running

MariaDB [nextcloud]> select  storage from oc_filecache where storage not in (select id from oc_storages) and fileid > 106000000 limit 1;
+---------+
| storage |
+---------+
|      22 |
+---------+
1 row in set, 32 warnings (0.050 sec)

Once I’m through the dataset with ID 22, I’ll run the select again for the next storage ID that does not exist anymore, and run manual deletes with that ID.
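
If there turn out to be many of these orphaned storage IDs, a loop along these lines could automate it (an untested sketch: it assumes oc_storages.numeric_id is the right join column and uses the MySQL root password from /etc/yunohost/mysql; each DELETE is kept small so an interrupt does not roll back much):

PW=$(cat /etc/yunohost/mysql)
while :; do
  SID=$(mysql -N -u root -p"$PW" nextcloud -e \
    "SELECT storage FROM oc_filecache
     WHERE storage NOT IN (SELECT numeric_id FROM oc_storages) LIMIT 1;")
  [ -z "$SID" ] && break                       # stop when no orphaned storage is left
  echo "cleaning up storage $SID"
  while :; do
    N=$(mysql -N -u root -p"$PW" nextcloud -e \
      "DELETE FROM oc_filecache WHERE storage = $SID LIMIT 100000; SELECT ROW_COUNT();")
    [ "$N" -eq 0 ] && break                    # this storage is done, look for the next one
  done
done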

Hi @wbk,

I haven’t read the details of your problem as I’m not very familiar with Nextcloud or MySQL (simple user here), but have you tried getting support from Nextcloud directly? https://help.nextcloud.com/ might be a good place to look. From my quick skim, your problem might not be related to YunoHost?


Hi Arthur,

Thanks for your suggestion! I’m also a simple user :slight_smile:

Yes, I also posted on the Nextcloud forum after a while. I haven’t cross-referenced yet, as I don’t have a solution there either.

I think / am quite sure the problem is not in YunoHost itself.

I have removed all the external storage; the database is not growing anymore, but not shrinking either. I’ll probably bite the bullet and extend the storage space, so that there is an extra TB available for the backup and the upgrade.

Maybe next weekend, or so :slight_smile:
