Problems with the site - work ongoing (aka when databases go boom!)

Hi all,

Today I noticed that the database that backs the main Marlin Firmware Service has suffered some corruption.

After working on this over the last 6 hours, we believe we've traced it back to a bug - or at least IO that isn't functioning quite correctly - when running a database that flushes to disk using the O_DIRECT method.

In theory, this should mean that any write to the database is written and flushed directly to the disk, to minimise any chance of corruption should something go wrong. In this case, it doesn’t look like that was working - and at some point, the database has been corrupted.
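For the technically curious, here's a minimal sketch of what an O_DIRECT-style write looks like on Linux - it's an illustration of the idea only, not the database's actual IO code, and the file path is just a placeholder. The key points are that the file is opened with the O_DIRECT flag, which bypasses the kernel's page cache, and that the buffer has to be block-aligned for the write to be accepted.

```python
# Minimal sketch of an O_DIRECT write on Linux (illustration only,
# not the database's actual IO code). O_DIRECT bypasses the kernel
# page cache, so the buffer and write size must be block-aligned,
# and the filesystem has to support direct IO.
import mmap
import os

PAGE = mmap.PAGESIZE  # typically 4096 bytes

# An anonymous mmap gives us a page-aligned buffer, which O_DIRECT requires.
buf = mmap.mmap(-1, PAGE)
buf.write(b"example record".ljust(PAGE, b"\0"))

# Hypothetical demo file - any path on a filesystem that supports O_DIRECT.
fd = os.open("./direct-io-demo.bin", os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o600)
try:
    os.write(fd, buf)  # goes straight to the device, skipping the page cache
finally:
    os.close(fd)
    buf.close()
```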

There are three corrupted tables:

  • User accounts
  • Firmware Builder - Save printer profiles
  • Donation history

I’ve restored just about all user accounts; however, there is a possibility that some accounts were caught in the corrupted section of the table and could not be restored. If this affects you, please re-create your account with the same details, and any data that has been recovered will automatically appear on your new account.

If you’re caught up in this (it should only be a very small percentage of people), send me a direct message and I’ll restore any donation privileges you had on your account.

I’m working to restore as many printer profiles as I can - currently, that sits at about 50% of saved profiles. I’ll keep recovering as many as possible, but if your saved profiles don’t currently show, this is why. New profiles will save and work as they always have.

The donation history is purely informational and has no bearing on any donations you’ve made - I try to store as little information as possible online - so the only thing affected is the history displayed in your account. I’ll try to restore these records as well; any new donations will show up just fine.

The distributed firmware builders aren’t affected, nor are the nightly builds or any custom firmware that’s still available and hasn’t yet expired. This forum is also unaffected.

As far as stopping this from happening again? That’s a tough one.

The daily backups and dumps of database data failed silently once the corruption began - the export ran fine up until the first error, then silently stopped and reported the export as complete. This is obviously not ideal and needs to be improved.
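As an example of the kind of check that was missing, here's a rough sketch of a "fail loudly" dump wrapper - the command arguments, paths and size threshold below are placeholders, not the site's actual backup job. It refuses to call a dump successful unless the dump tool exits cleanly and the output file looks sane.

```python
# Rough sketch of a dump wrapper that fails loudly - the path and size
# threshold are placeholders, not the site's real backup configuration.
import subprocess
import sys
from pathlib import Path

DUMP_PATH = Path("/backups/marlin.sql")      # hypothetical output path
MIN_EXPECTED_BYTES = 50 * 1024 * 1024        # hypothetical sanity threshold

result = subprocess.run(
    ["mariadb-dump", "--all-databases", f"--result-file={DUMP_PATH}"],
    capture_output=True,
    text=True,
)

if result.returncode != 0:
    # A non-zero exit (e.g. a corrupted table aborting the dump) must be fatal,
    # not silently ignored.
    sys.exit(f"dump failed: {result.stderr.strip()}")

if not DUMP_PATH.exists() or DUMP_PATH.stat().st_size < MIN_EXPECTED_BYTES:
    sys.exit("dump file is missing or suspiciously small - refusing to call this a backup")

print("backup completed and passed basic sanity checks")
```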

As this failed silently, I don’t know when the actual corruption happened - so it’s hard to trace it back to a specific event. The database was still functioning correctly and still answering queries, with no real errors as long as it was reading data outside the corrupted areas. It also didn’t log any errors until it happened to hit somewhere with corruption - which might be a single user account.

Looking further ahead - I’ve been talking to the MariaDB team. Firstly, in the upcoming MariaDB 11.2 they plan to add at least a basic repair function to the InnoDB engine. While this won’t stop corruption from occurring, it will make recovering whatever data is not corrupted an operation of a few minutes - not 6+ hours.

I’ve also passed on the complete details of the host machine, the software running on it, and the configurations to the MariaDB team, as they suspect the problem lies within the Linux kernel and the BTRFS sync system. Until this has been rectified, we’ve switched to using O_FSYNC instead of O_DIRECT - which should force a complete flush of any filesystem buffers on every database write. Hopefully, this will stop the root cause of this happening again - with a small performance decrease, but it shouldn’t really be noticeable to you guys.
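To illustrate the difference from the O_DIRECT sketch above, an fsync-style flush looks roughly like this - again a sketch of the general idea, not MariaDB's internals, with a made-up file name. The write goes through the normal page cache, and an explicit fsync() then forces the kernel to push its buffers out to disk before the write is treated as durable.

```python
# Sketch of fsync-style durability (the general idea, not MariaDB's internals):
# write through the page cache as normal, then force the kernel to flush
# its buffers to disk before treating the write as complete.
import os

def durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # block until the filesystem buffers for this file hit the disk
    finally:
        os.close(fd)

durable_write("./example.log", b"one database record\n")  # hypothetical file
```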

I’d like to publicly thank Monty - I believe part of the MariaDB team - for the fantastic help in holding my hand through picking out as many records from the DB as possible to minimise data loss, as well as for referring me to the MariaDB InnoDB developers for help in recovering data. I wouldn’t have had a chance to recover this much without your help. Thank you for your efforts.

So that’s what happened in a nutshell. Thanks for reading - I hope this explains why the site has had its first actual unplanned downtime in over 2 1/2 years, and I hope you can all allow me some time to restore what I can. I apologise for any lost data that may well not be recoverable, but hopefully we won’t see this level of issue again.

Thanks again.

EDIT 2022-01-06 @ 22:22 AEST
I’ve now implemented a streaming replica of the database to a completely isolated system. This will store an entire copy of the database on a different server in a completely different physical location, which should minimise data loss should this problem ever happen again. This is in addition to the existing nightly DB backups, daily filesystem snapshots, and the daily run of entire system backups.
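For anyone wondering how a replica like this gets watched, a health check can look something like the sketch below - the host name and credentials are placeholders, it assumes the PyMySQL client library, and it is not the site's actual monitoring. It simply asks the replica for its status and alerts if replication has stopped or fallen too far behind.

```python
# Sketch of a replica health check - host/credentials are placeholders and
# this assumes the PyMySQL client library; not the site's actual monitoring.
import pymysql

conn = pymysql.connect(
    host="replica.example.internal",   # hypothetical replica host
    user="monitor",
    password="********",
    cursorclass=pymysql.cursors.DictCursor,
)
try:
    with conn.cursor() as cur:
        cur.execute("SHOW SLAVE STATUS")   # MariaDB's replica status command
        status = cur.fetchone()

    if status is None:
        print("ALERT: replication is not configured on this server")
    elif status["Slave_IO_Running"] != "Yes" or status["Slave_SQL_Running"] != "Yes":
        print("ALERT: replication threads have stopped")
    elif (status["Seconds_Behind_Master"] or 0) > 300:
        print(f"ALERT: replica is {status['Seconds_Behind_Master']}s behind")
    else:
        print("replica is healthy")
finally:
    conn.close()
```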
