Semi-relatedly, you can connect to the crt.sh Postgres instance and query it directly with SQL:
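A minimal sketch of that connection, assuming the publicly documented crt.sh parameters (host crt.sh, port 5432, user guest, database certwatch) and the certwatch schema's ct_log table; double-check these against the crt.sh site before relying on them:

    # read-only public instance; parameters as commonly documented for crt.sh
    psql -h crt.sh -p 5432 -U guest certwatch

    -- once connected, e.g. list a few of the CT logs that crt.sh monitors
    -- (table name assumed from the certwatch schema)
    SELECT * FROM ct_log LIMIT 5;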
To generate the SQL queries in the web UI, simply click “advanced” and then the “Show SQL” checkbox, or append it to the URL, like so: (Note the generated SQL at the bottom of that page.)

Steampipe also has a crt.sh connector: https://hub.steampipe.io/plugins/turbot/crtsh/tables/crtsh_c...
This title reads strangely to me, as though Postgres certificates, used when connecting over TLS, will be visible in the CTL.
Better would be: CTL can now use a postgres backend.
> However, a CT log failure earlier this year due to MariaDB corruption after disk space exhaustion provided the motivation for a change. PostgreSQL, with its robust Write-ahead Logging (WAL) and strict adherence to ACID (Atomicity, Consistency, Isolation, Durability) principles, made it a better option for avoiding corruption and improving data integrity of the log.
This sounds incredibly suspicious to me. I wonder which storage engine they were using or if there was any unusual configuration that contributed to this issue.
> a CT log failure earlier this year due to MariaDB corruption after disk space exhaustion provided the motivation for a change.
That seems like a rather serious bug. Disappointing there's not more follow up with MariaDB
They know about the problem but appear uninterested in improving the situation: https://mariadb.com/kb/en/database-corruption-and-data-loss-...
I don't really think your link supports that conclusion.
If you let a disk run full weird shit happens.
But it reads to me as though _we_, not MariaDB, made a big oopsie and TADA \o/ now we change software. So I do strongly hope that besides changing software, they added some disk space monitoring.
> If you let a disk run full weird shit happens.
Only in buggy software that ignores errors from system calls.
Obviously you can expect availability problems when you run out of space, but there's no excuse for losing data from committed transactions given that the OS will reliably report the error.
> So I do strongly hope that besides changing software, they added some disk space monitoring.
One of the action items in their incident report was improving monitoring and alerting: https://groups.google.com/a/chromium.org/g/ct-policy/c/038B7...
Keep in mind the report doesn't mention what MariaDB version was being used, nor what storage engine. The description in the incident report simply doesn't align with how InnoDB behaves, at all. (InnoDB is the default storage engine in MariaDB and MySQL, and by far the most widely used storage engine in these DBs.)
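For what it's worth, checking which engine a table uses is a single query against information_schema; the schema name below is a hypothetical placeholder:

    -- works on both MariaDB and MySQL
    SELECT table_name, engine
      FROM information_schema.tables
     WHERE table_schema = 'ct_log_db';  -- hypothetical schema name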
For example, the author says "Whilst its primary key index still includes the ~1,300 unsequenced rows, the table itself no longer contains them." This is simply impossible in InnoDB, which uses a clustered index – the primary key B-tree literally is the table with InnoDB's design. The statement in the incident report is complete nonsense.
Also if the storage volume runs out of space, InnoDB rolls back the statement and returns an error. See https://dev.mysql.com/doc/refman/5.7/en/innodb-error-handlin... (MySQL 5.7 doc, but this also aligns with MariaDB's behavior for this situation)
So maybe they were using MyISAM, which is ancient, infamously non-crash-safe, and should pretty much never be used for anything of consequence; that has been well known in the database world for 15+ years. Who knows, they didn't say.
The report directly says "Our teams have very little expertise with MySQL/MariaDB", which also makes me question their conclusion "that it is not possible" to restore/recover.
Also this whole incident happened to Sectigo, take from that what you will.
The one thing I'd criticize MariaDB for here is that they're lacking an additional safety mechanism which MySQL has: by default MySQL will shut down the database server if a write to the binary log cannot be successfully made, ensuring that any writes on the primary can be replicated as well. (iirc this was a contribution from Facebook.)
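For reference, on MySQL that behavior is controlled by the binlog_error_action system variable, which defaults to ABORT_SERVER on modern versions (5.7.7+); a quick way to check it, assuming a MySQL server:

    -- returns ABORT_SERVER by default on MySQL 5.7.7+;
    -- MariaDB has no equivalent variable, per the comment above
    SHOW VARIABLES LIKE 'binlog_error_action';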
Please enlighten me with some software examples that are free of bugs :)
It's not about being bug-free, it's about being crash-safe. You should be able to crash your software, whether by pulling the power cord from the server, killing the process, or turning the disk read-only (similar to what happens when it fills up), and your database should still survive.
Data corruption like that should just not happen. You have your journaled filesystems, your RAID, your ZFS pools, and whatnot; all of that is worth nothing if your database software can just say "I have encountered an error during a write operation and now your data is inconsistent, good luck". This is exactly what journaling / the write-ahead log / InnoDB's doublewrite buffer are supposed to prevent (see the settings sketch below).
There is an article, "Crash-only software: More than meets the eye", on LWN.net if you would like to read more about it. The Postgres fsync bug is vaguely related to the same issue and is also worth reading about.
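A minimal sketch of the settings check referred to above, assuming MariaDB/MySQL with InnoDB (the variable names are real, but defaults vary by version):

    -- durability-related knobs: doublewrite buffer, redo log flushing, binlog syncing
    SHOW VARIABLES WHERE Variable_name IN
      ('innodb_doublewrite',              -- protects against torn/partial page writes
       'innodb_flush_log_at_trx_commit',  -- 1 = flush the redo log to disk at every commit
       'sync_binlog');                    -- 1 = fsync the binary log at every commit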
> Only in buggy software that ignores errors from system calls.
Is there any reason to conclude that is what happened?
Bad things can happen when your disk is full, but "the whole table is now lost beyond repair" is not reasonable for a database.
Completely agreed, but "the whole table is now lost beyond repair" is simply not how InnoDB behaves, at all! So the whole incident report is quite fishy. See my reply to the sibling comment.
Translation: we had data loss and had to change storage provider to one that works.
Where works means "does not eat data"... A very basic requirement for storage :)
Are there Merkle hashes between the rows in the PostgreSQL CT store like there are in the Trillian CT store?
Sigstore Rekor also has centralized Merkle hashes.
Doesn’t Rekor run on top of Trillian?