I've been trying to set up a kind of mini data warehouse for my homelab, and I'm spoiled for choice. As I'm wont to do on this blog, I'm going to do a technical survey of the various options that seem interesting to me.
- Convenient Setup - installing, configuring, and maintaining that config over time is just too hard. Let's use an off-the-shelf (base) docker image from a reputable source. We don't want lots of docker containers for this either. Ideally just one container, not something like 5 docker-compose services.
- General Purpose - we might be doing time series analysis, full text search, point queries, or crazy joining. I don't want to lock in right now, nor use many databases. I want one single database to set up that can handle whatever kind of read load I throw at it at very low scale, with okay performance.
- Massive Ecosystem - I'd like to plug this database into Superset and various custom glue-code applications, and make the value of the data I collect as great as possible. Big ecosystems generally have lots of Stack Overflow answers, good official docs, etc.
- Convenient high-frequency backup and restore - For an upcycled homelab situation, servers get unplugged, spontaneously restart, or just fail all the time. I want to leverage cloud storage for backup/restore to minimize data loss at rest. When I plug the server back in, I want to be able to easily recover as much data from before the power incident as possible.
So, while I keep telling myself I love NoSQL databases (and I think that's still true?), I'm really looking at the most well-known databases here and their upstart, mostly-compatible twins. Sorry Mongo and DynamoDB! As much as I'd love to use Presto/Hive or even ClickHouse to be able to query directly against something like S3 or minio, all of those options have far too many moving parts and would require 2-5 docker containers just for a hello-world-style beginning. Once we've whittled down the list this far, the ease and power of the backups probably starts to look like the differentiating factor.
Based on the above criteria, I'll be looking at:
- Postgres
- MySQL
- MariaDB
- CockroachDB
Flavors of Backup
It's really hard to compare the various backup options offered by different databases without having a detailed understanding of the various mechanisms for backup that exist and the inherent tradeoffs in each. Before diving into the databases, I'm going to try to compare and contrast these different styles of backup.
|Flavor|Trigger|Whole dataset?|Locking|FS Tweaks Needed|
|---|---|---|---|---|
|Logical Backup|Normally cron|Optionally|Table|No|
|File System Snapshot|Sys admin|Mandatory|Server|Yes|
What I'm calling "replication" here has many different flavors itself. For all practical purposes, every flavor of replication requires one of the other forms of backup to anchor the changes since the last backup. Restoring then requires a sequence of steps: find the appropriate full backup, then replay the replication data "on top of" it.
Postgres leverages its own Write-Ahead Logs ("WAL") to enable consistent crash recovery, and it exposes the ability for server admins to automatically "archive" these WALs with arbitrary scripts. Those scripts could upload WALs to S3, etc. However, if uploading the WALs ever fails, it can cause Postgres to become unavailable. If you give up on a WAL, then your backup will have lost data.
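As a sketch of what that archiving setup looks like (the bucket name and the use of the aws CLI are my own assumptions, not a recommendation), the relevant `postgresql.conf` settings are roughly:

```
# postgresql.conf -- archive each completed WAL segment to S3
# (bucket name and the aws CLI are placeholder assumptions)
archive_mode = on
archive_command = 'aws s3 cp %p s3://homelab-backups/wal/%f'
```

If the `archive_command` keeps failing, completed WAL segments pile up on local disk until the server runs out of space, which is the unavailability risk mentioned above.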
Replicating the logs to another server and forming a cluster is also a form of backup, in that it offers a "backup" server in case anything happens to the master. It can also be used to reduce load on the leader during a full backup, or even theoretically to restore a new master. Since replication doesn't really let us easily back up to a commodity cloud storage system, I'm going to rule out this kind of replication.
MySQL accomplishes its replication to other servers using the binary log, or "binlog", which contains all mutation-inducing queries. Like Postgres leveraging WALs for incremental backups, MySQL can be configured to leverage the binlog for incremental backups to a file. Unlike with WALs, MySQL is less vulnerable to network blips taking down the database server. However, configuring, testing, and monitoring the binlog is its own chunk of ongoing work.
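A rough sketch of a binlog-based incremental backup (the host, binlog file name, and bucket below are placeholders I've made up; check `SHOW BINARY LOGS` on a real server for the actual file names):

```shell
# Rotate to a fresh binlog so the previous one is complete
mysql -h db.local -e "FLUSH BINARY LOGS;"
# Pull the completed binlog over the network in raw (binary) form
mysqlbinlog --read-from-remote-server -h db.local --raw mysql-bin.000042
# Ship it off to cloud storage
aws s3 cp mysql-bin.000042 s3://homelab-backups/binlog/
```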
While a point-in-time (PiT) restore capability would be great to set up, it's an advanced feature that won't work without some kind of full backup solution already in place.
Logical backups dump everything in a database to a text file filled with SQL statements. At a later time, or on a separate machine, those statements can be executed to restore the database to its state at the time of the backup. They are potentially large to store, but easy to restore and widely supported. Logical backups can be run remotely over a TCP connection, in case we want an external cron job or a manual trigger without SSH access. Conveniently, a full logical backup can also be limited to selected tables of a given database.
In the case of MySQL, `mysqldump` can even output CSV or other formats. Postgres and MariaDB do similar things. Technically, things can change while you run the backup, so it's not guaranteed to be perfectly consistent.
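For example, a minimal dump-and-restore cycle might look like this (host and database names are placeholders); `--single-transaction` mitigates the consistency caveat for InnoDB tables by dumping from a single consistent snapshot:

```shell
# Dump one database to a text file of SQL statements
mysqldump --single-transaction -h db.local mydb > mydb.sql
# Later, or on another machine, replay the statements to restore
mysql -h db.local mydb < mydb.sql
```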
You'll still need to set up a cron job to run mysqldump and friends, then upload to S3...which means installing an S3 client and configuring that as well. Here's a prepackaged scheduled backup to S3 for Postgres that looks promising.
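A minimal version of that cron job might look like the following (the bucket name and hourly schedule are assumptions; note that `%` must be escaped in crontab entries):

```shell
# crontab entry: hourly pg_dump, gzipped and streamed straight to S3
0 * * * * pg_dump -h localhost mydb | gzip | aws s3 cp - s3://homelab-backups/pg/$(date +\%F-\%H).sql.gz
```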
CockroachDB actually has full support for scheduled backups to S3 baked in. The winner here is CockroachDB, since it's the simplest: run one command one time, and CockroachDB will take the backup on a schedule and upload it for you.
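Assuming a CockroachDB version where scheduled backups are available, that one-time command is along these lines (the bucket, auth mode, and schedule label are placeholders):

```shell
# One-time setup: CockroachDB then runs hourly full backups on its own
cockroach sql --execute="
  CREATE SCHEDULE hourly_backup
    FOR BACKUP INTO 's3://homelab-backups/crdb?AUTH=implicit'
    RECURRING '@hourly'
    FULL BACKUP ALWAYS;"
</imports></imports>
```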
Postgres, MySQL, and MariaDB all provide facilities for "snapshot" backup of the underlying file system, with various caveats. The main problems are (1) a database server in motion might be snapshotted at a bad moment, leaving an inconsistent copy, and (2) a database server might hold some key information in RAM that is not yet on disk.
To mitigate these issues, the client must lock the database and then `tar` up the various files of interest while holding that lock in an ongoing connection. Finally, the client must release the connection. It might be possible to script that with bash, but it would be annoying. I'm not sure why a tool that does this for you is not simply included in these database products.
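A bash sketch of that dance for MySQL might look like this (paths are assumptions); the trick is running tar from inside the locking session, via the client's `\!` shell-escape, so the global read lock stays held for the duration of the copy:

```shell
# Hold a global read lock, archive the data directory, then release.
# The \! line runs tar from within the same mysql session, so the
# lock is still held while the copy happens. Paths are placeholders.
mysql --user=root --password <<'EOF'
FLUSH TABLES WITH READ LOCK;
\! tar czf /backup/mysql-data.tgz /var/lib/mysql
UNLOCK TABLES;
EOF
```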
Using similar techniques, file-system-level backups can also be taken using ZFS or BTRFS (or LVM too). However, those snapshots can't easily be exported to cloud storage.
I looked at `docker checkpoint` as well, but this won't work because it only captures the running process's memory, not disk. Sadly, `docker commit` would probably also require downtime to back up the database, though in this case that would be acceptable. However, the new docker image would still need to be exported for cloud backup, and would likely be even larger than a logical backup.
Based purely on the backup criteria, my small dataset, and some tolerance for data loss over complexity, I think I'm going to use the open source "core" CockroachDB's full backup schedule and just do full backups hourly with 1-2 weeks of retention.