I've been reading up on and learning a lot about SQLite recently for some personal projects. I wanted to preserve a lot of the awesome tools and techniques I've learned over the past month or two before I forget.

Biggest Advantage - App and DB in One Process

Build apps with no separate database server infrastructure required. As an added bonus, use the sqlite3 cli to query the db directly even when not running the app.

This has lead to the common pattern of every Python app in existence using SQLite for the quickstart guide. Here's three other projects I'm aware of that go beyond a quick start to be powered exclusively by SQLite:

Drawback - No Elasticity

Because the database is really just a file, not a separate server that could be relocated to another host, you can only run one container/host/pod of your application server. Your application server must have enough CPU, memory and disk available to run the application, store all your data, execute queries, and other work the OS does in the background. It's impossible to separate out the database onto a different "physical" server or run multiple application servers. This means SQLite is generally not a good fit for many services you'd want to be able to scale in and out at your day job.

Workarounds for Lack of Elasticity

There's a few ways you can get around this limitation, but nothing will replace a separate database server for truly elastical applications. Here's a few tricks and tools:

  • DQLite - Distributed SQLite. I'm not really up to speed on this, but it seems to embed a special SQLite database that uses a Raft implementation to manage a cluster of SQLite databases, one of which is a leader. Kind of an interesting middle ground -- your application becomes stateful-ish
  • verneuil - if stale reads are acceptable, this project makes it possible to synchronously replicate to S3 and read back from S3 if local files are lost or unusable for some reason. The engineers combine this technique with sharding, described below.
  • Sharding - this is an old trick, but if your dataset can be easily and cleanly broken out by customer id, geographic region, or some other primary key in your data model, you could just run one separate SQLite per "shard". The issue is that you could never ever query across the shard without basically reimplementing DQLite. You can combine sharding with consistent hashing to minimize movement of records between shards when a new shard is added or removed.
  • NFS - theoretically, maybe, you could use a kubernetes persistent volume shared by all pods running your application to host the SQLite database. Or perhaps some other kind of network mounted file system (NFS), or just a docker volume plugin so that the file system itself is replicated instead of your database. I think this actually has a low probability of success because I don't think SQLite was made for this and likely you would undermine all the lovely guarrantees SQLite makes. Plus, even if it somehow worked, you would likely degrade your database performance pretty significantly.
  • Leader / Worker Pattern - okay, so imagine you have some really high CPU workload like rendering computer graphics, and all you need to do is provide an API to submit jobs and track the work. Rather than make all servers work the same way, make one "kind" of server (the "Leader" or "Coordinator") run a SQLite database to track the jobs state and where they are assigned, etc. Then make stateless "Workers" that actually do the CPU intensive work, and talk to the leader to store and query all state. While this system can provide serious elasticity, there's not that much state to track, and a single SQLite database and application might be sufficient to serve a decent scaled system.

Probably though, if you start outgrowing a single server to run your application, you may have outgrown SQLite.

Drawback - Potential Data Loss

Because you have to store your stateful database on the same "physical" server (really this could be bare metal, VM, or a container) as your application, if anything happens to your "physical server" then you also lose all the data too.

Workarounds for Potential Data Loss

The main ways to prevent data loss are backup or replication. Whereas backup is generally something that is done hourly or daily via an external cronjob, replication is something that generally happens continuously using internals of the database (such as the Write-Ahead-Log). Here's two replication systems:

  • LiteStream - side car process that asynchronously replicates to S3, can be separate binary or container
  • verneuil - a virtual file system ("vfs") loadable extension for SQLite that synchronously replicates to S3.

To my mind, LiteStream should in theory be the safer option because no matter what kind of downtime or connectivity issues you have reaching S3, the data on your local machine could always be backed up at a later point when connectivity is restored. Whereas, with verneuil I'd worry that if my connectivity to S3 was problematic, writes to my local disk might also fail causing data loss. I haven't examined either implementation to evaluate the quality or failure modes/guarrantees.

Additionally, SQLite itself ships with a few flavors of backup APIs. However, these backup APIs require your application to trigger the backup explicitly, rather than automatically and continuously replicating. Generally, I'd prefer to use replication because it means less code for me write and it minimizes data loss by staying very close to live writes. Having said that, backups are really amazing when you discover that the recent data is all wrong you and you need to go back 7 days to when you shipped a horrible bug. Anywho, here are some backup options:

  • Online Backup API - Really excellent docs here. Main advantage is minimal locking so that you can keep on using the database without getting frozen during backup
  • VACUUM INTO - VACUUM INTO does lock up the database significantly. However, it clears out all any leftover traces of old records and transactions so it guarrantees the smallest backup file size. Neat!

Want to provide a search functionality in your app, but don't want to run a specialized database? SQLite has you covered: Full Text Search Docs

Bonus Feature - Common Table Expressions (CTEs)

So, I always vaguely am aware that CTEs exist to make SQL queries simpler, but I've never fully grokked it. Here's a great intro to CTEs for sqlite blog post. But, really, the only data I really ever want to analyze is time series data, and it turns out CTEs that really help with times series too. The basics of the syntax are here.

Since, I really love doing time series work, it turns out there's actually a special index data structure available called R*Trees optimized for range queries.

Bonus Feature - JSON Support

I guess most all the popular databases have this now, but SQLite provides a loadable extension to make it possible to modify, query and aggregate fields from JSON valued columns.

Blobs

So, you can even have columns that have blobs of data in them (think photos or pdfs) and incrementally stream data into that column or out of that column. If you really wanted to get crazy, you could replace minio (sorta-kinda) with SQLite blobs.

Extensibility

Kind of insane, but you can write and dynamically load extensions into a running SQLite library if those extensions are written in (something compiles to) C's ABI. These extensions can add scalar and aggregations functions to the pre-existing SQL syntax.

Random Tools

Conclusion

What tools are you using to help you live with SQLite?