Josh Barratt's Blog

A collection of uploaded thoughts

Really Serverless Databases July 5, 2018

At Re:Invent 2017, AWS announced Aurora Serverless. This is a very exciting evolution of database technology – all the value of a scalable, relational store – but it scales to zero while you’re not using it. This brings “real” databases within reach for intermittently used applications where the costs of running them full time wouldn’t be justified, and especially will be excellent for test environments.

The announcement got me thinking about something else, though. I wondered if there were certain types of databases which would work well if directly embedded in lambda functions. And, after thinking about it some more, I think there are.

Really Serverless Databases

So, by ‘really serverless’, I mean “lambda-embedded databases.” Or “function-as-a-service-embedded databases”, to make it generic.

serverless databases

Why would you want this?

  1. Lambda functions are extremely scalable. You can run 1,000 of them concurrently without even nicely asking AWS to raise the limit. Databases are well known to be extremely unscalable, at least without lots of care, feeding, and spending.
  2. Embedded databases should have great performance, as there would be no network hop or serialization/deserialization overhead to get to the data. That’s the case even if it would be a single request/response cycle, let alone if you have to do multiple queries.
  3. Fewer moving pieces means fewer things that can go wrong; if your application can use just “a lambda”, that’s probably better than “a lambda and a database.”

I couldn’t find much prior art for people doing this from within lambda functions, but there are great success stories about using embedded datastores. A serverless embedded datastore just takes that power to the next level!

Constraints

So, what are the constraints to consider?

  • The data must be relatively infrequently updated. You probably could upload a new function every minute, but you probably don’t want to. (Interestingly, though, I don’t see that updating functions is a billable event, so maybe you do want to!)
  • Read-only data would be ideal; though you could handle user updated information by pushing it to another service, it’d break “read after write” consistency in most cases.
  • The entire payload (including code and whatever data is being stored) should be under 250MB unzipped, and ideally 50MB zipped. (Though apparently if you upload the zip to S3 yourself, the 50MB limit is waived.)
  • The lower the memory profile, the better, as lambda is billed on the basis of “memory used per 100 milliseconds”.
  • Queries should be difficult to cache. If they are easy to cache, it may be better to just keep them even further towards “the edge”, like in Cloudfront.

You can actually get a bit more space if needed; lambdas come with a 500MB ‘scratch space’ in /tmp. On startup, you could download data from s3 or some other hosted source. Since lambdas hang out for a while, you’d only pay this penalty when the function does a ‘cold start’, so it might be viable if your dataset almost fits.

Possible Use Cases

I’ll admit, those do seem like tough constraints. It seems there are quite a few use cases which do actually fit nicely between them, though.

Geospatial Databases

Geographic data is actually an ideal fit. Consider the common use case of a “store locator”.

  • Queries will be coming for any number of coordinates, making it difficult to cache
  • An awful lot of latitude/longitude/metadata can be stored in 250MB. For example, there are ~30,000 Starbucks locations worldwide. If you store the coordinates in 6 bytes, that maps to about 2.4 meters of accuracy. So, about 180 kB to store all of the actual locations, leaving plenty of room for additional metadata about the store. There are many businesses who have fewer significant locations to keep track of.
  • The data is likely to be updated infrequently; the physical world tends to move more slowly than the digital one.

I haven’t played around with it yet, but Buntdb looks like it may be a very nice embeddable database for general use, but especially for geospatial; it includes built-in support for finding the “N closest points” or “all points within a certain distance”.

There are also great embeddable libraries for geofencing, like regionagogo, built on top of Google’s s2 and boltdb.

Security Screening

There are lots of relatively small and static datasets that can be useful for determining if a request should be handled, or how careful we should be in handling it.

There are a number of free and paid reputation lists such as spamhaus (23k), torproject (153k), and emergingthreats (12k). With removing comments, deduplication, and CIDR grouping, this combined dataset could be shrunk even further.

There are also other databases keyed off IP Address that might be helpful, like the free GeoLite2. There’s a database for mapping IPs to countries, handy for those who need to comply with certain regulations. It comes already helpfully packed in a 1.75MB binary database, perfect for embedding.

NIST also publishes a list of common bad passwords, which contains the most common 1,000,000 passwords. In raw form, that’s still only 8MB. (Take a peek at the top 100 if you want to feel really good about your own password choices. I was curious why ‘dragon’ was in the top 10, and it turns out Wired just wrote about it a few months ago!)

Information like this, though, can be massively compressed, especially if probabilistic structures such as Bloom or Cuckoo Filters are used.

Spell Checking / Typo Correction

There are many domains where the valid terms to search a database against are constrained. Consider a database that worked with chemicals; it would be useful to be able to autocorrect “pulonium” to “polonium” before sending the query to a backend.

Wherever there’s a well-known list of keys to values, such as geographic information, or creative works like books, movies, music, etc, this can be useful. Instead of getting the (typo’d) query from the user, searching for it, and returning an error, with a local autocorrect database, there are many other options.

  • prescreening the query before sending to the backend, only letting valid queries through
  • catching queries that come back with no results, and retrying with the most probable alternative (like google does)
  • returning the error, but giving the user some “did you mean” suggestions

Spell checking and approximate matching algorithms tend to require lots of high volume queries, so are especially well suited to having the data local to you (and in memory, ideally.)

Caching The Hot Head

In many (and probably most) environments, queries follow a power law distribution. It’s not uncommon to have the top 1% of queries represent 90% – or much more – of the total query volume.

If the data is relatively stable, caching those requests directly in the lambda could provide significant performance improvements and cost savings – especially since lambdas are billed by their total running time, you pay on both ends to make queries.

If one of the “hot head” items were updated, that could be the trigger for deploying a new lambda with the latest values in it. This would be an easy way to over-optimize, so do the math before trying!

Graph Databases

I’ve been watching graph databases for many years. They’ve had a fairly slow burn of growth over time, but there’s even a managed one on Amazon now (AWS Neptune).

There’s an open source graph database called Cayley which is built to be embeddable and could be integrated into a Lambda.

These databases can be used for all kinds of data which would be a challenging fit for a relational model, such as social networks. I have seen the SQL query for friends of friends of friends and it is a thing which cannot be unseen.

Also, if a full-blown generic graph database is not warranted, it’s fairly straightforward to build out an optimized graph that can be queried within a process. Which is actually what I did to test out this idea!

Testing this idea

I’ve always been fascinated by the Oracle of Bacon. For those who haven’t seen it, the idea is to think of 2 actors, and then figure out a “path” you can take through the films they have been in to connect them. The number of movies you have to name is their “Bacon Number” or “degrees of separation.”

Don Cheadle and Charlize Theron are separated by 2 degrees. Don Cheadle was in “Avengers: Infinity War” with Josh Brolin, who was in “In the Valley of Elah” with Charlize Theron.

This is a perfect application of graph databases. The actors and the films are both nodes, and having appeared in them creates a link between those nodes.

movie data as graph

A teaser for the experiment (which actually generated the above text):

  • The graph has 920,000 films and 215,000 people
  • It compresses to about 9 MB of data (including the movie titles, people’s names, and relationships between them)
  • Querying the database takes only ~1.5ms, with likely some more optimizations possible
  • Running a lambda function to query this data and return the results completes incredibly quickly: ~40ms, which makes it suitable for even latency-sensitive applications, including for use as Alexa Skills.

In the next posts I’ll explore:

If you want to look at the code for the experiment, it’s at jbarratt/lambdadb.

IN CONCLUSION

While they won’t replace all of your database needs, embedding databases with lambdas has a huge amount of potential. Especially if you’re using a lower level language like Go, there are many off the shelf libraries that can help. A sampling that seem to be highly recommended:

  • Bleve for text indexing
  • Badger appears to be the leading embeddable key value store
  • QL and, of course, sqlite3 are embeddable SQL stores
  • Buntdb is a more sophisticated k/v store which includes spatial search
  • Cayley for graph data

I haven’t tested any of those specifically in lambda, but apparently I have a new hobby so it may happen soon. (And let me know if you try any of these ideas out!).