
Sunday, July 20, 2008

MySQL, PostgreSQL and BLOB Streaming

You can store surprisingly large binary objects in your relational database[1], but even if your chosen system is perfectly OK with a 100 megabyte BLOB, you're going to have to handle all that data on its way to and from storage.

With a 2 MB image file you can get away with reading the data fully into memory on each access. With a 100 MB file it's a little trickier. Not only is that a pretty big byte array, you may also end up with multiple copies of it (your application's copy, the driver's copy, the database's copy, etc.).
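
Roughly, the difference looks like this (just a sketch; the class name, method names, and buffer size here are arbitrary, not from any particular library):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class BlobCopy {
    // Whole-thing-in-memory: fine for a 2 MB image, painful at 100 MB,
    // especially once the driver and database hold their own copies too.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        copy(in, bytes);
        return bytes.toByteArray();
    }

    // Streaming: memory use is bounded by the buffer, no matter how big the BLOB is.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        return total;
    }
}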

The obvious answer is to stream the data. JDBC's BLOB interface has the convenient-looking "getBinaryStream" and "setBinaryStream" calls[2]. But if you actually try to use the calls you may get a nasty (and database-specific) surprise.

MySQL, for instance, implements getBinaryStream in their JDBC driver as[3]:

return new ByteArrayInputStream(getBinaryData());

You see the problem. There's no point in streaming the data when the driver is just going to read it all into memory anyway. There are some workarounds[4], but ultimately there are server-side limitations that can only be avoided via a brand-new database engine[5].

PostgreSQL has a reputation for being more mature than MySQL, and that holds for handling large binary objects. Of course, another part of PostgreSQL's reputation is a Perl-like "multiple ways to do everything" approach, and there are a couple of different ways to handle BLOBs.

There are two basic API approaches for this sort of thing: either pretend the huge bolus of binary gunk is just another column value stored along with all the other data for the row (whether it's really stored that way or not), or put a pointer ("locator") to the data in the relational row, and provide some other interface for actually working with the data.

PostgreSQL does both, at the database level (it has two different low-level mechanisms[6]) and at the API level (it has several different APIs that expose the underlying mechanisms in various ways). Annoyingly, you can sort of mix and match the APIs and underlying implementations.

The following discussion assumes the use of oids (locators), not byteas (in-row blobs; see the footnote above), which means that at the database level we're going to expose the locator (oid) in the column. In other words, if you do a "SELECT" on the column, it's going to give you a bunch of integer pointers, not a bunch of binary content (there's a quick illustration of this right after the table definition).

CREATE TABLE resource_record
(
  id serial NOT NULL,
  content oid,
  ...
);
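
Do a plain SELECT on that column from JDBC and all you get back is numbers, something like this (a sketch; the connection setup is assumed, and the printed output is just for illustration):

// The "content" column holds an oid -- an integer locator, not the bytes.
Statement st = conn.createStatement();
ResultSet rs = st.executeQuery("SELECT id, content FROM resource_record");
while (rs.next()) {
    long oid = rs.getLong("content");  // just an integer pointer, not content
    System.out.println("row " + rs.getInt("id") + " -> large object " + oid);
}
rs.close();
st.close();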


Luckily, PostgreSQL's JDBC driver will (mostly) allow you to treat the locator column like it really holds the BLOB directly:

Blob content = ...
InputStream contentStream = content.getBinaryStream();
...


Well, assuming you're inside a database transaction, which is generally the only time you can stream to or from a JDBC BLOB[8].
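
Putting that together, a read path looks roughly like the following sketch. It assumes the resource_record table from above and a plain JDBC Connection; the class and method names are mine, error handling is minimal, and the buffer size is arbitrary.

import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Blob;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class ResourceReader {
    /** Streams the BLOB for one resource_record row to the given output. */
    static void streamContent(Connection conn, int resourceId, OutputStream out)
            throws Exception {
        conn.setAutoCommit(false);  // streaming only works inside a transaction
        try {
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT content FROM resource_record WHERE id = ?");
            ps.setInt(1, resourceId);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                Blob content = rs.getBlob("content");
                InputStream in = content.getBinaryStream();
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                in.close();
            }
            rs.close();
            ps.close();
            conn.commit();  // the stream is only usable while the transaction is open
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }
}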

But it's not quite perfect. The lower-level Postgres calls[9] allow BLOBs to be shared, so deleting a row doesn't necessarily delete the corresponding blob. That conflicts with normal JDBC semantics, which assume the BLOB is part of the row and should be deleted along with it. Postgres' JDBC driver should probably implement those semantics, but it doesn't, so you have to go delete the blob yourself. One solution is to add the semantics back in via a trigger or rule:

CREATE OR REPLACE FUNCTION on_resource_delete() RETURNS trigger
AS $on_resource_delete$
BEGIN
  -- lo_unlink has side effects, so the function shouldn't be marked IMMUTABLE
  PERFORM lo_unlink( old.content );
  RETURN NULL;
END;
$on_resource_delete$
LANGUAGE plpgsql
VOLATILE
RETURNS NULL ON NULL INPUT;

CREATE TRIGGER resource_delete_trigger
AFTER DELETE
ON resource_record
FOR EACH ROW EXECUTE PROCEDURE on_resource_delete();
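
For completeness, the write side can go through the driver's lower-level large-object API (the same lo_* machinery as footnote [9]; the docs in [6] cover it too). The sketch below follows that pattern, but the exact method names vary a little between driver versions (older drivers use create() returning an int oid where newer ones have createLO()), so treat it as a starting point rather than gospel:

import java.io.FileInputStream;
import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import org.postgresql.PGConnection;
import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

public class ResourceWriter {
    /** Creates a large object from a file and points a new resource_record row at it. */
    static void storeContent(Connection conn, String path) throws Exception {
        conn.setAutoCommit(false);  // the large-object API also needs a transaction

        // The driver-specific large-object manager (the lo_* machinery)
        LargeObjectManager lom = ((PGConnection) conn).getLargeObjectAPI();

        // Create a new large object and open it for writing
        long oid = lom.createLO(LargeObjectManager.READ | LargeObjectManager.WRITE);
        LargeObject lo = lom.open(oid, LargeObjectManager.WRITE);

        // Stream the file into the large object a buffer at a time
        InputStream in = new FileInputStream(path);
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            lo.write(buffer, 0, n);
        }
        in.close();
        lo.close();

        // Store the locator in the row; the trigger above unlinks it on delete
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO resource_record (content) VALUES (?)");
        ps.setLong(1, oid);
        ps.executeUpdate();
        ps.close();

        conn.commit();
    }
}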


If you're interested, the data itself is stored over in the system table "pg_largeobject"[10].

Since this is PostgreSQL, the discussion wouldn't be complete without a mention of vacuum. Vacuum is the PostgreSQL "garbage collector": if you don't run it, the disk space associated with deleted BLOBs won't be recovered. In production that may or may not matter, but during development it can be a pain to lose the disk space.
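
If you want the space back, a full vacuum of pg_largeobject can be kicked off from JDBC too. The one gotcha is that VACUUM refuses to run inside a transaction block, so autocommit has to be on (and vacuuming a system table generally requires appropriate privileges). A minimal sketch, with an illustrative class name:

import java.sql.Connection;
import java.sql.Statement;

public class Housekeeping {
    /** Reclaims space left behind by deleted large objects. */
    static void vacuumLargeObjects(Connection conn) throws Exception {
        conn.setAutoCommit(true);  // VACUUM can't run inside a transaction block
        Statement st = conn.createStatement();
        st.execute("VACUUM FULL pg_largeobject");
        st.close();
    }
}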

On my OS X machine, the file associated with the pg_largeobject table turned out to be:

/Library/PostgresPlus/8.3/data/base/17456/2613

If you load up a couple of fairly large BLOBs, you can probably figure out what file is being used on your system. It's comforting to watch it grow and shrink.

So, although there are other options, it definitely works to:
  • Make the column an oid
  • Add the trigger if you want your BLOB deleted with its row
  • Use normal JDBC BLOB interface and stream away
  • Run a full vacuum occasionally if you need to recover disk space
  • Use a cache (like Squid) if you're serving the data back out over HTTP
[1] If you've chosen the right database, have some time to spend tuning, and are willing to install something like Squid if you're serving the data back out over HTTP, then single blobs of 100's of MB are not undoable. Whether it's good or evil is a different question.
[2] http://java.sun.com/javase/6/docs/api/java/sql/Blob.html
[3] mysql-connector-java-5.1.6/com/mysql/jdbc/Blob.java:108
[4] MySQL also has a (seriously crippled) locator-based version: http://dev.mysql.com/doc/refman/6.0/en/connector-j-reference-implementation-notes.html, see [5] for why it's problematic.
[5] http://www.blobstreaming.org/ Even if you fix the client side so it doesn't read the entire BLOB into memory, the server side still does, and insists on transferring the entire thing every time, even if the client only wants a tiny piece. Which is why you need the whole new engine.
[6] http://jdbc.postgresql.org/documentation/80/binary-data.html gives an overview of PostgreSQL BLOB support, and how there are a lot of options.
[8] Which is painful, because it means you may have to keep the transaction open a very long time if you're, say, streaming 80 MB of data in over http from a client on a slow network connection. That's the sort of thing that drives people to stream to a temporary file first, then from the temp file to the database.
[9] The lo_* routines, see http://www.postgresql.org/docs/8.3/interactive/largeobjects.html
[10] The data is stored as a long series of 2k chunks: http://www.postgresql.org/docs/7.4/interactive/catalog-pg-largeobject.html
