Thursday, December 6, 2007

Playing with Hibernate Shards

After hearing a lot of discussion at QCon and elsewhere about the notion of database sharding (a new, hipper name for the older, more mundane name "partitioning"), I decided to download the Hibernate Shards project and give it a whirl (what else do you do when you can't sleep?). I wrote an earlier blog entry on running multiple MySQL instances that will help when playing with Hibernate Shards.

As I'm reasonably proficient with Hibernate, I figured that I could be able to jump in and get a pretty good understanding just with a few days effort. Luckily, I was correct - my basic assumptions and paradigms for using Hibernate remain mostly intact (Configuration, SessionFactory, and Session object are almost the exact same). The details, of course, lie more in the distribution if data between multiple database instances.

Just a few quick Hibernate tech notes before talking about the data implications. You need to have a little bit more going on in your Hibernate config file(s), one file for each shard. Nothing big there. When creating a SessionFactory, you have to provide implementations for the following interfaces:
  • ShardAccessStrategy - a strategy for accessing sharded databases for queries (not loading an item by it's id). The provided implementations offer either sequential (including a round-robin, load balanced version) or parallel access.
  • ShardResolutionStrategy - a strategy for determining which shard to access when loading an entity.
  • ShardSelectionStrategy - a strategy for determining which shard to store a new entity in.
As relates to data, let's first discuss the idea of entity ID generation. As your data is now spread out multiple databases, you really can't use a database sequence for generating ids - unless you restrict each database's valid sequence values to a restricted range (db1 gets ids 0 - 3000, db2 gets ids 3001 - 6000, and so on). Somehow, this sounds easy but rather self-limiting. Out of the box, Hibernate Shards provides two mechanisms for managing ID generation:
  • ShardedUUIDGenerator - this implementation basically creates a large random number, and also encodes the shard id into the id. What you get is a really big number (my sample tests had about 30-36 digits each).
  • ShardedTableHiLoGenerator - this implementation creates a hilo table in one of the shards and uses that to generate all IDs. Of course, the shard then becomes not only a bottle neck for performance, but is a single point of failure.
After playing with it and thinking for a few days, the UUID generator seems the better choice, as long as your db is cool with big primary keys. MySQL worked fine with this, but I needed to make the column of type numeric(64,0) - 64 may be a bit large, but it made my sample code work so I left it. Of course, as your entity's id is shard specific, all the data related to that entity must reside on the same shard. This makes total sense and is definitely what you want from both management and performance perspectives.

Of course, deciding exactly how spread data amongst the shards is a big decision. Unfortunately, Hibernate Shards does not provide a facility to live resharding (not that it would be easy, mind you). Essentially, you would have to do some slight-of-hand while then data was being moved/redistributed, then update your Hibernate configurations and entity to shard mappings. Hibernate Shards does, however, provide a concept of virtual shard ids, and you can point multiple virtual shards to physical shards. This seems like a good idea as you can define many virtual shards which can map to just to two or three physical shards, then as you grow, it's only a slight config update to repoint the virtual shard. The documentation says "Virtual shards are cheap", so it's probably reasonable to create a bunch of virtual shards as long as it doesn't hurt performance.

Hibernate Shards does have a few gotchas:
  • HQL is pretty much unuseable at this point, and there seem to be some incomplete pieces with Criteria queries. This is mainly due just to the immaturity of the project; I expect they'll be coming along soon.
  • Master values from lookup tables. As Hibernate demands one and only one instance of an object with a given id, if you use master objects from a lookup table, you need to be careful that you don't get into strange cross-shard problems. This should only happen when saving/updating, but it is something to be aware of. For some advice about this problem, I have another blog entry about that.
All in all, Hibernate Shards is a rather nice wrapper for hiding all the complexity of spreading your data around multiple databases. As it's still in beta, I'm not sure if it's production-ready. As a web site, especially e-commerce sites, are more than just read-repositories for front-end HTML pages, there may be a need for more robust functionality before making the dive.


rvk said...

Curious to see if you came across any discussions regarding sharding strategies., e.g. shard per entity, etc.

From what I hear, the hardest part of shards is re-balancing them; bummer that hibernate hasn't taken it that far yet. --rvk

Jason Brown said...

I did find some notions about sharding strategies, but it seems like you really need to find out what's gonna your the best for your data set. In the week I spent with Shards, I was able to see many different ways of handling sharding data - it really comes down the data itself.

For re-balancing, I think why Hibernate passes on it is that it's not a trivial operation, and perhaps they felt it better to defer to either a DBA team or a future implementation. It also gets weird because hibernate needs to know the shard id + entity identifier to load an object, so live re-sharding could get weird. They do have the notion of virtual shards, however, which should be able to alleviate some of the pain.