As I'm reasonably proficient with Hibernate, I figured I would be able to jump in and get a pretty good understanding with just a few days' effort. Luckily, I was correct: my basic assumptions and paradigms for using Hibernate remain mostly intact (the Configuration, SessionFactory, and Session objects are almost exactly the same). The details, of course, lie more in the distribution of data between multiple database instances.
Just a few quick Hibernate tech notes before talking about the data implications. You need to have a little bit more going on in your Hibernate config file(s), one file for each shard. Nothing big there. When creating a SessionFactory, you have to provide implementations for the following interfaces:
- ShardAccessStrategy - a strategy for accessing sharded databases for queries (as opposed to loading an item by its id). The provided implementations offer either sequential (including a round-robin, load-balanced version) or parallel access.
- ShardResolutionStrategy - a strategy for determining which shard to access when loading an entity.
- ShardSelectionStrategy - a strategy for determining which shard to store a new entity in.
ID generation also has to be shard-aware. Hibernate Shards provides two generators:
- ShardedUUIDGenerator - this implementation basically creates a large random number and also encodes the shard id into the id. What you get is a really big number (my sample tests had about 30-36 digits each).
- ShardedTableHiLoGenerator - this implementation creates a hilo table in one of the shards and uses that to generate all IDs. Of course, that shard then becomes not only a performance bottleneck, but also a single point of failure.
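To make the link between ID generation and shard resolution a bit more concrete, here is a minimal sketch in plain Java of the general idea of encoding a shard id into a large random id and reading it back when loading. This is not the Hibernate Shards implementation: the class name `ShardedIdSketch`, the 16-bit shard field, and the 112 random bits are all assumptions I made up for illustration.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Hypothetical sketch only; the real ShardedUUIDGenerator may lay out its ids differently.
final class ShardedIdSketch {
    // Assumption: the low 16 bits of every id hold the shard id.
    private static final int SHARD_BITS = 16;
    private static final BigInteger SHARD_MASK =
            BigInteger.ONE.shiftLeft(SHARD_BITS).subtract(BigInteger.ONE);
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Builds an id from 112 random bits followed by the 16-bit shard id. */
    static BigInteger generate(int shardId) {
        if (shardId < 0 || shardId >= (1 << SHARD_BITS)) {
            throw new IllegalArgumentException("shard id out of range: " + shardId);
        }
        BigInteger randomPart = new BigInteger(112, RANDOM);
        return randomPart.shiftLeft(SHARD_BITS).or(BigInteger.valueOf(shardId));
    }

    /** Recovers the shard id from an id produced by generate(). */
    static int shardOf(BigInteger id) {
        return id.and(SHARD_MASK).intValue();
    }

    public static void main(String[] args) {
        BigInteger id = generate(3);
        System.out.println(id + " lives on shard " + shardOf(id));
    }
}
```

With a scheme like this, resolving the shard for a load by id needs no lookup at all, since the id itself carries the answer. The flip side is that an entity's shard is baked into its id, which is part of why moving data between shards later is awkward.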
Of course, deciding exactly how to spread data amongst the shards is a big decision. Unfortunately, Hibernate Shards does not provide a facility for live resharding (not that it would be easy, mind you). Essentially, you would have to do some sleight of hand while the data was being moved/redistributed, then update your Hibernate configurations and entity-to-shard mappings. Hibernate Shards does, however, provide a concept of virtual shard ids, and you can point multiple virtual shards to the same physical shard. This seems like a good idea, as you can define many virtual shards that map to just two or three physical shards; then, as you grow, it's only a slight config update to repoint a virtual shard. The documentation says "Virtual shards are cheap", so it's probably reasonable to create a bunch of virtual shards as long as it doesn't hurt performance.
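The virtual shard idea can be sketched in a few lines of plain Java. This is only an illustration of the concept under my own assumptions, not the library's API: `VirtualShardSketch`, the modulo-based assignment, and the `repoint` method are hypothetical names and choices.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of virtual-to-physical shard mapping; not Hibernate Shards code.
final class VirtualShardSketch {
    private final int virtualShardCount;
    private final Map<Integer, Integer> virtualToPhysical = new HashMap<>();

    VirtualShardSketch(int virtualShardCount, int physicalShardCount) {
        this.virtualShardCount = virtualShardCount;
        // Assumption: start by spreading virtual shards evenly over the physical ones.
        for (int v = 0; v < virtualShardCount; v++) {
            virtualToPhysical.put(v, v % physicalShardCount);
        }
    }

    /** The virtual shard of a key never changes, even when data moves. */
    int virtualShardFor(long entityKey) {
        return (int) Math.floorMod(entityKey, (long) virtualShardCount);
    }

    /** Looks up which physical shard currently hosts the key's virtual shard. */
    int physicalShardFor(long entityKey) {
        return virtualToPhysical.get(virtualShardFor(entityKey));
    }

    /** Repoints one virtual shard, e.g. after its rows were copied to a new database. */
    void repoint(int virtualShard, int newPhysicalShard) {
        if (!virtualToPhysical.containsKey(virtualShard)) {
            throw new IllegalArgumentException("unknown virtual shard: " + virtualShard);
        }
        virtualToPhysical.put(virtualShard, newPhysicalShard);
    }

    public static void main(String[] args) {
        VirtualShardSketch shards = new VirtualShardSketch(8, 2);
        System.out.println("before: " + shards.physicalShardFor(13));
        shards.repoint(5, 2); // virtual shard 5 now lives on a third database
        System.out.println("after: " + shards.physicalShardFor(13));
    }
}
```

The point of the indirection is that entities only ever know their virtual shard, so growing from two databases to three means moving the rows of a few virtual shards and changing a handful of map entries, rather than recomputing where every entity belongs.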
Hibernate Shards does have a few gotchas:
- HQL is pretty much unusable at this point, and there seem to be some incomplete pieces with Criteria queries. This is mainly due to the immaturity of the project; I expect those pieces will be coming along soon.
- Master values from lookup tables. As Hibernate demands one and only one instance of an object with a given id, if you use master objects from a lookup table, you need to be careful that you don't get into strange cross-shard problems. This should only happen when saving/updating, but it is something to be aware of. I have another blog entry with some advice about this problem.
2 comments:
Curious to see if you came across any discussions regarding sharding strategies, e.g. shard per entity, etc.
From what I hear, the hardest part of shards is re-balancing them; bummer that Hibernate hasn't taken it that far yet. --rvk
I did find some notions about sharding strategies, but it seems like you really need to find out what's going to work best for your data set. In the week I spent with Shards, I was able to see many different ways of handling sharded data - it really comes down to the data itself.
For re-balancing, I think Hibernate passes on it because it's not a trivial operation, and perhaps they felt it better to defer to either a DBA team or a future implementation. It also gets tricky because Hibernate needs to know the shard id + entity identifier to load an object, so live re-sharding could get weird. They do have the notion of virtual shards, however, which should be able to alleviate some of the pain.