Tuesday, May 12, 2015

More thoughts on cassandra's gossip

Over the last few months I've been working out new ways to think and talk about cassandra's existing gossip system. This is an effort to help the cassandra community become more comfortable with this part of the database.

 

gossip is a verb


One of the more confusing aspects of cassandra's gossip system is the fact that two essential ideas are jammed together: cluster membership, and how that data is disseminated throughout the cluster. As an outsider looking in, it's very easy to conflate the two: when we look at nodetool, there is a command for "gossipinfo", prints membership information. When members of the community (myself included, until very recently) speak of gossip, we speak of "what's in gossip" or "can you show me the gossip output" when what we really interested in is "what is the state of the nodes in the cluster".

What I want to do is to separate these two concepts, membership and gossip. Gossip is a verb; it is a mechanism to disseminate information. Membership, on the other hand, is a noun; it is the metadata about nodes in the cluster. We can apply the verb to the noun (disseminate the node metadata), but we should recognize that the two are distinct.

 

gossip as repair (in the miniature)


Cassandra's standard mechanism for spreading user data across nodes is somewhat like this (highly simplified)

  • write path
    • coordinator gets request from client
    • determines nodes that own range
    • sends the writes to those nodes
  •  repair (anti-entropy)
    • A node figures out all the other nodes that share the same ranges
    • they build up merkle trees for each range
    • send trees back to initiator
    • diff the merkle trees
    • exchange any divergent data

(Note: I've left out hints and read-repair for clarity)

Cassandra's gossip system is much like repair (simplified, as well)
  • a node selects another peer (randomly)
  • the initiator sends a digest of it's membership data
  • the peer sends any differences, and a digest of data it needs
  • initiator merges data from peer, and sends back and data the peer needs
  • peer merges data from initiator

Both repair and gossip perform anti-entropy, a periodic data reconciliation operation. They both send summaries (merkle trees or digests) of their data, and then send the full data if any divergence is detected. Thus, in a certain light, cassandra's gossip can be viewed as another form of repair; one that operates on cluster membership data instead of user data. The primary difference, then, between the dissemination of the two types of data is that user data has an initial broadcast or multicast operation when new or updated data is received; membership data is only disseminated via anti-entropy (the current gossip implementation).

Taking a step back, we can see cassandra's gossip as a full distributed system on it's own right, living within the context of another distributed system.