Software Carnival: gossip

Tuesday, May 12, 2015

More thoughts on cassandra's gossip

Over the last few months I've been working out new ways to think and talk about cassandra's existing gossip system. This is an effort to help the cassandra community become more comfortable with this part of the database.

gossip is a verb

One of the more confusing aspects of cassandra's gossip system is the fact that two essential ideas are jammed together: cluster membership, and how that data is disseminated throughout the cluster. As an outsider looking in, it's very easy to conflate the two: when we look at nodetool, there is a command for "gossipinfo", prints membership information. When members of the community (myself included, until very recently) speak of gossip, we speak of "what's in gossip" or "can you show me the gossip output" when what we really interested in is "what is the state of the nodes in the cluster".

What I want to do is to separate these two concepts, membership and gossip. Gossip is a verb; it is a mechanism to disseminate information. Membership, on the other hand, is a noun; it is the metadata about nodes in the cluster. We can apply the verb to the noun (disseminate the node metadata), but we should recognize that the two are distinct.

gossip as repair (in the miniature)

Cassandra's standard mechanism for spreading user data across nodes is somewhat like this (highly simplified)

write path

coordinator gets request from client
determines nodes that own range
sends the writes to those nodes

repair (anti-entropy)

A node figures out all the other nodes that share the same ranges
they build up merkle trees for each range
send trees back to initiator
diff the merkle trees
exchange any divergent data

(Note: I've left out hints and read-repair for clarity)

Cassandra's gossip system is much like repair (simplified, as well)

a node selects another peer (randomly)
the initiator sends a digest of it's membership data
the peer sends any differences, and a digest of data it needs
initiator merges data from peer, and sends back and data the peer needs
peer merges data from initiator

Both repair and gossip perform anti-entropy, a periodic data reconciliation operation. They both send summaries (merkle trees or digests) of their data, and then send the full data if any divergence is detected. Thus, in a certain light, cassandra's gossip can be viewed as another form of repair; one that operates on cluster membership data instead of user data. The primary difference, then, between the dissemination of the two types of data is that user data has an initial broadcast or multicast operation when new or updated data is received; membership data is only disseminated via anti-entropy (the current gossip implementation).

Taking a step back, we can see cassandra's gossip as a full distributed system on it's own right, living within the context of another distributed system.

Thursday, July 31, 2014

Gossip papers

In preparation for my talk at the Cassandra Summit 2014 ("Demystifying Gossip"), I've been going back and reading a lot of existing literature on gossip and epidemic broadcast protocols. What follows is simply a listing of all the papers being carried around in my bag at this time, plus a quick note or two that might be helpful to others. This is not a thoroughly exhaustive listing by any measure, but it mainly grew organically as I started off with one paper, then checked it's references, read those, checked their references, ...

These papers aren't listed in any particular order, and any grouping is largely as I understand the problem at this time. There are other papers I'm sure I'm missing, and I haven't made through all of these yet (and a few I need to reread, a few more times!). Also, the notes are are probably somewhat subjective, but, meh, welcome to the internet.

Epidemic Algorithms for Replicated Database Maintenance - one of the first seminal papers describing gossip/epidemic protocols (1989). great read

The Peer Sampling Service: Experimental Evaluation of Unstructured Gossip-Based Implementations - First major paper to spell out the peer service in relation to gossip systems

Bimodal Multicast

HyParView - Hybrid Partial View, example of a peer sampling service (cares about membership & view maintenance only). maintains two partial views: active and passive; when peer from active is unreachable, a peer from passive view is promoted. special join logic when new peer tries to join.

Epidemic Broadcast Trees - Plumtree paper. Gossip overlay that uses a spanning tree. implemented in riak 2.0.

Thicket: A Protocol for Building and Maintaining Multiple Trees in a P2P Overlay An extension of Plumtree, and specifically optimizes for sender-based trees by have multiple trees over the cluster, and attempting to ensure a given peer is an interior node in only one of the trees.

Gossip-Based Broadcast - Nice overview of gossip systems written by HyParView/Plumtree authors

Correctness of a Gossip Based Membership Protocol - partial view alg, with a formal proof

SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

Efficient Reconciliation and Flow Control for Anti-Entropy Protocols - adding anti-entropy to gossip exchanges. basis of cassandra's gossip subsystem

GEMS: Gossip-Enabled Monitoring Service for Heterogeneous Distributed Systems - attempts to introduce a consensus algorithm over gossip for determining some global notion of a peer's liveness; used as part of a monitoring service. each node emits a continuous heartbeat for failure detection on peers, but peers judge a target node by how rounds of gossip it has been since it saw an update from that target, which is a (semi-)subjective measurement.

Large-Scale Newscast Computing on the Internet - describes the newscast protocol.

A Robust and Scalable Peer-to-Peer Gossiping Protocol - Followup to the Newscast paper, primarily deals with the implementation and testing the Newscast authors performed. not much info about the actual protocol, and no source code, either.

Spatial Gossip and Resource Location Protocols

Failure Detectors (not specifically gossip, pre se, but close enough for my needs)

A Gossip-Style Failure Detection Service - not dissimilar to cassandra's basic notion of failure detection. based on random gossip and a heartbeat counter

The ϕ Accrual Failure Detector - output a suspicion level of a node based on received heartbeat rate. a modified version is used in cassandra.

The Failure Detector Abstraction

Implementations

The two implementations I know best are those found in cassandra and riak. Serf is an implementation of the SWIM paper in golang, but have never had a chance to check it out.