Thursday, July 31, 2014

Gossip papers

In preparation for my talk at the Cassandra Summit 2014 ("Demystifying Gossip"), I've been going back and reading a lot of existing literature on gossip and epidemic broadcast protocols. What follows is simply a listing of all the papers being carried around in my bag at this time, plus a quick note or two that might be helpful to others. This is not a thoroughly exhaustive listing by any measure, but it mainly grew organically as I started off with one paper, then checked it's references, read those, checked their references, ...

These papers aren't listed in any particular order, and any grouping is largely as I understand the problem at this time. There are other papers I'm sure I'm missing, and I haven't made through all of these yet (and a few I need to reread, a few more times!). Also, the notes are are probably somewhat subjective, but, meh, welcome to the internet.

Epidemic Algorithms for Replicated Database Maintenance - one of the first seminal papers describing gossip/epidemic protocols (1989). great read

The Peer Sampling Service: Experimental Evaluation of Unstructured Gossip-Based Implementations - First major paper to spell out the peer service in relation to gossip systems

Bimodal Multicast

HyParView - Hybrid Partial View, example of a peer sampling service (cares about membership & view maintenance only). maintains two partial views: active and passive; when peer from active is unreachable, a peer from passive view is promoted. special join logic when new peer tries to join.

Epidemic Broadcast Trees - Plumtree paper. Gossip overlay that uses a spanning tree. implemented in riak 2.0.

Thicket: A Protocol for Building and Maintaining Multiple Trees in a P2P Overlay An extension of Plumtree, and specifically optimizes for sender-based trees by have multiple trees over the cluster, and attempting to ensure a given peer is an interior node in only one of the trees.

Gossip-Based Broadcast - Nice overview of gossip systems written by HyParView/Plumtree authors

Correctness of a Gossip Based Membership Protocol - partial view alg, with a formal proof

SWIM: Scalable Weakly-consistent Infection-style Process Group Membership Protocol

Efficient Reconciliation and Flow Control for Anti-Entropy Protocols - adding anti-entropy to gossip exchanges. basis of cassandra's gossip subsystem

GEMS: Gossip-Enabled Monitoring Service for Heterogeneous Distributed Systems - attempts to introduce a consensus algorithm over gossip for determining some global notion of a peer's liveness; used as part of a monitoring service. each node emits a continuous heartbeat for failure detection on peers, but peers judge a target node by how rounds of gossip it has been since it saw an update from that target, which is a (semi-)subjective measurement.

Large-Scale Newscast Computing on the Internet - describes the newscast protocol.

A Robust and Scalable Peer-to-Peer Gossiping Protocol - Followup to the Newscast paper, primarily deals with the implementation and testing the Newscast authors performed. not much info about the actual protocol, and no source code, either.

Spatial Gossip and Resource Location Protocols

Failure Detectors (not specifically gossip, pre se, but close enough for my needs)

A Gossip-Style Failure Detection Service - not dissimilar to cassandra's basic notion of failure detection. based on random gossip and a heartbeat counter

The ϕ Accrual Failure Detector - output a suspicion level of a node based on received heartbeat rate. a modified version is used in cassandra.

The Failure Detector Abstraction

Implementations

The two implementations I know best are those found in cassandra and riak. Serf is an implementation of the SWIM paper in golang, but have never had a chance to check it out.