What is RP?
RP stands for "Relativistic Programming", which is a programming technique for concurrent shared-memory architectures. RP algorithms are distinguished by two properties. The first property is that RP algorithms tolerate different threads seeing events occurring in different orders, so that events are not necessarily globally ordered, but rather subject to constraints of per-thread ordering, and in a few cases, partial-order constraints on global ordering. The second property is that RP algorithms tolerate conflicts, for example, one thread can safely modify a memory location despite the fact that other threads might be concurrently reading that same memory location.
These properties allow several important classes of problems to be addressed with wait-free linearly scalable algorithms having overheads rivaling that of the equivalent synchronization-free single-threaded code. In many cases, these algorithms have extremely simple implementations, so that practitioners can readily make use of them. A case in point is RCU (read-copy update), which was accepted into the Linux kernel in October 2002, and as of May 2012 has more than 6000 uses in the 3.4 Linux kernel. Please note that although RP is not a generally applicable technique, it does bring great benefits to those situations to which it does apply. Where it applies, it provides the following benefits:
wait-free progress guarantees to parts of, or, in some cases, all of the RP algorithm. For example, RCU read-side primitives are wait-free (in some cases, having zero overhead), and for some RCU algorithms, the updates are non-blocking (and are wait-free on systems supporting an atomic exchange instruction). The well-known per-thread split counter approach provides a simple example of a wait-free RP algorithm.
extremely low overhead to parts, or, in some cases, all of the RP algorithm. RCU is again a case in point for readers, and, in a few special cases, to updaters as well.
linear scalability to parts, and, in a few special cases, all of the RP algorithm. RCU readers and per-thread-counter updaters are again examples.
substantial reductions in lock contention and memory contention.
simplicity, as evidenced by the Linux kernel community's successful adoption of RCU (and split counters).
Although isolated examples of RP algorithms have appeared in the literature going back at least to 1980, recent work is distinguished by a practical conceptual framework that permits practitioners to reap the benefits of these algorithms in the surprisingly large number of situations in which they apply.
Other examples of RP include split counters for statistics, per-CPU/thread event logging/history, work-stealing schedulers using private per-CPU/thread work queues in addition to shared work queues, memory allocation, and set membership. Set membership can be thought of as a generalization of event logging/history, work queues, and memory allocation.
RP Tolerates a Lack of a Global Order on Events
Although we are not aware of very many algorithms that tolerate a complete lack of order, there are no shortage of algorithms that operate perfectly correctly without a global order. Common non-total ordering constraints include the following:
Own Accesses Seen in Order
A given thread is constrained to see its own memory accesses as having occurred in program order.
This constraint is usually implicit.
Aligned Loads and Stores Are Atomic
Properly aligned loads and stores of certain properly aligned small variables (e.g., pointers) is atomic. In other words, if a load and store to the same such variable execute concurrently, the load will see either the initial value or the value stored, but not some mashup of the two values. All threads will agree on the order of values taken on by such a variable ("cache coherence").
This constraint is usually implicit, though there is some work on RP algorithms for cache-incoherent systems.
This property is captured by C++0x's "relaxed atomic" facility. Lamport identified it as early as 1986. It is also required by the Linux kernel.
A specific store executed by given thread may be constrained to be ordered after memory accesses preceding it in that thread.
This property is captured by C++0x's notion of "store release", and by the Linux kernel's rcu_assign_pointer() primitive.
A specific load executed by a given thread may be constrained to be ordered before memory accesses following it in that thread.
This property is captured by C++0x's notion of "load acquire".
Data Dependency Ordering
A specific load of a pointer executed by a given thread may be constrained to be ordered before memory accesses whose addresses are computed from the pointer value.
This property is captured by C++0x's notion of "dependency ordering" and by the Linux kernel's rcu_dereference() primitive.
All of a given thread's memory accesses preceding a pre-specified point in that thread's execution may be constrained to be ordered before memory accesses following that same point.
This property is captured by C++0x's notion of "memory fences", and by the Linux kernel's smp_mb() primitive. That said, there are many variations on memory fences or "barriers":
- "Lightweight" or "acquire/release" fences order all accesses except that stores preceding the fence are not ordered against loads following the fence.
- Other types of fences allow combinations of load/load, load/store, store/load, and store/store constraints. For example, the PowerPC lightweight fence enforces load/load, load/store, and store/store constraints.
- Fences may or may not enforce transitive ordering.
Given pairs of code sequences, possibly in different threads, are constrained to execute atomically with respect to each other. Such code sequences are commonly called "critical sections", and the required atomicity can be enforced by locking. More recent work has focused on "atomic sections" or "transactions" to enforce atomicity.
A given code sequence "R", if it reads a pointer to a structure that is being concurrently initialized and inserted by another code sequence "I", will be constrained such that "R" follows "I". This is the basic publication guarantee provided by RCU.
A given code sequence "R", if it runs concurrently with another code sequence (possibly in another thread) "M", will be constrained to be ordered before any code sequence "F", such that that there is a "grace period" between "M" and "F". This is the basic retraction guarantee provided by RCU.
In an RCU algorithm, code sequence "R" would be an RCU read-side critical section (hence preceded by an rcu_read_lock() and followed by an rcu_read_unlock()), code sequence "M" would mutate the RCU-protected data structure, and code sequence "F" would typically free up data elements removed by "M". A grace period is enforced by synchronize_rcu() or call_rcu().
Ordering Constraints: Discussion
If an RP algorithm does not specify an ordering constraint, then the enclosing software environment is free to choose an arbitrary reordering.
A given RP algorithm may have different ordering constraints in different portions of that algorithm. Primitives such as those noted above can help to enforce these constraints when and as required, leaving the compiler and the CPU free to more aggressively optimize other portions of the algorithm. These partial orderings, particularly the thread-specific ones, are reminiscent of the handling of time in relativistic physics, hence the name.
The practice of RP consists of using combinations of the above constraints to obtain a simple, fast, and scalable program in those situations to which RP applies.
Tolerance of Conflicts Between Concurrent Threads
Another important aspect of RP algorithms is that they are designed to tolerate conflicts between concurrent threads, so that one thread might be updating a given location while some other thread is concurrently reading it. For one example, RCU tolerates conflicts via its publication and retraction guarantees. If a reader and an updater conflict, these guarantees ensure that the reader is presented with valid data despite the conflict. For a second example, per-thread counters rely on the basic properties of arithmetic, a tolerance for approximate results, and atomic loads and stores to ensure that readers are presented with valid and acceptably precise data despite conflicts with concurrent updaters.
This tolerance of conflicts is essential to RP algorithms' combination of simplicity and wait freedom: failing to tolerate conflicts has the following disadvantages:
- Latency. If conflicts cannot be tolerated, algorithms must either avoid or respond to conflicts, either of which introduces delay, either by blocking, by rollback/retry, or by other delays.
- Complexity. In absence of RP, complex conflict-avoidance algorithms are generally required to maintain wait freedom.
In the large number of cases where it applies, RP enables low latency (along with high throughput and excellent scalability) with low complexity. The many uses of RCU and related techniques in the Linux kernel is a testaments to RP's strengths.
What challenges does RP introduce?
RP introduces several new challenges. First, the temporal reference frames of different processors require multiple versions of memory to be maintained. In the absence of infinite memory, garbage collection technology is required to recycle old versions of memory. This technology must function in such a way as to sustain the performance and scalability benefits of RP. Specifically, it must not require frequent interprocessor interactions nor use expensive atomic instructions or memory barriers in the conflict tolerant execution sequences. Second, it requires a programming paradigm shift on the part of the concurrent programmer. The programming paradigm is not difficult to master, but it is new. Example uses of RP exist for certain concurrent data types but full generalization of the approach is still an active area of research.
Does it scale well?
RP has extremely good performance and scalability properties. Many uses of RP in the Linux Kernel have resulted in several orders of magnitude performance improvement compared to locking and transactional memory.
Is it easy to program with?
RP is not difficult to program with. Allowing each execution sequence to proceed using its own view of memory, by default, simplifies concurrent programming because it prevents memory from changing spontaneously. Threads are guaranteed to observe coherent memory, i.e., memory will contain values that were actually written at some time in the past. Read paths also exhibit deterministic performance characteristics, since they can not block or retry. This feature simplifies programming of time-sensitive systems. Nevertheless, RP is a new programming paradigm with a new interface and there are several ways to misuse it. Many of the misuses we have observed so far can be caught through static analysis. Development of static analysis tools for RP is an active area of research within our project.
How much real-world use has it seen?
Read Copy Update (RCU), an early example of RP, has seen extensive real-world use in the Linux Kernel. The number of uses of RCU in the Linux Kernel (as of 2.6.37) stands at over 4500. It has been used successfully by numerous developers across several different sub-systems in Linux. It has also seen production use in other operating systems including DYNIX/ptx, IBM's VM hypervisor layer and the Tornado and K42 operating systems. In many cases it has led to substantial performance improvements and code simplification.
A user-level RCU implementation is also available.
Other examples of RP have also seen heavy use over the past few decades, including the afore-mentioned split counters for statistics, per-CPU/thread event logging/history, and memory allocation.
- Robert Bauer - PhD Student (PSU)
- Ahmed Badran - MS Student (PSU)
- Jim Cotillier - PhD Student (PSU)
- Mathieu Desnoyers - Consultant, EfficiOS (Ph.D. Ecole Polytechnique de Montréal)
- James Hook - Associate Professor (PSU)
- Phil Howard - PhD Student (PSU)
- Paul McKenney - Distinguished Engineer, IBM Linux Technology Center
- Josh Triplett - PhD Student (PSU)
- Jonathan Walpole - Professor (PSU), Project PI
- Eric Wheeler - PhD Student (PSU)
- Suzanne Wood - MS Student (PSU)
"User-Level Implementations of Read-Copy Update and supplementary materials,", Mathieu Desnoyers, Paul E. McKenney, Alan S. Stern, Michel R. Dagenais, and Jonathan Walpole, to appear in IEEE Transactions on Parallel and Distributed Systems.
"Introducing technology into the Linux kernel: a case study," Paul E. McKenney and Jonathan Walpole, SIGOPS Oper. Syst. Rev., Volume 42, Number 5, July 2008.
"The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux," D. Guniguntala, P.E. McKenney, J. Triplett, and J. Walpole, IBM Systems Journal, Volume 47, Number 2, 2008.
"What is RCU, Fundamentally?," Paul E. McKenney and Jonathan Walpole In Linux Weekly News, December 27, 2007
"Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking and Transactional Memory," Paul E. McKenney, Maged M. Michael, and Jonathan Walpole In Proceedings of the 4th Workshop on Programming Languages and Operating Systems (PLOS 2007), Skamania Lodge, WA, October 2007. (talk slides )
"Performance of Memory Reclamation for Lockless Synchronization" Thomas E Hart, Paul E McKenney, Angela Demke Brown, and Jonathan Walpole, to appear in the Journal of Parallel and Distributed Computing.
"Exploiting Deferred Destruction: An Analysis of Read-Copy-Update Techniques" Paul McKenney, Ph.D. Thesis, OGI School of Science and Engineering, OHSU, July 2004. (presentation)