RCU To-Do List

Things on Paul's Reasonably Immediate List

  1. Fix additional bugs in TREE_PREEMPT_RCU as they arise. And there has been one or two of them. Chasing them down was more entertaining than usual.
  2. Create a small-memory-footprint RCU for single-CPU embedded devices. A starting point may be found here, with the LWN writeup here. This is now in -tip (commit 9b1d82fa1611706fa7ee1505f290160a18caf95d).
  3. Fix TREE_PREEMPT_RCU's implementation of synchronize_rcu_expedited(), with a hack for 2.6.32 (commit 019129d595caaa5bd0b41d128308da1be6a91869) and real fix in 2.6.33. The real fix is finally passing more than a few seconds of rcutorture testing, so there is hope, but more likely for 2.6.34 than 2.6.33.
  4. Create a TINY_PREEMPT_RCU, then drive the choice of RCU implementation directly off of CONFIG_SMP and CONFIG_PREEMPT.
  5. Getting Preemptable RCU priority boosting to mainline.
  6. Forbid grace periods from starting while force_quiescent_state() is running. This would have little or no effect on performance, while greatly simplifying race conditions.
  7. Get rid of CONFIG_RCU_CPU_STALL_DETECTOR, thus making stall detection unconditional. The overhead of this option is negligible, it has proven quite useful, and making it unconditional would reduce the number of combinations of configuration parameters that must be tested.
  8. ftrace plugins for RCU.
  9. Allow RCU read-side critical sections in idle loops. This has come up enough times over the past several years that it is time to go do it.

Software-Engineering Enhancements

  1. Make a “fire and forget” primitive that kfree()'s the specified memory block after a grace period elapses. Lai Jiangshan and Manfred Spraul are working on patches for this, see Lai's and Manfred's patches.
  2. Make a module parameter that controls the frequency with which force_quiescent_state() is invoked. Increasing this frequency can be useful when torture-testing RCU. But this can be done with a simple patch, so not sure that this is really worth it.
  3. Make a call_rcu_blocking(), call_rcu_blocking_bh(), and call_rcu_blocking_sched() that invoke synchronize_rcu() (or the corresponding API) if callbacks are backing up. However, the base call_rcu() family of functions already have checks to accelerate grace periods, and these primitives have extremely simple implementations, so they can wait until there is a strong need.

Optimizations

  1. Make rcutree track the largest CPU, and to restrict scans to that number. This change would be helpful in the common case where NR_CPUS is much larger than the actual number of CPUs.
  2. Can we remove one or more of the ACCESS_ONCE() primitives in TREE_PREEMPT_RCU's __rcu_read_lock() and __rcu_read_unlock()?
  3. Consider adding checks for RT threads to __call_rcu() so that we can have deterministic RCU updates.
  4. It may be possible to accelerate TREE_PREEMPT_RCU grace periods based on the fact that a CPU can unambiguously determine whether or not it is in an RCU read-side critical section at a specific point in time. A similar trick might also work (albeit probabilistically) for CONFIG_PREEMPT builds of TREE_RCU. In both cases, the first guess would be that note_new_gpnum() would be a good place for such an optimization.
  5. Reduce the need for force_quiescent_state() to send IPIs by making rcu_check_callbacks() invoke the much cheaper set_need_resched().
  6. Attempt to make RCU independent of the scheduling-clock interrupt, in order to reduce OS jitter.

Simple Cleanups

  1. Make rcupreempt_trace.c use seqfile. Li Zefan (lizf@cn.fujitsu.com) is taking this on, but using Frederic Weisbecker's statistical ftrace. Li Zefan's earlier non-ftrace patch may be found here.
  2. Simplify the handling of memory barriers in TREE_RCU's dyntick interface by creating a PUBLISH_AND_STORE() and a LOAD_AND_ACQUIRE() primitive. This should have the same readability benefits as did the change from naked memory barriers to rcu_assign_pointer() and rcu_dereference().
  3. It appears that some versions of gcc do not like forward references to static inline functions. But kernel/rcutree_plugin.h really wants to define forward-referenced static inline functions. What is the best way to handle this?
    1. Tell people to tune their gcc warnings down a bit.
    2. Remove the "inline" modifiers.
    3. Move the #include of kernel/rcu_plugin.h back up near the front of kernel/rcutree.c, bringing back the ugly list of functions referenced by the plugins.
    Removing the "inline" modifiers seems best at the moment, and is what commit dbe01350fa8ce0c11948ab7d6be71a4d901be151 in -tip does.

Inspection

  1. Look for places where people (wrongly, for preemptable RCU) assume that rcu_read_lock() disables preemption.
  2. Look for places where people assume that disabling preemption and/or irqs makes rcu_read_lock() unnecessary.
  3. Make sure Documentation/RCU/*.txt files better match reality.

Possibly Dubious Changes

  1. Remove the function pointer from the struct rcu_head and group these structures by function in order to:
    1. shrink the size of the struct rcu_head, and
    2. improve the efficiency of callback processing by grouping the calls though pointers.
    This has some non-trivial consequences, please see the discussion between Manfred and Paul for some of the issues involved.

    Another approach is to maintain a separate queue that tracks the functions to be invoked and the memory blocks to invoke them on, as Mathieu Desnoyers suggested. This means that the new defer_rcu() primitive can now block, either when waiting for a grace period to elapse or when allocating new memory. Alternatively, this primitive could return an error if the callback could not be enqueued immediately. Of course, the queue of callbacks must be present, so that memory savings will depend on the workload.

  2. Make rcutree keep track of quiescent states that a CPU passes through after a given grace period starts, but before that CPU is aware that a new grace period has started. The goal would be to reduce grace-period latency, but it is not clear how helpful this change would be. It is clear that this change could be quite difficult to get right.
  3. Change the size of the grace-period-number counter from long to int. I do not currently believe that this change is worthwhile, as the memory savings are miniscule, it could result in worse code on 64-bit systems, and it requires a change to the exported rcu_batches_completed() and rcu_batches_completed_bh() APIs.
  4. Replace Hierarchical RCU's manual bitmask scans with something like ffs() or ffsl(). Although this seems unlikely to make the code measureably faster, it might simplify the code a little bit.

RCU Done List

  1. Fixed an RCU performance bug located by Nick Piggin in which RCU takes a big performance hit trying to push RCU callbacks through the system. In 2.6.32 commit 37c72e56f6b234ea7387ba530434a80abf2658d8.
  2. Fold rcu_pending() into rcu_check_callbacks(). In 2.6.32-rc1 commit a157229cabd6dd8cfa82525fc9bf730c94cc9ac2.
  3. Rename rcu_qsctr_inc() to rcu_report_qs() and rcu_bh_qsctr_inc() to rcu_bh_report_qs(). Actually remapped differently as a result of making rcu_sched, rcu_bh, and rcu_preempt (when present) the basic RCU implementations. This is in 2.6.32-rc1 commit d6714c22b43fbcbead7e7b706ff270e15f04a791.
  4. Expedited grace periods. This is in 2.6.31 commit 03b042bf1dc14a268a3d65d38b4ec2a4261e8477. This implementation does not expedite preemptable RCU, But that is on the way.
  5. Additional work on user-level RCU implementations. Numerous variants in flight, no shortage! Best place to start is git://lttng.org/userspace-rcu.git, or, if you prefer tarballs, here.
  6. Applying lessons learned from the Hierarchical RCU experience to Preemptable RCU. This also includes moving the preemptable RCU state machine out of the scheduling-clock interrupt handler, speeding up the read-side primitives, and speeding up the grace periods. Also cutting about a thousand lines of code. See commit f41d911f8c49a5d65c86504c19e8204bb605c4fd.
  7. Make rcutree's synchronize_rcu() be barrier() for !SMP builds. Commit a682604838763981613e42015cd0e39f2989d6bb handles that and all the other RCU implementations while it is at it.
  8. Lai Jiangshan fixed a long-standing CPU-hotplug race in rcu_barrier(). Commit f69b17d7e745d8edd7c0d90390cbaa77e63c5ea3.
  9. Stomping out Hierarchical RCU bugs!!!
    1. Grace-period latency on small systems. Commit c12172c0251761c54260376eb29a5f6547495580.
    2. Suspend/resume issues. Commit 90a4d2c0106bb690f0b6af3d506febc35c658aa7.
    3. A couple of rcutorture bugs. Commits c59ab97e9ecdee9084d2da09e5a8ceea9a396508 and c9d557c19f94df42db78d4a5de4d25feee694bad.
    4. Commit a682604838763981613e42015cd0e39f2989d6bb teaching RCU that the idle task is not a queiscent state at boot time. (Actually a bug for all flavors of RCU, not just hierarchical RCU.)
    5. Commit ef631b0ca01655d24e9ca7e199262c4a46416a26 making RCU less IPI-happy.
  10. The C standard specifies “undefined behavior” for overflow of signed integral variables. Could this possibly bite us? Inspect and consider changing some of the long variables to unsigned long. First attempts to do this were quite ugly, so am happy to assume that all machines that Linux runs on will be twos-complement machines.