Add GC deadlock discussion

Only comment and markdown changes. Bug: 195336624 Bug: 195261575 Test: Treehugger Change-Id: I3118ab4009c7f31006b62714ee36b5287f33aa3f
author: Hans Boehm <hboehm@google.com> 2021-08-16 16:50:28 -0700
committer: Hans Boehm <hboehm@google.com> 2021-08-18 17:14:17 +0000
commit: 0529cfa3efd0f6f7a167aa833fce99d6cf29a518 (patch)
tree: e0079b2b55ecb84f11582eb65387d92cc36e570c
parent: 555eefef9a27995ef341cdf44ed60c61953e2e3f (diff)
3 files changed, 72 insertions, 4 deletions
diff --git a/runtime/base/locks.h b/runtime/base/locks.h
index 7008539154..24fb2e0518 100644
--- a/runtime/base/locks.h
+++ b/runtime/base/locks.h
@@ -48,8 +48,8 @@ enum LockLevel : uint8_t {
   kJniIdLock,
   kNativeDebugInterfaceLock,
   kSignalHandlingLock,
-  // A generic lock level for mutexs that should not allow any additional mutexes to be gained after
-  // acquiring it.
+  // A generic lock level for mutexes that should not allow any additional mutexes to be gained
+  // after acquiring it.
   kGenericBottomLock,
   // Tracks the second acquisition at the same lock level for kThreadWaitLock. This is an exception
   // to the normal lock ordering, used to implement Monitor::Wait - while holding one kThreadWait
diff --git a/runtime/gc/reference_processor.h b/runtime/gc/reference_processor.h
index 54de5cc572..51c50544bf 100644
--- a/runtime/gc/reference_processor.h
+++ b/runtime/gc/reference_processor.h
@@ -100,7 +100,8 @@ class ReferenceProcessor {
   // If this is true, then we cannot return a referent (see comment in GetReferent).
   bool preserving_references_ GUARDED_BY(Locks::reference_processor_lock_);
   // Condition that people wait on if they attempt to get the referent of a reference while
-  // processing is in progress.
+  // processing is in progress. Broadcast when an empty checkpoint is requested, but not for other
+  // checkpoints or thread suspensions. See mutator_gc_coord.md.
   ConditionVariable condition_ GUARDED_BY(Locks::reference_processor_lock_);
   // Reference queues used by the GC.
   ReferenceQueue soft_reference_queue_;
diff --git a/runtime/mutator_gc_coord.md b/runtime/mutator_gc_coord.md
index 84327c062c..7581ff7373 100644
--- a/runtime/mutator_gc_coord.md
+++ b/runtime/mutator_gc_coord.md
@@ -120,7 +120,7 @@ suspended thread stays suspended, and then running the function on its behalf.
 `RunCheckpoint()` does not wait for completion of the function calls triggered by
 the resulting `RequestCheckpoint()` invocations.
 
-**Empty Checkpoints**
+**Empty checkpoints**
 : ThreadList provides `RunEmptyCheckpoint()`, which waits until
 all threads have either passed a suspend point, or have been suspended. This
 ensures that no thread is still executing Java code inside the same
@@ -135,6 +135,73 @@ a `SuspendAll()` call to suspend one or all threads until they are resumed by
 thread(s) are suspended (again, only in the sense of not running Java code)
 when the call returns.
 
+Deadlock freedom
+----------------
+
+It is easy to deadlock while attempting to run checkpoints, or suspending
+threads. In particular, we need to avoid situations in which we cannot suspend
+a thread because it is blocked, directly, or indirectly, on the GC completing
+its task. Deadlocks are avoided as follows:
+
+**Mutator lock ordering**
+The mutator lock participates in the normal ART lock ordering hierarchy, as though it
+were a regular lock. See `base/locks.h` for the hierarchy. In particular, only
+locks at or below level `kPostMutatorTopLockLevel` may be acquired after
+acquiring the mutator lock, e.g. inside the scope of a `ScopedObjectAccess`.
+Similarly only locks at level strictly above `kMutatatorLock`may be held while
+acquiring the mutator lock, e.g. either by starting a `ScopedObjectAccess`, or
+ending a `ScopedThreadSuspension`.
+
+This ensures that code that uses purely mutexes and threads state changes cannot
+deadlock: Since we always wait on a lower-level lock, the holder of the
+lowest-level lock can always progress. An attempt to initiate a checkpoint or to
+suspend another thread must also be treated as an acquisition of the mutator
+lock: A thread that is waiting for a lock before it can respond to the request
+is itself holding the mutator lock, and can only be blocked on lower-level
+locks. And acquisition of those can never depend on acquiring the mutator
+lock.
+
+**Waiting**
+This becomes much more problematic when we wait for something other than a lock.
+Waiting for something that may depend on the GC, while holding the mutator lock,
+can potentially lead to deadlock, since it will prevent the waiting thread from
+participating in GC checkpoints. Waiting while holding a lower-level lock like
+`thread_list_lock_` is similarly unsafe in general, since a runnable thread may
+not respond to checkpoints until it acquires `thread_list_lock_`. In general,
+waiting for a condition variable while holding an unrelated lock is problematic,
+and these are specific instances of that general problem.
+
+We do currently provide `WaitHoldingLocks`, and it is sometimes used with
+low-level locks held. But such code must somehow ensure that such waits
+eventually terminate without deadlock.
+
+One common use of WaitHoldingLocks is to wait for weak reference processing.
+Special rules apply to avoid deadlocks in this case: Such waits must start after
+weak reference processing is disabled; the GC may not issue further nonempty
+checkpoints or suspend requests until weak reference processing has been
+reenabled, and threads have been notified. Thus the waiting thread's inability
+to respond to nonempty checkpoints and suspend requests cannot directly block
+the GC. Non-GC checkpoint or suspend requests that target a thread waiting on
+reference processing will block until reference processing completes.
+
+Consider a case in which thread W1 waits on reference processing, while holding
+a low-level mutex M. Thread W2 holds the mutator lock and waits on M. We avoid a
+situation in which the GC needs to suspend or checkpoint W2 by briefly stopping
+the world to disable weak reference access. During the stop-the-world phase, W1
+cannot yet be waiting for weak-reference access.  Thus there is no danger of
+deadlock while entering this phase. After this phase, there is no need for W2 to
+suspend or execute a nonempty checkpoint. If we replaced the stop-the-world
+phase by a checkpoint, W2 could receive the checkpoint request too late, and be
+unable to respond.
+
+Empty checkpoints can continue to occur during reference processing.  Reference
+processing wait loops explicitly handle empty checkpoints, and an empty
+checkpoint request notifies the condition variable used to wait for reference
+processing, after acquiring `reference_processor_lock_`.  This means that empty
+checkpoints do not preclude client threads from being in the middle of an
+operation that involves a weak reference access, while nonempty checkpoints do.
+
+
 [^1]: Some comments in the code refer to a not-yet-really-implemented scheme in
 which the compiler-generated code would load through the address at
 `tlsPtr_.suspend_trigger`. A thread suspension is requested by setting this to
author	Hans Boehm <hboehm@google.com>	2021-08-16 16:50:28 -0700
committer	Hans Boehm <hboehm@google.com>	2021-08-18 17:14:17 +0000
commit	0529cfa3efd0f6f7a167aa833fce99d6cf29a518 (patch)
tree	e0079b2b55ecb84f11582eb65387d92cc36e570c
parent	555eefef9a27995ef341cdf44ed60c61953e2e3f (diff)