Add GC deadlock discussion
Only comment and markdown changes.
Bug: 195336624
Bug: 195261575
Test: Treehugger
Change-Id: I3118ab4009c7f31006b62714ee36b5287f33aa3f
diff --git a/runtime/base/locks.h b/runtime/base/locks.h
index 7008539..24fb2e0 100644
--- a/runtime/base/locks.h
+++ b/runtime/base/locks.h
@@ -48,8 +48,8 @@
kJniIdLock,
kNativeDebugInterfaceLock,
kSignalHandlingLock,
- // A generic lock level for mutexs that should not allow any additional mutexes to be gained after
- // acquiring it.
+ // A generic lock level for mutexes that should not allow any additional mutexes to be gained
+ // after acquiring it.
kGenericBottomLock,
// Tracks the second acquisition at the same lock level for kThreadWaitLock. This is an exception
// to the normal lock ordering, used to implement Monitor::Wait - while holding one kThreadWait
diff --git a/runtime/gc/reference_processor.h b/runtime/gc/reference_processor.h
index 54de5cc..51c5054 100644
--- a/runtime/gc/reference_processor.h
+++ b/runtime/gc/reference_processor.h
@@ -100,7 +100,8 @@
// If this is true, then we cannot return a referent (see comment in GetReferent).
bool preserving_references_ GUARDED_BY(Locks::reference_processor_lock_);
// Condition that people wait on if they attempt to get the referent of a reference while
- // processing is in progress.
+ // processing is in progress. Broadcast when an empty checkpoint is requested, but not for other
+ // checkpoints or thread suspensions. See mutator_gc_coord.md.
ConditionVariable condition_ GUARDED_BY(Locks::reference_processor_lock_);
// Reference queues used by the GC.
ReferenceQueue soft_reference_queue_;
diff --git a/runtime/mutator_gc_coord.md b/runtime/mutator_gc_coord.md
index 84327c0..7581ff7 100644
--- a/runtime/mutator_gc_coord.md
+++ b/runtime/mutator_gc_coord.md
@@ -120,7 +120,7 @@
`RunCheckpoint()` does not wait for completion of the function calls triggered by
the resulting `RequestCheckpoint()` invocations.
-**Empty Checkpoints**
+**Empty checkpoints**
: ThreadList provides `RunEmptyCheckpoint()`, which waits until
all threads have either passed a suspend point, or have been suspended. This
ensures that no thread is still executing Java code inside the same
@@ -135,6 +135,73 @@
thread(s) are suspended (again, only in the sense of not running Java code)
when the call returns.
+Deadlock freedom
+----------------
+
+It is easy to deadlock while attempting to run checkpoints, or suspending
+threads. In particular, we need to avoid situations in which we cannot suspend
+a thread because it is blocked, directly, or indirectly, on the GC completing
+its task. Deadlocks are avoided as follows:
+
+**Mutator lock ordering**
+The mutator lock participates in the normal ART lock ordering hierarchy, as though it
+were a regular lock. See `base/locks.h` for the hierarchy. In particular, only
+locks at or below level `kPostMutatorTopLockLevel` may be acquired after
+acquiring the mutator lock, e.g. inside the scope of a `ScopedObjectAccess`.
+Similarly only locks at level strictly above `kMutatatorLock`may be held while
+acquiring the mutator lock, e.g. either by starting a `ScopedObjectAccess`, or
+ending a `ScopedThreadSuspension`.
+
+This ensures that code that uses purely mutexes and threads state changes cannot
+deadlock: Since we always wait on a lower-level lock, the holder of the
+lowest-level lock can always progress. An attempt to initiate a checkpoint or to
+suspend another thread must also be treated as an acquisition of the mutator
+lock: A thread that is waiting for a lock before it can respond to the request
+is itself holding the mutator lock, and can only be blocked on lower-level
+locks. And acquisition of those can never depend on acquiring the mutator
+lock.
+
+**Waiting**
+This becomes much more problematic when we wait for something other than a lock.
+Waiting for something that may depend on the GC, while holding the mutator lock,
+can potentially lead to deadlock, since it will prevent the waiting thread from
+participating in GC checkpoints. Waiting while holding a lower-level lock like
+`thread_list_lock_` is similarly unsafe in general, since a runnable thread may
+not respond to checkpoints until it acquires `thread_list_lock_`. In general,
+waiting for a condition variable while holding an unrelated lock is problematic,
+and these are specific instances of that general problem.
+
+We do currently provide `WaitHoldingLocks`, and it is sometimes used with
+low-level locks held. But such code must somehow ensure that such waits
+eventually terminate without deadlock.
+
+One common use of WaitHoldingLocks is to wait for weak reference processing.
+Special rules apply to avoid deadlocks in this case: Such waits must start after
+weak reference processing is disabled; the GC may not issue further nonempty
+checkpoints or suspend requests until weak reference processing has been
+reenabled, and threads have been notified. Thus the waiting thread's inability
+to respond to nonempty checkpoints and suspend requests cannot directly block
+the GC. Non-GC checkpoint or suspend requests that target a thread waiting on
+reference processing will block until reference processing completes.
+
+Consider a case in which thread W1 waits on reference processing, while holding
+a low-level mutex M. Thread W2 holds the mutator lock and waits on M. We avoid a
+situation in which the GC needs to suspend or checkpoint W2 by briefly stopping
+the world to disable weak reference access. During the stop-the-world phase, W1
+cannot yet be waiting for weak-reference access. Thus there is no danger of
+deadlock while entering this phase. After this phase, there is no need for W2 to
+suspend or execute a nonempty checkpoint. If we replaced the stop-the-world
+phase by a checkpoint, W2 could receive the checkpoint request too late, and be
+unable to respond.
+
+Empty checkpoints can continue to occur during reference processing. Reference
+processing wait loops explicitly handle empty checkpoints, and an empty
+checkpoint request notifies the condition variable used to wait for reference
+processing, after acquiring `reference_processor_lock_`. This means that empty
+checkpoints do not preclude client threads from being in the middle of an
+operation that involves a weak reference access, while nonempty checkpoints do.
+
+
[^1]: Some comments in the code refer to a not-yet-really-implemented scheme in
which the compiler-generated code would load through the address at
`tlsPtr_.suspend_trigger`. A thread suspension is requested by setting this to