author		2022-07-19 09:07:02 -0700
committer	2022-09-09 18:08:43 +0000
commit		7c39c86b17c91e651de1fcc0876fe5565e3f5204 (patch)
tree		eaa145c61c1cc65cabeeb5496cd73e4fa45f9887 /runtime/thread_list.h
parent		3d0d9ab345998eb64b55bdd254c2d001ceec78c6 (diff)
Thread suspension cleanup and deadlock fix
Have SuspendAll check that no other SuspendAlls are running, and have
it refuse to start if the invoking thread is also being suspended.
This limits us to a single SuspendAll call at a time,
but that was almost required anyway. It limits us to a single active
SuspendAll-related suspend barrier, a large simplification.
It appears to me that this avoids some cyclic suspension scenarios
that were previously still possible.
Move the deadlock-avoidance checks for flip_function to
ModifySuspendCount callers instead of failing there.
Make single-thread suspension use suspend barriers to avoid the
complexity of having to reaccess the thread data structure from another
thread while waiting for it to suspend. Add a new data structure to
remember the single thread suspension barriers without allocating
memory. This uses a linked list of stack-allocated data structures,
as in MCS locks.
The combination of the above avoids a suspend_barrier data structure
that can overflow, and removes any possibility of ModifySuspendCount
needing to fail and retry. Recombine ModifySuspendCount and
ModifySuspendCountInternal.
Simplify the barrier decrement in PassActiveSuspendBarriers, and strengthen
its relaxed memory order, which was probably wrong.
Fix the "ignored" logic in SuspendAllInternal: we only ever ignored self,
and ResumeAll didn't support anything else anyway, so have SuspendAll and
friends only ignore self. Explicitly assume that the initiating thread, if
not null, is registered.
Make flip_function atomic<>, since it could be read while being cleared.
Remove the poorly used timed_out parameter from the SuspendThreadByX()
functions.
Make IsSuspended read with acquire semantics; we often count on having
the target thread suspended after that, including having its prior
effects on the Java state visible. The TransitionTo... functions already
use acquire/release.
Shrink the retry loop in RequestSynchronousCheckpoint. Retrying the
whole loop appeared to have no benefit over the smaller loop.
Clarify the behavior of RunCheckpoint with respect to the mutator
lock.
Split up ModifySuspendCount into IncrementSuspendCount and
DecrementSuspendCount for improved clarity. This is not quite a
semantic no-op since it eliminates some redundant work when
decrementing a suspend count to a nonzero value. (Thanks to
Mythri for the suggestion.)
I could not convince myself that RequestCheckpoint returned false
only if the target thread was already suspended; there seemed to
be no easy way to preclude the state_and_flags compare-exchange
failing for other reasons. Yet callers seemed to assume that property.
Change the implementation to make that property clearly true.
Various trivial cleanups.
This hopefully reduces thread suspension deadlocks in general.
We've seen a bunch of other bugs that may have been due to the cyclic
suspension issues. At least this should make bug diagnosis easier.
Test: ./art/test/testrunner/testrunner.py --host --64 -b
Test: Build and boot AOSP
Bug: 240742796
Bug: 203363895
Bug: 238032384
Change-Id: Ifc2358dd6489c0b92f4673886c98e45974134941
Diffstat (limited to 'runtime/thread_list.h')
-rw-r--r--	runtime/thread_list.h | 29
1 file changed, 16 insertions, 13 deletions
diff --git a/runtime/thread_list.h b/runtime/thread_list.h
index 29b0c52186..a17a11b3ce 100644
--- a/runtime/thread_list.h
+++ b/runtime/thread_list.h
@@ -70,7 +70,7 @@ class ThreadList {
   bool Resume(Thread* thread, SuspendReason reason = SuspendReason::kInternal)
       REQUIRES(!Locks::thread_suspend_count_lock_) WARN_UNUSED;
 
-  // Suspends all threads and gets exclusive access to the mutator lock.
+  // Suspends all other threads and gets exclusive access to the mutator lock.
   // If long_suspend is true, then other threads who try to suspend will never timeout.
   // long_suspend is currenly used for hprof since large heaps take a long time.
   void SuspendAll(const char* cause, bool long_suspend = false)
@@ -81,10 +81,8 @@ class ThreadList {
   // Suspend a thread using a peer, typically used by the debugger. Returns the thread on success,
   // else null. The peer is used to identify the thread to avoid races with the thread terminating.
-  // If the suspension times out then *timeout is set to true.
   Thread* SuspendThreadByPeer(jobject peer,
-                              SuspendReason reason,
-                              bool* timed_out)
+                              SuspendReason reason)
       REQUIRES(!Locks::mutator_lock_,
                !Locks::thread_list_lock_,
                !Locks::thread_suspend_count_lock_);
@@ -92,8 +90,8 @@ class ThreadList {
   // Suspend a thread using its thread id, typically used by lock/monitor inflation. Returns the
   // thread on success else null. The thread id is used to identify the thread to avoid races with
   // the thread terminating. Note that as thread ids are recycled this may not suspend the expected
-  // thread, that may be terminating. If the suspension times out then *timeout is set to true.
-  Thread* SuspendThreadByThreadId(uint32_t thread_id, SuspendReason reason, bool* timed_out)
+  // thread, that may be terminating.
+  Thread* SuspendThreadByThreadId(uint32_t thread_id, SuspendReason reason)
       REQUIRES(!Locks::mutator_lock_,
                !Locks::thread_list_lock_,
                !Locks::thread_suspend_count_lock_);
@@ -113,8 +111,9 @@ class ThreadList {
   // Running threads are not suspended but run the checkpoint inside of the suspend check. The
   // return value includes already suspended threads for b/24191051. Runs or requests the
   // callback, if non-null, inside the thread_list_lock critical section after determining the
-  // runnable/suspended states of the threads. Does not wait for completion of the callbacks in
-  // running threads.
+  // runnable/suspended states of the threads. Does not wait for completion of the checkpoint
+  // function in running threads. If the caller holds the mutator lock, then all instances of the
+  // checkpoint function are run with the mutator lock.
   size_t RunCheckpoint(Closure* checkpoint_function, Closure* callback = nullptr)
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
@@ -197,12 +196,15 @@ class ThreadList {
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
 
   void SuspendAllInternal(Thread* self,
-                          Thread* ignore1,
-                          Thread* ignore2 = nullptr,
                           SuspendReason reason = SuspendReason::kInternal)
-      REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
+      REQUIRES(!Locks::thread_list_lock_,
+               !Locks::thread_suspend_count_lock_,
+               !Locks::mutator_lock_);
+
+  // Wait for suspend barrier to reach zero. Return false on timeout.
+  bool WaitForSuspendBarrier(AtomicInteger* barrier);
 
-  void AssertThreadsAreSuspended(Thread* self, Thread* ignore1, Thread* ignore2 = nullptr)
+  void AssertOtherThreadsAreSuspended(Thread* self)
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
 
   std::bitset<kMaxThreadId> allocated_ids_ GUARDED_BY(Locks::allocated_thread_ids_lock_);
@@ -211,6 +213,7 @@ class ThreadList {
   std::list<Thread*> list_ GUARDED_BY(Locks::thread_list_lock_);
 
   // Ongoing suspend all requests, used to ensure threads added to list_ respect SuspendAll.
+  // The value is always either 0 or 1.
   int suspend_all_count_ GUARDED_BY(Locks::thread_suspend_count_lock_);
 
   // Number of threads unregistering, ~ThreadList blocks until this hits 0.
@@ -218,7 +221,7 @@ class ThreadList {
   // Thread suspend time histogram. Only modified when all the threads are suspended, so guarding
   // by mutator lock ensures no thread can read when another thread is modifying it.
-  Histogram<uint64_t> suspend_all_historam_ GUARDED_BY(Locks::mutator_lock_);
+  Histogram<uint64_t> suspend_all_histogram_ GUARDED_BY(Locks::mutator_lock_);
 
   // Whether or not the current thread suspension is long.
   bool long_suspend_;