author		2022-07-19 09:07:02 -0700
committer	2022-09-09 18:08:43 +0000
commit		7c39c86b17c91e651de1fcc0876fe5565e3f5204 (patch)
tree		eaa145c61c1cc65cabeeb5496cd73e4fa45f9887 /runtime/thread_list.h
parent		3d0d9ab345998eb64b55bdd254c2d001ceec78c6 (diff)
Thread suspension cleanup and deadlock fix
Have SuspendAll check that no other SuspendAlls are running, and have
it refuse to start if the invoking thread is also being suspended.
This limits us to a single SuspendAll call at a time,
but that was almost required anyway. It limits us to a single active
SuspendAll-related suspend barrier, a large simplification.
It appears to me that this avoids some cyclic suspension scenarios
that were previously still possible.
Move the deadlock-avoidance checks for flip_function to
ModifySuspendCount callers instead of failing there.
Make single-thread suspension use suspend barriers to avoid the
complexity of having to reaccess the thread data structure from another
thread while waiting for it to suspend. Add a new data structure to
remember the single thread suspension barriers without allocating
memory. This uses a linked list of stack-allocated data structures,
as in MCS locks.
The combination of the above avoids a suspend_barrier data structure
that can overflow, and removes any possibility of ModifySuspendCount
needing to fail and retry. Recombine ModifySuspendCount and
ModifySuspendCountInternal.
Simplify the barrier decrement in PassActiveSuspendBarriers, and strengthen
its relaxed memory order, which was probably wrong.
Fix the "ignored" logic in SuspendAllInternal: we only ever ignored self,
and ResumeAll didn't support anything else anyway, so have SuspendAll and
friends only ignore self. Explicitly assume that the initiating thread, if
not null, is registered.
Make flip_function atomic<>, since it could be read while being cleared.
Remove the poorly used timed_out parameter from the SuspendThreadByX()
functions.
Make IsSuspended read with acquire semantics; we often count on having
the target thread suspended after that, including having its prior
effects on the Java state visible. The TransitionTo... functions already
use acquire/release.
Shrink the retry loop in RequestSynchronousCheckpoint. Retrying the
whole loop appeared to have no benefit over the smaller loop.
Clarify the behavior of RunCheckpoint with respect to the mutator
lock.
Split up ModifySuspendCount into IncrementSuspendCount and
DecrementSuspendCount for improved clarity. This is not quite a
semantic no-op since it eliminates some redundant work when
decrementing a suspend count to a nonzero value. (Thanks to
Mythri for the suggestion.)
I could not convince myself that RequestCheckpoint returned false
only if the target thread was already suspended; there seemed to
be no easy way to preclude the state_and_flags compare-exchange
failing for other reasons. Yet callers seemed to assume that property.
Change the implementation to make that property clearly true.
Various trivial cleanups.
This hopefully reduces thread suspension deadlocks in general.
We've seen a bunch of other bugs that may have been due to the cyclic
suspension issues. At least this should make bug diagnosis easier.
Test: ./art/test/testrunner/testrunner.py --host --64 -b
Test: Build and boot AOSP
Bug: 240742796
Bug: 203363895
Bug: 238032384
Change-Id: Ifc2358dd6489c0b92f4673886c98e45974134941
Diffstat (limited to 'runtime/thread_list.h')
-rw-r--r--	runtime/thread_list.h | 29
1 file changed, 16 insertions, 13 deletions
diff --git a/runtime/thread_list.h b/runtime/thread_list.h
index 29b0c52186..a17a11b3ce 100644
--- a/runtime/thread_list.h
+++ b/runtime/thread_list.h
@@ -70,7 +70,7 @@ class ThreadList {
   bool Resume(Thread* thread, SuspendReason reason = SuspendReason::kInternal)
       REQUIRES(!Locks::thread_suspend_count_lock_) WARN_UNUSED;
 
-  // Suspends all threads and gets exclusive access to the mutator lock.
+  // Suspends all other threads and gets exclusive access to the mutator lock.
   // If long_suspend is true, then other threads who try to suspend will never timeout.
   // long_suspend is currenly used for hprof since large heaps take a long time.
   void SuspendAll(const char* cause, bool long_suspend = false)
@@ -81,10 +81,8 @@ class ThreadList {
   // Suspend a thread using a peer, typically used by the debugger. Returns the thread on success,
   // else null. The peer is used to identify the thread to avoid races with the thread terminating.
-  // If the suspension times out then *timeout is set to true.
   Thread* SuspendThreadByPeer(jobject peer,
-                              SuspendReason reason,
-                              bool* timed_out)
+                              SuspendReason reason)
       REQUIRES(!Locks::mutator_lock_,
                !Locks::thread_list_lock_,
                !Locks::thread_suspend_count_lock_);
@@ -92,8 +90,8 @@ class ThreadList {
   // Suspend a thread using its thread id, typically used by lock/monitor inflation. Returns the
   // thread on success else null. The thread id is used to identify the thread to avoid races with
   // the thread terminating. Note that as thread ids are recycled this may not suspend the expected
-  // thread, that may be terminating. If the suspension times out then *timeout is set to true.
-  Thread* SuspendThreadByThreadId(uint32_t thread_id, SuspendReason reason, bool* timed_out)
+  // thread, that may be terminating.
+  Thread* SuspendThreadByThreadId(uint32_t thread_id, SuspendReason reason)
       REQUIRES(!Locks::mutator_lock_,
                !Locks::thread_list_lock_,
                !Locks::thread_suspend_count_lock_);
@@ -113,8 +111,9 @@ class ThreadList {
   // Running threads are not suspended but run the checkpoint inside of the suspend check. The
   // return value includes already suspended threads for b/24191051. Runs or requests the
   // callback, if non-null, inside the thread_list_lock critical section after determining the
-  // runnable/suspended states of the threads. Does not wait for completion of the callbacks in
-  // running threads.
+  // runnable/suspended states of the threads. Does not wait for completion of the checkpoint
+  // function in running threads. If the caller holds the mutator lock, then all instances of the
+  // checkpoint function are run with the mutator lock.
   size_t RunCheckpoint(Closure* checkpoint_function, Closure* callback = nullptr)
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
@@ -197,12 +196,15 @@ class ThreadList {
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
 
   void SuspendAllInternal(Thread* self,
-                          Thread* ignore1,
-                          Thread* ignore2 = nullptr,
                           SuspendReason reason = SuspendReason::kInternal)
-      REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
+      REQUIRES(!Locks::thread_list_lock_,
+               !Locks::thread_suspend_count_lock_,
+               !Locks::mutator_lock_);
+
+  // Wait for suspend barrier to reach zero. Return false on timeout.
+  bool WaitForSuspendBarrier(AtomicInteger* barrier);
 
-  void AssertThreadsAreSuspended(Thread* self, Thread* ignore1, Thread* ignore2 = nullptr)
+  void AssertOtherThreadsAreSuspended(Thread* self)
       REQUIRES(!Locks::thread_list_lock_, !Locks::thread_suspend_count_lock_);
 
   std::bitset<kMaxThreadId> allocated_ids_ GUARDED_BY(Locks::allocated_thread_ids_lock_);
@@ -211,6 +213,7 @@ class ThreadList {
   std::list<Thread*> list_ GUARDED_BY(Locks::thread_list_lock_);
 
   // Ongoing suspend all requests, used to ensure threads added to list_ respect SuspendAll.
+  // The value is always either 0 or 1.
   int suspend_all_count_ GUARDED_BY(Locks::thread_suspend_count_lock_);
 
   // Number of threads unregistering, ~ThreadList blocks until this hits 0.
@@ -218,7 +221,7 @@ class ThreadList {
   // Thread suspend time histogram. Only modified when all the threads are suspended, so guarding
   // by mutator lock ensures no thread can read when another thread is modifying it.
-  Histogram<uint64_t> suspend_all_historam_ GUARDED_BY(Locks::mutator_lock_);
+  Histogram<uint64_t> suspend_all_histogram_ GUARDED_BY(Locks::mutator_lock_);
 
   // Whether or not the current thread suspension is long.
   bool long_suspend_;