From 7c39c86b17c91e651de1fcc0876fe5565e3f5204 Mon Sep 17 00:00:00 2001 From: Hans Boehm Date: Tue, 19 Jul 2022 09:07:02 -0700 Subject: Thread suspension cleanup and deadlock fix Have SuspendAll check that no other SuspendAlls are running, and have it refuse to start if the invoking thread is also being suspended. This limits us to a single SuspendAll call at a time, but that was almost required anyway. It limits us to a single active SuspendAll-related suspend barrier, a large simplification. It appears to me that this avoids some cyclic suspension scenarios that were previously still possible. Move the deadlock-avoidance checks for flip_function to ModifySuspendCount callers instead of failing there. Make single-thread suspension use suspend barriers to avoid the complexity of having to reaccess the thread data structure from another thread while waiting for it to suspend. Add a new data structure to remember the single thread suspension barriers without allocating memory, This uses a linked list of stack allocated data structures, as in MCS locks. The combination of the above avoids a suspend_barrier data structure that can overflow, and removes any possibility of ModifySuspendCount needing to fail and retry. Recombine ModifySuspendCount and ModifySuspendCountInternal. Simplified barrier decrement in PassActiveSuspendBarriers. Strengthened the relaxed memory order, it was probably wrong. Fix the "ignored" logic in SuspendAllInternal. We only ever ignored self, and ResumeAll didn't support anything else anyway. Explicitly assume that the initiating thread, if not null, is registered. Have SuspendAll and friends only ignore self, which was the only actually used case anyway, and ResumeAll was otherwise wrong. Make flip_function atomic<>, since it could be read while being cleared. Remove the poorly used timed_out parameter from the SuspendThreadByX(). Make IsSuspended read with acquire semantics; we often count on having the target thread suspended after that, including having its prior effects on the Java state visible. The TransitionTo... functions already use acquire/release. Shrink the retry loop in RequestSynchronousCheckpoint. Retrying the whole loop appeared to have no benefit over the smaller loop. Clarify the behavior of RunCheckpoint with respect to the mutator lock. Split up ModifySuspendCount into IncrementSuspendCount and DecrementSuspendCount for improved clarity. This is not quite a semantic no-op since it eliminates some redundant work when decrementing a suspend count to a nonzero value. (Thanks to Mythri for the suggestion.) I could not convince myself that RequestCheckpoint returned false only if the target thread was already suspended; there seemed to be no easy way to preclude the state_and_flags compare-exchange failing for other reasons. Yet callers seemed to assume that property. Change the implementation to make that property clearly true. Various trivial cleanups. This hopefully reduces thread suspension deadlocks in general. We've seen a bunch of other bugs that may have been due to the cyclic suspension issues. At least this should make bug diagnosis easier. Test: ./art/test/testrunner/testrunner.py --host --64 -b Test: Build and boot AOSP Bug: 240742796 Bug: 203363895 Bug: 238032384 Change-Id: Ifc2358dd6489c0b92f4673886c98e45974134941 --- runtime/entrypoints_order_test.cc | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) (limited to 'runtime/entrypoints_order_test.cc') diff --git a/runtime/entrypoints_order_test.cc b/runtime/entrypoints_order_test.cc index 2cd58dbf1b..8821aed259 100644 --- a/runtime/entrypoints_order_test.cc +++ b/runtime/entrypoints_order_test.cc @@ -111,13 +111,15 @@ class EntrypointsOrderTest : public CommonRuntimeTest { sizeof(void*)); EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, last_no_thread_suspension_cause, checkpoint_function, sizeof(void*)); - EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, checkpoint_function, active_suspend_barriers, + EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, checkpoint_function, active_suspendall_barrier, + sizeof(void*)); + EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, active_suspendall_barrier, active_suspend1_barriers, + sizeof(void*)); + EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, active_suspend1_barriers, thread_local_pos, sizeof(void*)); - EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, active_suspend_barriers, thread_local_start, - sizeof(Thread::tls_ptr_sized_values::active_suspend_barriers)); - EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_start, thread_local_pos, sizeof(void*)); EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_pos, thread_local_end, sizeof(void*)); - EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_end, thread_local_limit, sizeof(void*)); + EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_end, thread_local_start, sizeof(void*)); + EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_start, thread_local_limit, sizeof(void*)); EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_limit, thread_local_objects, sizeof(void*)); EXPECT_OFFSET_DIFFP(Thread, tlsPtr_, thread_local_objects, jni_entrypoints, sizeof(size_t)); -- cgit v1.2.3-59-g8ed1b