7c39c86b17c91e651de1fcc0876fe5565e3f5204 - LeafOS-Project/android_art

commit	7c39c86b17c91e651de1fcc0876fe5565e3f5204	[log] [tgz]
author	Hans Boehm <hboehm@google.com>	Tue Jul 19 09:07:02 2022 -0700
committer	Hans Boehm <hboehm@google.com>	Fri Sep 09 18:08:43 2022 +0000
tree	eaa145c61c1cc65cabeeb5496cd73e4fa45f9887
parent	3d0d9ab345998eb64b55bdd254c2d001ceec78c6 [diff]

Thread suspension cleanup and deadlock fix

Have SuspendAll check that no other SuspendAlls are running, and have
it refuse to start if the invoking thread is also being suspended.
This limits us to a single SuspendAll call at a time,
but that was almost required anyway. It limits us to a single active
SuspendAll-related suspend barrier, a large simplification.
It appears to me that this avoids some cyclic suspension scenarios
that were previously still possible.

Move the deadlock-avoidance checks for flip_function to
ModifySuspendCount callers instead of failing there.

Make single-thread suspension use suspend barriers to avoid the
complexity of having to reaccess the thread data structure from another
thread while waiting for it to suspend. Add a new data structure to
remember the single thread suspension barriers without allocating
memory, This uses a linked list of stack allocated data structures,
as in MCS locks.

The combination of the above avoids a suspend_barrier data structure
that can overflow, and removes any possibility of ModifySuspendCount
needing to fail and retry. Recombine ModifySuspendCount and
ModifySuspendCountInternal.

Simplified barrier decrement in PassActiveSuspendBarriers.
Strengthened the relaxed memory order, it was probably wrong.

Fix the "ignored" logic in SuspendAllInternal. We only ever ignored
self, and ResumeAll didn't support anything else anyway.

Explicitly assume that the initiating thread, if not null, is
registered. Have SuspendAll and friends only ignore self, which was
the only actually used case anyway, and ResumeAll was otherwise wrong.

Make flip_function atomic<>, since it could be read while being cleared.

Remove the poorly used timed_out parameter from the SuspendThreadByX().

Make IsSuspended read with acquire semantics; we often count on having
the target thread suspended after that, including having its prior
effects on the Java state visible. The TransitionTo... functions already
use acquire/release.

Shrink the retry loop in RequestSynchronousCheckpoint. Retrying the
whole loop appeared to have no benefit over the smaller loop.

Clarify the behavior of RunCheckpoint with respect to the mutator
lock.

Split up ModifySuspendCount into IncrementSuspendCount and
DecrementSuspendCount for improved clarity. This is not quite a
semantic no-op since it eliminates some redundant work when
decrementing a suspend count to a nonzero value. (Thanks to
Mythri for the suggestion.)

I could not convince myself that RequestCheckpoint returned false
only if the target thread was already suspended; there seemed to
be no easy way to preclude the state_and_flags compare-exchange
failing for other reasons. Yet callers seemed to assume that property.
Change the implementation to make that property clearly true.

Various trivial cleanups.

This hopefully reduces thread suspension deadlocks in general.
We've seen a bunch of other bugs that may have been due to the cyclic
suspension issues. At least this should make bug diagnosis easier.

Test: ./art/test/testrunner/testrunner.py --host --64  -b
Test: Build and boot AOSP
Bug: 240742796
Bug: 203363895
Bug: 238032384
Change-Id: Ifc2358dd6489c0b92f4673886c98e45974134941

16 files changed