Use partial TLAB regions

Instead of having 256K TLAB regions, have 256K TLABs split into
16K regions. This fixes pathological cases with multithreaded
allocation that caused many GCs since each thread reserving
256K would often bump the counter past the GC start threshold. Now
threads only bump the counter every 16K.

System wide results (average of 5 samples on N6P):
Total GC time 60s after starting shell: 45s -> 24s
Average .Heap PSS 60s after starting shell: 57900k -> 58682k

BinaryTrees gets around 5% slower, numbers are noisy.

Boot time: 13.302 -> 12.899 (average of 100 runs)

Bug: 35872915
Bug: 36216292

Test: test-art-host

(cherry picked from commit bf48003fa32d2845f2213c0ba31af6677715662d)

Change-Id: I5ab22420124eeadc0a53519c70112274101dfb39
diff --git a/runtime/gc/heap-inl.h b/runtime/gc/heap-inl.h
index 394e541..a50d125 100644
--- a/runtime/gc/heap-inl.h
+++ b/runtime/gc/heap-inl.h
@@ -77,12 +77,11 @@
   size_t bytes_allocated;
   size_t usable_size;
   size_t new_num_bytes_allocated = 0;
-  if (allocator == kAllocatorTypeTLAB || allocator == kAllocatorTypeRegionTLAB) {
+  if (IsTLABAllocator(allocator)) {
     byte_count = RoundUp(byte_count, space::BumpPointerSpace::kAlignment);
   }
   // If we have a thread local allocation we don't need to update bytes allocated.
-  if ((allocator == kAllocatorTypeTLAB || allocator == kAllocatorTypeRegionTLAB) &&
-      byte_count <= self->TlabSize()) {
+  if (IsTLABAllocator(allocator) && byte_count <= self->TlabSize()) {
     obj = self->AllocTlab(byte_count);
     DCHECK(obj != nullptr) << "AllocTlab can't fail";
     obj->SetClass(klass);