From 41de45060710d64b671a0fa001ec187df221359d Mon Sep 17 00:00:00 2001
From: Vladimir Marko <vmarko@google.com>
Date: Tue, 21 May 2019 10:00:15 +0100
Subject: StringBuilder append pattern for float/double.

Results for added benchmarks on blueline-userdebug with cpu
frequencies fixed at 1420800 (cpus 0-3; little) and 1459200
(cpus 4-7; big):
32-bit little (--variant=X32 --invoke-with 'taskset 0f')
  timeAppendStringAndDouble: ~1260ns -> ~970ns
  timeAppendStringAndFloat: ~1250ns -> ~940ns
  timeAppendStringAndHugeDouble: ~4700ns -> ~4690ns (noise)
  timeAppendStringAndHugeFloat: ~3400ns -> ~3300ns (noise)
  timeAppendStringDoubleStringAndFloat: ~1980ns -> ~1550ns
64-bit little (--variant=X64 --invoke-with 'taskset 0f')
  timeAppendStringAndDouble: ~1260ns -> ~970ns
  timeAppendStringAndFloat: ~1260ns -> ~940ns
  timeAppendStringAndHugeDouble: ~4700ns -> ~4800ns (noise)
  timeAppendStringAndHugeFloat: ~3300ns -> ~3400ns (noise)
  timeAppendStringDoubleStringAndFloat: ~1970ns -> ~1550ns
32-bit big (--variant=X32 --invoke-with 'taskset f0')
  timeAppendStringAndDouble: ~580ns -> ~450ns
  timeAppendStringAndFloat: ~590ns -> ~430ns
  timeAppendStringAndHugeDouble: ~2500ns -> ~2100ns (noise)
  timeAppendStringAndHugeFloat: ~1500ns -> ~1300ns (noise)
  timeAppendStringDoubleStringAndFloat: ~880ns -> ~730ns
64-bit big (--variant=X64 --invoke-with 'taskset f0')
  timeAppendStringAndDouble: ~590ns -> ~450ns
  timeAppendStringAndFloat: ~590ns -> ~430ns
  timeAppendStringAndHugeDouble: ~2300ns -> ~2300ns (noise)
  timeAppendStringAndHugeFloat: ~1500ns -> ~1300ns (noise)
  timeAppendStringDoubleStringAndFloat: ~870ns -> ~730ns

The `timeAppendStringAnd{Double,Float}` benchmarks show very
nice improvements, roughly 25% on both little and big cores.
The `timeAppendStringDoubleStringAndFloat` also shows decent
improvements, over 20% on little and over 15% on big cores.
(These benchmarks test the best-case scenario for "before"
as the StringBuilder's internal buffer is not reallocated.)
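
As a sanity check on the percentages quoted above, here is a
minimal C++ sketch that recomputes them from the 32-bit
timings. The helper name `ImprovementPercent` is made up for
illustration and is not part of this change; the inputs are
the approximate numbers from the tables, so the results are
only as precise as those.

```cpp
// Hypothetical helper, not part of ART: relative improvement in
// percent when a timing drops from `before_ns` to `after_ns`.
constexpr double ImprovementPercent(double before_ns, double after_ns) {
  return (before_ns - after_ns) / before_ns * 100.0;
}

// timeAppendStringAndDouble, 32-bit: "roughly 25%" on both core types.
static_assert(ImprovementPercent(1260.0, 970.0) > 20.0, "little");
static_assert(ImprovementPercent(580.0, 450.0) > 20.0, "big");

// timeAppendStringDoubleStringAndFloat, 32-bit:
// over 20% on little cores, over 15% on big cores.
static_assert(ImprovementPercent(1980.0, 1550.0) > 20.0, "little");
static_assert(ImprovementPercent(880.0, 730.0) > 15.0, "big");
```

The checks are `static_assert`s, so the sketch verifies itself
at compile time without needing a `main`.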

The `timeAppendStringAndHuge{Double,Float}` results are too
noisy to draw any conclusions (especially on little cores
but there is still too much noise on big cores as well).

There are also small regressions for existing benchmarks
`timeAppend{LongStrings,StringAndInt,Strings}` but these
non-FP regressions may be mitigated after updating the
ThinLTO profile.

There is also an opportunity to optimize the calls back
to managed code for known shorty (in this change we use
"LD" and "LF") by using a dedicated stub instead of going
through the generic invoke stub.

Boot image size changes are insignificant (few matches).

Test: Added tests to 697-checker-string-append
Test: m test-art-host-gtest
Test: testrunner.py --host --optimizing
Test: testrunner.py --target --optimizing
Bug: 19575890
Change-Id: I9cf38c2d615a0a2b14255d18588a694d8870aae5
---
 compiler/optimizing/nodes.h | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

(limited to 'compiler/optimizing/nodes.h')

diff --git a/compiler/optimizing/nodes.h b/compiler/optimizing/nodes.h
index 591087bae6..cbb55918cf 100644
--- a/compiler/optimizing/nodes.h
+++ b/compiler/optimizing/nodes.h
@@ -7503,14 +7503,17 @@ class HStringBuilderAppend final : public HVariableInputSizeInstruction {
  public:
   HStringBuilderAppend(HIntConstant* format,
                        uint32_t number_of_arguments,
+                       bool has_fp_args,
                        ArenaAllocator* allocator,
                        uint32_t dex_pc)
       : HVariableInputSizeInstruction(
             kStringBuilderAppend,
             DataType::Type::kReference,
-            // The runtime call may read memory from inputs. It never writes outside
-            // of the newly allocated result object (or newly allocated helper objects).
-            SideEffects::AllReads().Union(SideEffects::CanTriggerGC()),
+            SideEffects::CanTriggerGC().Union(
+                // The runtime call may read memory from inputs. It never writes outside
+                // of the newly allocated result object or newly allocated helper objects,
+                // except for float/double arguments where we reuse thread-local helper objects.
+                has_fp_args ? SideEffects::AllWritesAndReads() : SideEffects::AllReads()),
             dex_pc,
             allocator,
             number_of_arguments + /* format */ 1u,
-- 
cgit v1.2.3-59-g8ed1b