ART: Improve VisitStringGetCharsNoCheck intrinsic for compressed strings, using SIMD
The previous implementation of VisitStringGetCharsNoCheck
copied one character at a time for compressed strings
(which use 8 bits per char). Instead, use SIMD instructions
to copy 8 chars at once where possible.
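In effect, the new main loop widens eight 8-bit chars to
16-bit chars per iteration, with a scalar tail for the
remainder. A minimal C sketch with NEON intrinsics
(illustrative only; widen_copy and its parameters are
made-up names, not ART code):

  #include <arm_neon.h>
  #include <stddef.h>
  #include <stdint.h>

  // Widen 8-bit (compressed) chars to 16-bit chars, 8 at a time.
  static void widen_copy(const uint8_t* src, uint16_t* dst, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
      uint8x8_t bytes = vld1_u8(src + i);   // LD1 {v.8b}
      uint16x8_t chars = vmovl_u8(bytes);   // UXTL v.8h, v.8b
      vst1q_u16(dst + i, chars);            // ST1 {v.8h}
    }
    for (; i < n; ++i) {                    // remainder: one char at a time
      dst[i] = src[i];                      // LDRB + STRH in the intrinsic
    }
  }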
On a Pixel 3 phone:
Microbenchmarks for getCharsNoCheck on varying string
lengths show a speedup of up to 80% (big cores) and
70% (little cores) on long strings, and around 30% (big)
and 20% (little) on strings of only 8 characters.
The overhead for strings of < 8 characters is ~3%,
and is immediately amortized for strings of 8 or more
characters.
Dhrystone shows a consistent speedup of around 6% (big)
and 4% (little).
The getCharsNoCheck intrinsic is used by the StringBuilder
append() method, which is used by the String concatenation
operator ('+').
Image size change:
Before:
boot-core-libart.oat: 549040
boot.oat: 3789080
boot-framework.oat: 13356576
After:
boot-core-libart.oat: 549024 (-16B)
boot.oat: 3789144 (+64B)
boot-framework.oat: 13356576 (+0B)
Test: test_art_target.sh, test_art_host.sh
Test: 536-checker-intrinsic-optimization
Change-Id: I865e3df6d4725e151ae195a86e02e090dae8dd29
diff --git a/compiler/optimizing/intrinsics_arm64.cc b/compiler/optimizing/intrinsics_arm64.cc
index 1fab712..185d487 100644
--- a/compiler/optimizing/intrinsics_arm64.cc
+++ b/compiler/optimizing/intrinsics_arm64.cc
@@ -1960,7 +1960,8 @@
Register tmp2 = temps.AcquireX();
vixl::aarch64::Label done;
- vixl::aarch64::Label compressed_string_loop;
+ vixl::aarch64::Label compressed_string_vector_loop;
+ vixl::aarch64::Label compressed_string_remainder;
__ Sub(num_chr, srcEnd, srcBegin);
// Early out for valid zero-length retrievals.
__ Cbz(num_chr, &done);
@@ -2013,16 +2014,39 @@
__ B(&done);
if (mirror::kUseStringCompression) {
+ // For compressed strings, acquire a SIMD temporary register.
+ FPRegister vtmp1 = temps.AcquireVRegisterOfSize(kQRegSize);
const size_t c_char_size = DataType::Size(DataType::Type::kInt8);
DCHECK_EQ(c_char_size, 1u);
__ Bind(&compressed_string_preloop);
__ Add(src_ptr, src_ptr, Operand(srcBegin));
- // Copy loop for compressed src, copying 1 character (8-bit) to (16-bit) at a time.
- __ Bind(&compressed_string_loop);
+
+ // Save repairing the value of num_chr on the < 8 character path.
+ __ Subs(tmp1, num_chr, 8);
+ __ B(lt, &compressed_string_remainder);
+
+ // Keep the result of the earlier subs, we are going to fetch at least 8 characters.
+ __ Mov(num_chr, tmp1);
+
+ // Main loop for compressed src, copying 8 characters (8-bit) to (16-bit) at a time.
+ // Uses SIMD instructions.
+ __ Bind(&compressed_string_vector_loop);
+ __ Ld1(vtmp1.V8B(), MemOperand(src_ptr, c_char_size * 8, PostIndex));
+ __ Subs(num_chr, num_chr, 8);
+ __ Uxtl(vtmp1.V8H(), vtmp1.V8B());
+ __ St1(vtmp1.V8H(), MemOperand(dst_ptr, char_size * 8, PostIndex));
+ __ B(ge, &compressed_string_vector_loop);
+
+ __ Adds(num_chr, num_chr, 8);
+ __ B(eq, &done);
+
+ // Loop for < 8 character case and remainder handling with a compressed src.
+ // Copies 1 character (8-bit) to (16-bit) at a time.
+ __ Bind(&compressed_string_remainder);
__ Ldrb(tmp1, MemOperand(src_ptr, c_char_size, PostIndex));
__ Strh(tmp1, MemOperand(dst_ptr, char_size, PostIndex));
__ Subs(num_chr, num_chr, Operand(1));
- __ B(gt, &compressed_string_loop);
+ __ B(gt, &compressed_string_remainder);
}
__ Bind(&done);