From b9f02c2f8624bbf0746939e3b2735a1537a567b6 Mon Sep 17 00:00:00 2001
From: Usama Arif
Date: Fri, 25 Oct 2019 17:37:33 +0100
Subject: ARM64: FP16.floor() intrinsic for ARMv8

This CL implements an intrinsic for the floor() method with ARMv8.2 FP16
instructions.

This intrinsic calls a template GenerateFP16Round function, which will be
used to implement other intrinsics such as ceil and rint.

This intrinsic implementation achieves bit-level compatibility with the
original Java implementation android.util.Half.floor().

The time required in milliseconds to execute the code below on Pixel3:
- Java implementation android.util.Half.floor():
    - big cluster only: 18623
    - little cluster only: 60424
- arm64 intrinsic implementation:
    - big cluster only: 14213 (~24% faster)
    - little cluster only: 54398 (~10% faster)

Analysis of this function with simpleperf showed that only about 60-65%
of the time is spent in libcore.util.FP16.floor, so the improvement in
floor() itself is likely to be larger than the end-to-end numbers stated
above.

Another reason the performance improvement with the intrinsic is lower
than expected is that the Java implementation needs only a few
instructions for values between -1 and 1 (abs < 0x3c00) and should give
almost the same performance as the intrinsic in this case. In the
benchmark function below, 46.8% of the values tested are between -1 and 1.
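
The 46.8% share can be checked independently. The following is a minimal
sketch, not part of the CL: the class name Fp16RangeShare and the method
countBelowOne are hypothetical, and it assumes only what the message
states, namely that inputs with abs < 0x3c00 lie strictly between -1 and 1.
It walks the same short range as the benchmark's inner loop and counts
those inputs.

```java
public class Fp16RangeShare {
    // Count the inputs visited by the benchmark's inner loop whose magnitude
    // bits are below 0x3c00 (the FP16 encoding of 1.0), i.e. -1 < x < 1.
    static int countBelowOne() {
        int count = 0;
        // Same bounds as the benchmark: Short.MAX_VALUE itself is not visited.
        for (int h = Short.MIN_VALUE; h < Short.MAX_VALUE; h++) {
            int abs = h & 0x7fff; // drop the sign bit
            if (abs < 0x3c00) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int total = Short.MAX_VALUE - Short.MIN_VALUE; // 65535 inputs visited
        int below = countBelowOne();
        System.out.println(below + " of " + total + " inputs: "
                + (100.0 * below / total) + "%");
    }
}
```

Over the 65535 inputs the loop visits, this counts 30720 such values,
about 46.9%, consistent with the figure quoted above.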

    public static short benchmarkFloor(){
        short ret = 0;
        long before = 0;
        long after = 0;
        before = System.currentTimeMillis();
        for(int i = 0; i < 50000; i++){
            for (short h = Short.MIN_VALUE; h < Short.MAX_VALUE; h++) {
                ret += FP16.floor(h);
            }
        }
        after = System.currentTimeMillis();
        System.out.println("Time of FP16.floor (ms): " + (after - before));
        System.out.println(ret);
        return ret;
    }

Test: 580-fp16
Test: art/test/testrunner/run_build_test_target.py -j80 art-test-javac
Change-Id: Iad1dd032d456af54932f13c5cf27228f8652a0b5
---
 compiler/optimizing/intrinsics_mips.cc | 1 +
 1 file changed, 1 insertion(+)

(limited to 'compiler/optimizing/intrinsics_mips.cc')

diff --git a/compiler/optimizing/intrinsics_mips.cc b/compiler/optimizing/intrinsics_mips.cc
index b18bbdde2d..0bab2a0b17 100644
--- a/compiler/optimizing/intrinsics_mips.cc
+++ b/compiler/optimizing/intrinsics_mips.cc
@@ -2709,6 +2709,7 @@ UNIMPLEMENTED_INTRINSIC(MIPS, CRC32UpdateBytes)
 UNIMPLEMENTED_INTRINSIC(MIPS, CRC32UpdateByteBuffer)
 UNIMPLEMENTED_INTRINSIC(MIPS, FP16ToFloat)
 UNIMPLEMENTED_INTRINSIC(MIPS, FP16ToHalf)
+UNIMPLEMENTED_INTRINSIC(MIPS, FP16Floor)
 
 UNIMPLEMENTED_INTRINSIC(MIPS, StringStringIndexOf);
 UNIMPLEMENTED_INTRINSIC(MIPS, StringStringIndexOfAfter);
-- 
cgit v1.2.3-59-g8ed1b