X86/X86_64: Switch to locked add from mfence

I finally received answers about the performance of locked add vs.
mfence for Java memory semantics.  Locked add has been faster than
mfence on all processors since the Pentium 4.  Accordingly, I have made
the synchronization use locked add at all times and removed the
instruction set feature that previously selected between the two.
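
For reference, a minimal standalone sketch of the two fence idioms
(illustrative only, not part of this change; assumes GCC/Clang extended
inline asm on x86-64, and the function names are placeholders):

    // StoreLoad fence via a locked add of zero to the top of the stack.
    // The locked read-modify-write drains the store buffer much like
    // mfence, but does not order non-temporal stores or device memory.
    static inline void FenceLockedAdd() {
      __asm__ __volatile__("lock addl $0, (%%rsp)" ::: "memory", "cc");
    }

    // StoreLoad fence via mfence, which additionally orders
    // non-temporal (weakly ordered) stores.
    static inline void FenceMfence() {
      __asm__ __volatile__("mfence" ::: "memory");
    }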

Also add support in the optimizing compiler for barrier type
kNTStoreStore, which is used after non-temporal moves.
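
Non-temporal stores are weakly ordered and are not ordered by a locked
add to an unrelated location, which is why the non_temporal path keeps
mfence.  A rough sketch with SSE2 intrinsics (illustrative only, not
part of this change; the Publish/slot/flag names are made up):

    #include <emmintrin.h>

    void Publish(int* slot, volatile int* flag) {
      _mm_stream_si32(slot, 42);  // non-temporal (streaming) store
      _mm_mfence();               // kNTStoreStore point: order the
                                  // streaming store before the flag
      *flag = 1;                  // ordinary store seen by readers
    }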

Change-Id: Ib47c2fd64c2ff2128ad677f1f39c73444afb8e94
Signed-off-by: Mark Mendell <mark.p.mendell@intel.com>
diff --git a/compiler/optimizing/code_generator_x86_64.h b/compiler/optimizing/code_generator_x86_64.h
index d7ce7c6..ce805cf 100644
--- a/compiler/optimizing/code_generator_x86_64.h
+++ b/compiler/optimizing/code_generator_x86_64.h
@@ -509,10 +509,10 @@
 
   // Ensure that prior stores complete to memory before subsequent loads.
   // The locked add implementation will avoid serializing device memory, but will
-  // touch (but not change) the top of the stack. The locked add should not be used for
-  // ordering non-temporal stores.
+  // touch (but not change) the top of the stack.
+  // The 'non_temporal' parameter should be used to ensure ordering of non-temporal stores.
-  void MemoryFence(bool force_mfence = false) {
-    if (!force_mfence && isa_features_.PrefersLockedAddSynchronization()) {
+  void MemoryFence(bool non_temporal = false) {
+    if (!non_temporal) {
       assembler_.lock()->addl(Address(CpuRegister(RSP), 0), Immediate(0));
     } else {
       assembler_.mfence();