From 252254b130067cd7a5071865e793966871ae0246 Mon Sep 17 00:00:00 2001
From: buzbee
Date: Sun, 8 Sep 2013 16:20:53 -0700
Subject: More Quick compile-time tuning: labels & branches

This CL represents a roughly 3.5% performance improvement for the
compile phase of dex2oat.  Most of the gain comes from avoiding the
generation of dex boundary LIR labels unless a debug listing is
requested.  The other significant change is moving from a basic block
ending branch model of "always generate a fall-through branch, and then
delete it if we can" to a "only generate a fall-through branch if we
need it" model.

The data motivating these changes follow.  Note that two areas of
potentially attractive gain remain: restructuring the assembler model
and reworking the register handling utilities.  These will be addressed
in subsequent CLs.

--- data follows

The Quick compiler's assembler has shown up on profile reports a bit
more than seems reasonable.  We've tried a few quick fixes to
apparently hot portions of the code, but without much gain.  So, I've
been looking at the assembly process at a somewhat higher level.  There
look to be several potentially good opportunities.

First, an analysis of the makeup of the LIR graph showed a surprisingly
high proportion of LIR pseudo ops.  Using the boot classpath as a
basis, we get:

  32.8% of all LIR nodes are pseudo ops.
  10.4% are LIR instructions which require pc-relative fixups.
  11.8% are LIR instructions that have been nop'd by the various
        optimization passes.

Looking only at the LIR pseudo ops, we get:

  kPseudoDalvikByteCodeBoundary  43.46%
  kPseudoNormalBlockLabel        21.14%
  kPseudoSafepointPC             20.20%
  kPseudoThrowTarget              6.94%
  kPseudoTarget                   3.03%
  kPseudoSuspendTarget            1.95%
  kPseudoMethodExit               1.26%
  kPseudoMethodEntry              1.26%
  kPseudoExportedPC               0.37%
  kPseudoCaseLabel                0.30%
  kPseudoBarrier                  0.07%
  kPseudoIntrinsicRetry           0.02%

  Total LIR count: 10167292

The standout here is the Dalvik opcode boundary marker.  This is just a
label inserted at the beginning of the codegen for each Dalvik
bytecode.  If we're also doing a verbose listing, this is also where we
hang the pretty-print disassembly string.  However, this label was also
being used as a convenient way to find the target of switch case
statements (and, I think, was at one point used in the Mir->GBC
conversion process).

This CL moves the use of kPseudoDalvikByteCodeBoundary labels to only
verbose listing runs, and replaces the codegen uses of the label with
the kPseudoNormalBlockLabel attached to the basic block that contains
the switch case target.

Great savings here - a 14.3% reduction in the number of LIR nodes
needed.  After this CL, our LIR pseudo proportions drop to 21.6% of all
LIR.  That's still a lot, but much better.  Possible further
improvements: combine normal labels with kPseudoSafepointPC labels
where appropriate, and perhaps reduce memory usage by using a
short-hand form for labels rather than a full LIR node.  Also, many of
the basic block labels are no longer branch targets by the time we get
to assembly - cheaper to delete, or just ignore?
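As a purely illustrative sketch of the switch-target change described
above - SwitchTargetLabel is a hypothetical helper name; the real
change is in InsertCaseLabel in the diff below:

  /* Sketch only: resolve a switch case target without a per-bytecode
   * boundary label, by reusing the kPseudoNormalBlockLabel of the
   * basic block that contains the Dalvik offset. */
  LIR* Mir2Lir::SwitchTargetLabel(int vaddr) {
    // Before this CL: LIR* label = boundary_map_.Get(vaddr);
    BasicBlock* bb = mir_graph_->FindBlock(vaddr);
    return &block_label_list_[bb->id];
  }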
Here's the "after" LIR pseudo op breakdown: kPseudoNormalBlockLabel 37.39% kPseudoSafepointPC 35.72% kPseudoThrowTarget 12.28% kPseudoTarget 5.36% kPseudoSuspendTarget 3.45% kPseudoMethodEntry 2.24% kPseudoMethodExit 2.22% kPseudoExportedPC 0.65% kPseudoCaseLabel 0.53% kPseudoBarrier 0.12% kPseudoIntrinsicRetry 0.04% Total LIR count: 5748232 Not done in this CL, but it will be worth experimenting with actually deleting LIR nodes from the graph when they are optimized away, rather than just setting the NOP bit. Keeping them around is invaluable during debugging - but when not debugging it may pay off if the cost of node removal is less than the cost of traversing through dead nodes in subsequent passes. Next up (and partially in this CL - but mostly to be done in follow-on CLs) is the overall assembly process. Inherited from the trace JIT, the Quick compiler has a fairly simple-minded approach to instruction assembly. First, a pass is made over the LIR list to assign offsets to each instruction. Then, the assembly pass is made - which generates the actual machine instruction bit patterns and pushes the instruction data into the code_buffer. However, the code generator takes the "always optimistic" approach to instruction selection and emits the shortest instruction. If, during assembly, we find that a branch or load doesn't reach, that short-form instruction is replaces with a longer sequence. Of course, this invalidates the previously-computed offset calculations. Assembly thus is an iterative process: compute offsets and then assemble until we survive an assembly pass without invalidation. This seems like a likely candidate for improvement. First, I analyzed the number of retries required, and the reason for invalidation over the boot classpath load. The results: more than half of methods don't require a retry, and very few require more than 1 extra pass: 5 or more: 6 of 96334 4 or more: 22 of 96334 3 or more: 140 of 96334 2 or more: 1794 of 96334 - 2% 1 or more: 40911 of 96334 - 40% 0 retries: 55423 of 96334 - 58% The interesting group here is the one that requires 1 retry. Looking at the reason, we see three typical reasons: 1. A cbnz/cbz doesn't reach (only 7 bits of offset) 2. A 16-bit Thumb1 unconditional branch doesn't reach. 3. An unconditional branch which branches to the next instruction is encountered, and deleted. The first 2 cases are the cost of the optimistic strategy - nothing much to change there. However, the interesting case is #3 - dead branch elimination. A further analysis of the single retry group showed that 42% of the methods (16305) that required a single retry did so *only* because of dead branch elimination. The big question here is why so many dead branches survive to the assembly stage. We have a dead branch elimination pass which is supposed to catch these - perhaps it's not working correctly, should be moved later in the optimization process, or perhaps run multiple times. Other things to consider: o Combine the offset generation pass with the assembly pass. Skip pc-relative fixup assembly (other than assigning offset), but push LIR* for them into work list. Following the main pass, zip through the work list and assemble the pc-relative instructions (now that we know the offsets). This would significantly cut back on traversal costs. o Store the assembled bits into both the code buffer and the LIR. 
Change-Id: I50029d216fa14f273f02b6f1c8b6a0dde5a7d6a6
---
 compiler/dex/quick/codegen_util.cc | 54 +++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 24 deletions(-)

diff --git a/compiler/dex/quick/codegen_util.cc b/compiler/dex/quick/codegen_util.cc
index e9db6a68d0..3aa2f647c2 100644
--- a/compiler/dex/quick/codegen_util.cc
+++ b/compiler/dex/quick/codegen_util.cc
@@ -55,9 +55,32 @@ bool Mir2Lir::FastInstance(uint32_t field_idx, bool is_put, int* field_offset, b
       field_idx, mir_graph_->GetCurrentDexCompilationUnit(), is_put, field_offset, is_volatile);
 }
 
+/* Remove a LIR from the list. */
+void Mir2Lir::UnlinkLIR(LIR* lir) {
+  if (UNLIKELY(lir == first_lir_insn_)) {
+    first_lir_insn_ = lir->next;
+    if (lir->next != NULL) {
+      lir->next->prev = NULL;
+    } else {
+      DCHECK(lir->next == NULL);
+      DCHECK(lir == last_lir_insn_);
+      last_lir_insn_ = NULL;
+    }
+  } else if (lir == last_lir_insn_) {
+    last_lir_insn_ = lir->prev;
+    lir->prev->next = NULL;
+  } else if ((lir->prev != NULL) && (lir->next != NULL)) {
+    lir->prev->next = lir->next;
+    lir->next->prev = lir->prev;
+  }
+}
+
 /* Convert an instruction to a NOP */
 void Mir2Lir::NopLIR(LIR* lir) {
   lir->flags.is_nop = true;
+  if (!cu_->verbose) {
+    UnlinkLIR(lir);
+  }
 }
 
 void Mir2Lir::SetMemRefType(LIR* lir, bool is_load, int mem_type) {
@@ -782,20 +805,16 @@ void Mir2Lir::AssembleLIR() {
 /*
  * Insert a kPseudoCaseLabel at the beginning of the Dalvik
  * offset vaddr.  This label will be used to fix up the case
- * branch table during the assembly phase.  Be sure to set
- * all resource flags on this to prevent code motion across
- * target boundaries.  KeyVal is just there for debugging.
+ * branch table during the assembly phase.  All resource flags
+ * are set to prevent code motion.  KeyVal is just there for debugging.
  */
 LIR* Mir2Lir::InsertCaseLabel(int vaddr, int keyVal) {
-  SafeMap<unsigned int, LIR*>::iterator it;
-  LIR* boundary_lir = boundary_map_.Get(vaddr);
-  if (boundary_lir == NULL) {
-    LOG(FATAL) << "Error: didn't find vaddr 0x" << std::hex << vaddr;
-  }
+  LIR* boundary_lir = &block_label_list_[mir_graph_->FindBlock(vaddr)->id];
   LIR* new_label = static_cast<LIR*>(arena_->Alloc(sizeof(LIR), ArenaAllocator::kAllocLIR));
   new_label->dalvik_offset = vaddr;
   new_label->opcode = kPseudoCaseLabel;
   new_label->operands[0] = keyVal;
+  new_label->def_mask = ENCODE_ALL;
   InsertLIRAfter(boundary_lir, new_label);
   return new_label;
 }
@@ -880,18 +899,9 @@ void Mir2Lir::DumpPackedSwitchTable(const uint16_t* table) {
   }
 }
 
-/*
- * Set up special LIR to mark a Dalvik byte-code instruction start and
- * record it in the boundary_map.  NOTE: in cases such as kMirOpCheck in
- * which we split a single Dalvik instruction, only the first MIR op
- * associated with a Dalvik PC should be entered into the map.
- */
-LIR* Mir2Lir::MarkBoundary(int offset, const char* inst_str) {
-  LIR* res = NewLIR1(kPseudoDalvikByteCodeBoundary, reinterpret_cast<uintptr_t>(inst_str));
-  if (boundary_map_.Get(offset) == NULL) {
-    boundary_map_.Put(offset, res);
-  }
-  return res;
+/* Set up special LIR to mark a Dalvik byte-code instruction start for pretty printing */
+void Mir2Lir::MarkBoundary(int offset, const char* inst_str) {
+  NewLIR1(kPseudoDalvikByteCodeBoundary, reinterpret_cast<uintptr_t>(inst_str));
 }
 
 bool Mir2Lir::EvaluateBranch(Instruction::Code opcode, int32_t src1, int32_t src2) {
@@ -946,7 +956,6 @@ Mir2Lir::Mir2Lir(CompilationUnit* cu, MIRGraph* mir_graph, ArenaAllocator* arena
       throw_launchpads_(arena, 2048, kGrowableArrayThrowLaunchPads),
       suspend_launchpads_(arena, 4, kGrowableArraySuspendLaunchPads),
       intrinsic_launchpads_(arena, 2048, kGrowableArrayMisc),
-      boundary_map_(arena, 0, kGrowableArrayMisc),
      data_offset_(0),
      total_size_(0),
      block_label_list_(NULL),
@@ -963,8 +972,6 @@ Mir2Lir::Mir2Lir(CompilationUnit* cu, MIRGraph* mir_graph, ArenaAllocator* arena
   promotion_map_ = static_cast<PromotionMap*>
       (arena_->Alloc((cu_->num_dalvik_registers  + cu_->num_compiler_temps + 1) *
                      sizeof(promotion_map_[0]), ArenaAllocator::kAllocRegAlloc));
-  // Pre-fill with nulls.
-  boundary_map_.SetSize(cu->code_item->insns_size_in_code_units_);
 }
 
 void Mir2Lir::Materialize() {
@@ -1091,5 +1098,4 @@ void Mir2Lir::InsertLIRAfter(LIR* current_lir, LIR* new_lir) {
   new_lir->next->prev = new_lir;
 }
 
-
 }  // namespace art