//===----------------------------------------------------------------------===//
// Better Optimization for Structure Copies
//===----------------------------------------------------------------------===//

2/8/2012 - initial revision

Copies of C structures and C++ by-value classes are very common, particularly
for C++ types with value semantics.  A common cause of C++ "abstraction
penalty" is that the optimizer is not able to adequately optimize away
structure temporaries.  This note describes the case where one of these
temporaries is copied, causing the front-end to emit an llvm.memcpy call.

Overall, LLVM does a pretty good job of optimizing away memcpy and memmove
calls.  It can perform dead store elimination, forward propagate values
through these transfers, form them out of scalar stores and loops, and even
shrink them when it can detect that they are partially dead.  However, one
case that it does not handle well is when a struct has padding in it.
Consider the following struct and function:

  struct foo { int x; char y; int a,b,c,d,e,f,g,h,i; };
  struct foo a;

  void test() {
    struct foo tmp = a;
    tmp.b = 4;
    a = tmp;
  }

At -O0, clang compiles this into this straightforward IR:

  %struct.foo = type { i32, i8, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
  @a = common global %struct.foo zeroinitializer, align 4

  define void @test() nounwind uwtable ssp {
  entry:
    %tmp = alloca %struct.foo, align 4
    %0 = bitcast %struct.foo* %tmp to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast (%struct.foo* @a to i8*), i64 44, i32 4, i1 false)
    %b = getelementptr inbounds %struct.foo* %tmp, i32 0, i32 3
    store i32 4, i32* %b, align 4
    %1 = bitcast %struct.foo* %tmp to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast (%struct.foo* @a to i8*), i8* %1, i64 44, i32 4, i1 false)
    ret void
  }

However, things get messy when the optimizer is turned on - we get this code:

  define void @test() nounwind uwtable ssp {
  entry:
    %srcval1 = load i352* bitcast (%struct.foo* @a to i352*), align 4
    %mask = and i352 %srcval1, -340282366841710300949110269838224261121
    %ins = or i352 %mask, 316912650057057350374175801344
    store i352 %ins, i352* bitcast (%struct.foo* @a to i352*), align 4
    ret void
  }

... which the code generator lowers to:

  _test:                                  ## @test
    movq  _a@GOTPCREL(%rip), %rax
    movq  (%rax), %r8
    movq  16(%rax), %rdx
    movq  24(%rax), %rsi
    movq  32(%rax), %rdi
    movl  40(%rax), %ecx
    movl  %ecx, 40(%rax)
    movq  %rdi, 32(%rax)
    movq  %rsi, 24(%rax)
    movq  %rdx, 16(%rax)
    movq  %r8, (%rax)
    movl  $4, 12(%rax)
    ret

Clearly, LLVM should be able to produce something like this (which even GCC
4.2 is able to produce):

  _test:
    movl  $4, 12(%rax)
    ret

What is going on here?  The problem is that the -O0 IR has lost the
information that bytes 5-7 of the structure are structure padding.  Because
the front-end knows that they are undefined, the llvm.memcpy does not need to
transfer them.

When an alloca (like %tmp) is used as the source and destination of memcpy's,
and when it has padding, the SRoA pass fails to scalarize the alloca into its
elements, and therefore has to promote the entire thing into a ridiculous
integer.  We often fail to optimize away parts of these huge integers.

While this particular case could (in principle) be improved, there are other
problems as well.  The code generator ultimately lowers many llvm.memcpy's
into individual loads and stores, e.g. for code like:

  struct foo a, b;
  void test() { a = b; }

Because we have lost information about padding, we actually do a full
transfer of the padding.  This usually isn't a problem, but it can cause
store-to-load forwarding stalls on some processors when a small element was
previously stored as its smaller type.

Finally, many of the other optimizations that hack on memory transfers (e.g.
memdep, DSE, etc.) can benefit from information about padding.
//===----------------------------------------------------------------------===//
// Structure Padding Information in IR
//===----------------------------------------------------------------------===//

The proposed solution to this problem is really straightforward: Clang should
(at -O1 and above) attach an MDNode to structure memcpy instructions that
informs the optimizer and code generator about structure padding.  For
example, the first generated memcpy above could become:

  ...
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast (%struct.foo* @a to i8*), i64 44, i32 4, i1 false), !padding !0
  ...
  !0 = metadata !{ i64 5, i64 8 }

This indicates that bytes 5-7 (the half-open byte range [5, 8)) are undefined
in the memcpy destination.  With this information, the code generator can
avoid copying these bytes when inlining the memcpy, SRoA can cross-reference
the padding of the alloca with the padding defined by the memcpy, DSE can add
this metadata to transfers which are later partially overwritten, etc.