//===----------------------------------------------------------------------===//
// Better Optimization for Structure Copies
//===----------------------------------------------------------------------===//

2/8/2012 - initial revision

Copies of C structures and C++ by-value classes are very common, particularly
for C++ types with value semantics.  A common cause of C++ "abstraction
penalty" is that the optimizer is not able to adequately optimize away
structure temporaries.  This note describes the case where one of these
temporaries is copied, causing the front-end to emit an llvm.memcpy call.

Overall, LLVM does a pretty good job of optimizing away memcpy and memmove
calls.  It can perform dead store elimination, forward propagate values
through these transfers, form them out of scalar stores and loops, and even
shrink them when it can detect that they are partially dead.  However, one
case that it does not handle well is when a struct has padding in it.
Consider the following struct and function:

  struct foo { int x; char y; int a,b,c,d,e,f,g,h,i; };
  struct foo a;

  void test() {
    struct foo tmp = a;
    tmp.b = 4;
    a = tmp;
  }

At -O0, clang compiles this into this straightforward IR:

  %struct.foo = type { i32, i8, i32, i32, i32, i32, i32, i32, i32, i32, i32 }
  @a = common global %struct.foo zeroinitializer, align 4

  define void @test() nounwind uwtable ssp {
  entry:
    %tmp = alloca %struct.foo, align 4
    %0 = bitcast %struct.foo* %tmp to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast (%struct.foo* @a to i8*), i64 44, i32 4, i1 false)
    %b = getelementptr inbounds %struct.foo* %tmp, i32 0, i32 3
    store i32 4, i32* %b, align 4
    %1 = bitcast %struct.foo* %tmp to i8*
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast (%struct.foo* @a to i8*), i8* %1, i64 44, i32 4, i1 false)
    ret void
  }

However, things get messy when the optimizer is turned on - we get this code:

  define void @test() nounwind uwtable ssp {
  entry:
    %srcval1 = load i352* bitcast (%struct.foo* @a to i352*), align 4
    %mask = and i352 %srcval1, -340282366841710300949110269838224261121
    %ins = or i352 %mask, 316912650057057350374175801344
    store i352 %ins, i352* bitcast (%struct.foo* @a to i352*), align 4
    ret void
  }

... which the code generator lowers to:

  _test:                                  ## @test
    movq  _a@GOTPCREL(%rip), %rax
    movq  (%rax), %r8
    movq  16(%rax), %rdx
    movq  24(%rax), %rsi
    movq  32(%rax), %rdi
    movl  40(%rax), %ecx
    movl  %ecx, 40(%rax)
    movq  %rdi, 32(%rax)
    movq  %rsi, 24(%rax)
    movq  %rdx, 16(%rax)
    movq  %r8, (%rax)
    movl  $4, 12(%rax)
    ret

Clearly, LLVM should be able to produce something like this (which even GCC
4.2 is able to produce):

  _test:
    movl  $4, 12(%rax)
    ret

What is going on here?  The problem is that the -O0 IR has lost the
information that bytes 5-7 of the structure are structure padding.  Because
the front-end knows that they are undefined, the llvm.memcpy does not need to
transfer them.

When an alloca (like %tmp) is used as the source and destination of memcpy's,
and when it has padding, the SRoA pass fails to scalarize the alloca into its
elements, and therefore has to promote the entire thing into a ridiculous
integer.  We often fail to optimize away parts of these huge integers.

While this particular case could (in principle) be improved, there are other
problems as well.  The code generator ultimately lowers many llvm.memcpy's
into individual loads and stores, e.g. for code like:

  struct foo a, b;
  void test() { a = b; }

Because we have lost information about padding, we actually do a full
transfer of the padding.  This usually isn't a problem, but it can cause
store-to-load forwarding stalls on some processors when a small element was
previously stored as its smaller type.

Finally, many of the other optimizations that hack on memory transfers (e.g.
memdep, DSE, etc.) can benefit from information about padding.
//===----------------------------------------------------------------------===//
// Structure Padding Information in IR
//===----------------------------------------------------------------------===//

The proposed solution to this problem is really straightforward: Clang should
(at -O1 and above) attach an MDNode to structure memcpy instructions that
informs the optimizer and code generator about structure padding.  For
example, the first generated memcpy above could become:

  ...
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* bitcast (%struct.foo* @a to i8*), i64 44, i32 4, i1 false), !padding !0
  ...
  !0 = metadata !{ i64 5, i64 8 }

This indicates that bytes 5-7 (the half-open byte range [5, 8)) are undefined
in the memcpy destination.  With this information, the code generator can
avoid copying these bytes when inlining the memcpy, SRoA can cross-reference
the padding of the alloca with the padding defined by the memcpy, DSE can add
this metadata to transfers which are later partially overwritten, etc.