//===----------------------------------------------------------------------===//
//                      Aggregates as First Class Values
//===----------------------------------------------------------------------===//

4/26/2008 - initial revision

LLVM has a distinction between first class values and other types of values.  
First class values can be returned by instructions, passed to functions, 
loaded, stored, PHI'd etc.  Currently the first class value types are:

  1. Integer
  2. Floating point
  3. Pointer
  4. Vector
  5. Opaque (which is assumed to eventually resolve to 1-4)

The non-first-class types are:

  5. Array
  6. Structure/Packed Structure
  7. Function
  8. Void
  9. Label

In addition to first class values, there is a new wrinkle as of LLVM 2.3:
calls and invokes are allowed to return structs directly to support Multiple
Return Values.  This is currently an odd special case just for calls, and only
currently supports structs whose elements are all first class values.  This
proposal basically suggests that we extend this to make arbitrary structs and
arrays first-class values as well.

There is a separate proposal to eliminate Void, and Label may eventually also
become a first-class type (to directly support taking the address of a basic
block).

//===----------------------------------------------------------------------===//
// The idea

The basic idea is simple: structs and arrays become first class values, so they
can be loaded, stored, PHI'd, passed/returned by-value to/from calls etc.
Because they are first-class "register" values, it is not possible to take
their address.  Also, I do not propose making arithmetic operators work on 
these types (you can't add to arrays of double for example).

To support creation of these first class aggregates, a new instruction to
complement getresult is added, named 'insertvalue' (I'd like to name it 
'insertelement' but that is already taken for vectors, and insertelement
allows a variable index unlike insertvalue). 'insertvalue' takes two Value
operands: a input aggregate to 'update' and an input value to insert into
it.  It also takes a variadic list of integers that specify the insertion
position.  The result type of the instruction is always the same type as
the first Value operand.  For example:

  %X = insertvalue {i32, i8} %C, i32 %V, 0  ; yields {i32, i8}
  %Y = insertvalue {float, {i32, i8}} %I, {i32, i8} %X, 1
  %Z = insertvalue {float, {i32, i8}} %Y, i32 42, 1, 0

Since we are generalizing getresult, I propose we rename it to 'extractvalue'.
It works just like 'getresult' does today, but takes a variadic list of
indices:

  %Q = extractvalue {float, {i32, i8}} %Z, 1, 1  ; Returns i8

The indices do not have types because they are not LLVM Values, they are simple
integers like the index for getresult today is.

//===----------------------------------------------------------------------===//
// Code Generation

Code generation for these aggregates is very simple: they are implicitly
scalarized and all of the elements are turned into registers.  This should
happen just like we current split up vectors into smaller vectors (or scalars)
when the vector type is not supported by the target.  This can all be done in
'SelectionDAGISel' so the majority of the code generator won't even notice.
Also, the optimizer will generally decompose these as much as possible, just
to remove abstraction and from the code and expose the computation.

An interesting subtlty of this is that "padding" in aggregates is not modified
by stores of a struct.  Also, because elements are scalarized immediately,
large aggregates will be as efficient as handling their elements.  This means
that very large aggregates (those with dozens of elements) will not be
efficient: it would probably be better to leave them in memory.  This also
means that you shouldn't try to emulate memcpy by doing large load and stores: 
it will be more efficient to use memcpy, which can do bulk transfer instead of
element-by-element copies.

As far as the ABI goes, I suggest that aggregates be passed just like 'byval'
arguments by default.  The only difference is that the value is passed and
received, instead of a shadow copy of the pointee passed in.

//===----------------------------------------------------------------------===//
// Benefits

The first benefit of this change is that the LLVM IR becomes more orthogonal:
right now, calls returning aggregates is a strange special case that the
compiler has to handle.  If aggregates were first class values, the magic
wouldn't be part of calls.  This would also allow use to make 'return' more
consistent: it would always take zero or one argument (and killing off void
would make it always be one) instead of being variadic.

The next benefit is that this makes front-end authors lives easier in a few
cases.  For example, some languages have first-class support for generic 
tuples and this means that they don't have to track the elements themselves or
force the tuples into memory.  One specific example of this is handling of
complex numbers: Currently clang has a duplicate copy of the scalar code
generator that is specialized for complex because "codegen" on an expression
needs to return two Value*'s instead of one.  llvm-gcc avoids this by forcing
things into memory, but this is bad for compile time and performance at -O0.

Another benefit is that this may potentially help solve some of the 
performance issues we suffer with small 'byval' arguments.  'byval' is
incredibly important for the case when someone passes a large struct by
value (which people do sometimes, without knowing the perf cost), but
it produces poor code when the byval argument is a *small* struct.  The
performance hit comes from the fact that the value is forced into memory
(because byval requires its address to be exposed).  For small structs that 
are usually promoted to registers by SRoA (e.g. long double complex on some 
architectures) this ends up causing a lot of stack traffic that is unneeded
and generally disables mem2reg and sroa on values passed to functions.  It
is actually interesting that byval and first-class aggregates have exactly
complementary strengths: the first is great for large structs, the later is
great for small ones.

The final benefit is that a small collections of optimizations will improve.
For example, mem2reg would suddenly subsume a big chunk of the functionality
of SRoA, it would be easy to get interprocedural passes propagating 
information through these arguments (e.g. interprocedural constant prop of
elements of the struct), the inliner doesn't need to make a memcpy for these
like it does for byval, etc.