Extensible LLVM Metadata
------------------------
Sep 10, 2009

Adding 'extra' information to LLVM instructions is something that people have long been interested in, and that we have been doing in ad-hoc ways. For example, the 'align' field on load and store operations, the 'nsw', 'nuw', and 'inbounds' annotations on arithmetic and getelementptr instructions; we have even added new intrinsics (like llvm.memcpy) just to encode alignment metadata.

Looking into the future, there is a lot more that we want to do that fits into the category of metadata: we want instructions to directly maintain their debug location info (instead of modeling it with llvm.dbg.stoppoint), we want to be able to represent TBAA information, we want to know which calls to C++ constructors and destructors are elidable, we want to attach unwind information to invoke instructions, etc.

The demand for metadata is even greater in non-C languages. For high-level scripting languages like python, it would be nice to be able to do static type inference and annotate various pointers with type class information ("this is a python string object", "this is a dictionary", whatever).

While many people have long wanted metadata and we have some simple forms of it now, the big problem is how to make it extensible by clients that the optimizer doesn't know about (e.g. a python frontend). For example, it would be great for LLVM optimizers to "pass along" metadata that they don't understand, allowing a python-specific LLVM IR optimization to act on the metadata.

There is one more wrinkle that is important to draw out. Debug info is interesting because the optimizer is *not allowed* to change its behavior in the presence of debug info. If debug info would prevent some transformation, it should either be updated or stripped out of the way. On the other hand, other information (such as TBAA) always *must* be kept up to date.
If you end up with a TBAA tag on a load that the load was not supposed to get, then it is entirely possible that you will miscompile the program. Here is a simple motivating example:

    if (..) {
      ...
      A = load P, TBAA #1234
    } else {
      ...
      B = load P, TBAA #4321
    }
    C = phi(A, B)

One way to sink the loads (commonizing them, reducing code size, etc.) would be to move A down to C, do a C->replaceAllUsesWith(A), then delete B. Doing the transformation like this would change the type tag on the else branch to #1234 though, so this would be illegal.

Note that this document talks about metadata for instructions. It might make sense to generalize this to metadata for all non-uniqued values (global variables, functions, basic blocks, arguments), but I'm just keeping it simple for now.

//===----------------------------------------------------------------------===//
// Proposed model for Metadata
//===----------------------------------------------------------------------===//

I think that we should consider metadata to always be optional (stripping it off is always correct, even though it is not optimal). For debug info this is obviously true. For TBAA information, removing a TBAA tag puts the operation in the 'aliases everything' set. Stripping alignment off a load or store resets the alignment to 1 (not 'default alignment for the type').

Each metadata kind is given a unique ID (MDKind), which is just a class wrapper around an 'unsigned short' for type safety. LLVM core would provide builtin MDKinds for the metadata that it knows about, like load/store alignment, nuw, inbounds, etc. We'd add new ones for debug locations, TBAA, etc. when they come around.

Each instruction should be allowed to have an arbitrary number of metadata records associated with it, but each instruction can only have one record of a particular kind: it doesn't make sense to have two different TBAA tags or alignment records.
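The type-safe wrapper idea above can be sketched in a few lines. This is purely illustrative (these are not the actual LLVM classes, and the specific IDs assigned to builtin kinds are made up here):

```cpp
#include <cassert>

// Hypothetical sketch of the MDKind wrapper: a thin, type-safe class
// around an 'unsigned short', so kind IDs can't be confused with other
// integers in the API.
class MDKind {
  unsigned short ID;
public:
  explicit MDKind(unsigned short id) : ID(id) {}
  unsigned short getID() const { return ID; }
  bool operator==(const MDKind &RHS) const { return ID == RHS.ID; }
};

// Builtin kinds that LLVM core would provide; client-registered kinds
// would get IDs above the builtin range. IDs here are arbitrary.
namespace BuiltinMDKind {
  const MDKind LdStAlign(0);
  const MDKind TBAATag(1);
  const MDKind DbgLoc(2);
}
```

The explicit constructor is the point of the exercise: a plain `unsigned short` can't silently convert into an MDKind at a call site.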
Given this, we should have an API something along the lines of:

    MDNode *TBAAInfo = Inst->getMD(MDKind::TBAATag);
    Inst2->setMD(MDKind::TBAATag, TBAAInfo);

and:

    MyLoad->setMD(MDKind::LdStAlign, MDNode::Create(ConstantInt::get(16, ..)));

Note that though people would never write this (they'd just use LoadInst::setAlign), we'd expect this to work for consistency.

//===----------------------------------------------------------------------===//
// Extensible Metadata kinds.
//===----------------------------------------------------------------------===//

Once we have this sort of world, adding extensible metadata is pretty simple: we'd just give LLVMContext a new method to register metadata kinds, which returns an MDKind. For example:

    MDKind TypeLatticeKind =
      TheContext->RegisterMDKind("unladenswallow.type.lattice");
    SomeInst->setMD(TypeLatticeKind, SomeMDNode);

Obviously we could use reverse URIs if people prefer that, but llvm shouldn't care. The important part of this is that we retain the ability to reproduce a textual version of the LLVM IR. When printing out LLVM IR to .ll files, we'd get something like:

    !42 = metadata { ... }
    ...
    %X = load i32* %P, align 16, unladenswallow.type.lattice !42, tbaa !21

//===----------------------------------------------------------------------===//
// Efficient encoding of alignment, debug info, and other metadata.
//===----------------------------------------------------------------------===//

Metadata comes in two kinds: metadata that is lightly sprinkled through the IR only in special cases (such as ctor/dtor information for NRVO), and information that is on everything in common cases (load alignment information, debug info locations, etc.). I'd like to retain full generality of information storage without penalizing the common cases.
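The registration scheme above amounts to a per-context table from kind names to IDs. Here is a minimal sketch of what that could look like, using a plain std::map purely for exposition (the RegisterMDKind name is from the proposal; the class and implementation are made up here):

```cpp
#include <map>
#include <string>

// Illustrative stand-in for the per-context metadata kind registry.
// Registering the same name twice must hand back the same ID, so that
// separately loaded clients agree on the kind numbering.
class LLVMContextSketch {
  std::map<std::string, unsigned short> Kinds;
  unsigned short NextID = 0;  // builtin kinds would occupy the low IDs
public:
  unsigned short RegisterMDKind(const std::string &Name) {
    auto It = Kinds.find(Name);
    if (It != Kinds.end())
      return It->second;          // already registered: reuse the ID
    return Kinds[Name] = NextID++; // otherwise hand out a fresh ID
  }
};
```

Keying by string is what makes the namespaced names ("unladenswallow.type.lattice") work: two passes that spell the same name get the same MDKind without any central coordination.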
To support this, using load/store alignment as an example, we would keep the special case of efficiently encoding alignment information in a few bits of the load/store operation, but would enhance setMD/getMD to paper over the optimization. For example, setMD could look like this:

    void Instruction::setMD(MDKind K, MDNode *Node) {
      switch (K.getBuiltinKind()) {
      case MDKind::LdStAlignID:
        // This metadata is only valid on loads and stores.
        if (LoadInst *LI = dyn_cast<LoadInst>(this))
          LI->setAlignment(getIntFromMDNode(Node));
        else {
          StoreInst *SI = cast<StoreInst>(this);
          SI->setAlignment(getIntFromMDNode(Node));
        }
        ...

The other half of the efficiency issue is handling metadata that should not be baked into instructions in ad-hoc ways: things like TBAA information, random user-extensible metadata, NRVO info, etc. In principle, we could have every instruction hold a vector or map of (MDKind, MDNode*) pairs, but this would be really wasteful for the common case of no metadata associated with an instruction. Instead, we should implement this the way ValueHandles are stored: keep the metadata in a hash table on the side (a DenseMap) and have a bit in Instruction to indicate that the instruction has an entry in the hash table. This optimizes the common case (no metadata on an instruction) but allows arbitrary extensibility.

//===----------------------------------------------------------------------===//
// Optimizer Impact
//===----------------------------------------------------------------------===//

If we go with this model for metadata, we need a way for optimizers to preserve and update the metadata without having to know about its semantics. Ultimately, we really don't want (and can't have) all optimizers knowing about all the different types of metadata and keeping them up to date. Fortunately, metadata (as defined above) can always be dropped when a pass is not aware of it and doesn't know what to do with it.
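The side-table scheme can be sketched in isolation. Everything here is a stand-in (InstStub for Instruction, MDNodeStub for MDNode, std::unordered_map for DenseMap); what it demonstrates is the one-bit fast path plus the one-record-per-kind invariant:

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

struct MDNodeStub {};  // placeholder for a real MDNode

struct InstStub {
  bool HasMetadata = false;  // the single bit stored in the instruction
};

using MDKindID = unsigned short;
using MDList = std::vector<std::pair<MDKindID, MDNodeStub*>>;

// Illustrative side table: metadata lives here, keyed by instruction,
// rather than being stored inside every instruction.
class MetadataStore {
  std::unordered_map<InstStub*, MDList> Table;  // stand-in for a DenseMap
public:
  void setMD(InstStub *I, MDKindID K, MDNodeStub *N) {
    MDList &L = Table[I];
    for (auto &Entry : L)
      if (Entry.first == K) { Entry.second = N; return; }  // one per kind
    L.emplace_back(K, N);
    I->HasMetadata = true;
  }
  MDNodeStub *getMD(InstStub *I, MDKindID K) const {
    if (!I->HasMetadata)
      return nullptr;  // common case: no hash lookup at all
    auto It = Table.find(I);
    if (It == Table.end()) return nullptr;
    for (const auto &Entry : It->second)
      if (Entry.first == K) return Entry.second;
    return nullptr;
  }
};
```

The early return on `!I->HasMetadata` is the whole optimization: the typical metadata-free instruction pays one bit of storage and one branch, never a hash probe.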
In order for various transformations to be safe in the presence of metadata, the optimizer needs to be metadata aware, and either remove all metadata when it makes certain changes or update it when it is specifically aware of particular kinds. There are various ways we can implement this. Cloning an instruction can just clone the metadata, for example. In the example above where two operations are "merged" together into one, we can make the default be to keep metadata that is identical between the two original operations and drop it if not.

Additionally, we could make the metadata registration hook allow callbacks to be registered that provide a "merge handler" for the metadata. This can be useful in a number of cases: for example, merging a load with alignment 8 and a load with alignment 4 could produce a load with alignment 4. In the python type system example, the merge operation would be to change the metadata to the common ancestor in the type lattice.

It is likely that we'll want to define a number of transformation utilities. As the optimizer becomes more metadata aware, we can expand on these.
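The two merge policies above (the identical-or-drop default, and a kind-specific handler) can be sketched with plain integers standing in for MDNodes. The function names and the use of 0 for 'no metadata' are purely illustrative:

```cpp
#include <algorithm>

// Default policy for merging two instructions' metadata of the same kind:
// keep it only if both sides agree, otherwise drop it (0 = 'no metadata').
// This is always safe, since metadata is defined to be droppable.
unsigned mergeDefault(unsigned A, unsigned B) {
  return A == B ? A : 0;
}

// A kind-specific merge handler for load/store alignment: the merged
// instruction can only promise the weaker of the two guarantees, so
// alignment 8 merged with alignment 4 yields alignment 4.
unsigned mergeAlign(unsigned A, unsigned B) {
  return std::min(A, B);
}
```

A registered merge handler like mergeAlign lets a transformation preserve a weakened form of the metadata instead of discarding it, which is exactly what the default policy would do whenever the two sides differ.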