//===----------------------------------------------------------------------===//
// Fix the PassManager to make the inliner work *correctly*
//===----------------------------------------------------------------------===//

2004-08-18

The problem: inlining is an extremely important optimization for many classes
of programming languages. The traditional way to inline is to inline very
aggressively (early in the compile process), up to a particular threshold. The
idea is that subsequent scalar and other optimizations will optimize away most
of what is left.

There are several problems with this. First, in many cases, this does not
inline ENOUGH. In particular, programs with many layers of abstraction simply
*dissolve* during optimization, as layer after layer is peeled away. Because we
inline up to a threshold, once the inline threshold has been met, nothing
further is inlined.

Second, this approach inlines TOO MUCH. Largely because of the first problem,
and due to benchmarks like stepanov, compiler authors tend to crank up the
inlining threshold far too high, or otherwise make their heuristics inline far
too much. This bloats executables and leads to excessive compile time. Because
not all code has layers of abstraction that are easily removed, inlining such
code with the same threshold used for "abstract" programs simply bloats it.

Third, even if we inline exactly the right amount (given omniscient
heuristics), we still lose. The problem is that code is duplicated (inlined)
all over the place before optimization. Because this code is duplicated, it
has to be optimized away in every place that it is inlined, increasing compile
time. It would be better if we could optimize the code BEFORE the inlining
happens, so that *optimized* code is duplicated instead of unoptimized code.

Both GCC and LLVM suffer from this problem, though GCC does to a much higher
degree. In particular, GCC does very very little optimization before inlining
(mostly tree expression folding).
LLVM does a bit more, but is limited to mem2reg and simplifycfg. GCC inlines
very aggressively (at -O3), which leads to code bloat and slow compile times.
LLVM has a more conservative inliner (which it can usually afford due to IP
optzns), which causes it to miss certain "obvious" cases.

//===----------------------------------------------------------------------===//
// The Solution
//===----------------------------------------------------------------------===//

The fix for this is actually very easy and straightforward. In particular, the
inliner (and a few other optzns) are CallGraphSCCPass's, meaning that they
operate "bottom-up" on the call graph (visiting callees before callers).
Currently, the CallGraphSCCPass class is just a "Pass", and is scheduled
exactly as such by the PassManager. Instead, the PassManager should be
enhanced to understand the semantics of CallGraphSCCPasses (just like it knows
about FunctionPass'es) and schedule function passes into them.

In particular, consider the pass sequence "-inline -mem2reg". Instead of
running the -inline pass on the whole module, then the -mem2reg pass on every
function in the module, in this configuration:

  PassManager
    Inliner
    FunctionPassBatcher
      Mem2Reg

the pass manager should interlace execution of mem2reg *with* executions of
the inliner pass, in this configuration:

  PassManager
    CallGraphSCCPassBatcher
      Inliner
      SCCFunctionPassBatcher
        Mem2Reg

Consider a program with a call graph like this:

  void A() { B(); C(); }
  void B() {}
  void C() {}

On this program, the inliner should be run on B (doing nothing, since B has no
callees), then the mem2reg pass should be run on B. Next, the inliner should
be run on C (again doing nothing), followed by mem2reg on C. Once B and C are
done, the inliner would run on A, and (finally) mem2reg would be run on A.

The strength of this approach is that it is A) simple and automatic, and
B) effective.
It is simple because we just need to introduce a new batcher class for
CallGraphSCCPass's, and effective because we are now inlining functions that
have already been optimized by all of the scalar optimizations. This means
that we should be able to set the inlining threshold lower (as there is less
chance that subsequent optimizations will decimate the remaining code), and
because we are inlining optimized code, we don't have to duplicate
optimizations. The pre-existing heuristics for inlining (which guess when
inlining will make other important optimizations more effective in the caller)
automatically become more effective and less pessimistic.

Note that there are some funny things that this implies. In particular, if you
have an empty pass manager and add a function pass to it, the function pass
will be run on all functions in the module in module order. OTOH, if you add
an SCC pass followed by a FunctionPass, the function pass will be executed in
SCC order on the functions (bottom-up on the call graph), not in module order.
Funny how we defined the semantics of FunctionPass's such that they aren't
able to notice this. :)

Enhancing the PassManager is currently more difficult than it should be, owing
largely to the fact that PassManagerT.h is a mess. It should be rewritten in
an object-oriented style, in the process allowing stuff like this (and the
loop optimizer) to be easily added.