[docs][PerformanceTips] Point people towards llvm-dev

[oota-llvm.git] / docs / Frontend / PerformanceTips.rst
diff --git a/docs/Frontend/PerformanceTips.rst b/docs/Frontend/PerformanceTips.rst

index 5d7ad590464c3611c459155172457f2e0fb854f0..27d0c430cdb6d8734944fc2103a9ead8b068e1b3 100644 (file)
--- a/docs/Frontend/PerformanceTips.rst
+++ b/docs/Frontend/PerformanceTips.rst
@@ -15,8 +15,11 @@ generate IR that optimizes well.  As with any optimizer, LLVM has its strengths
  and weaknesses.  In some cases, surprisingly small changes in the source IR 
  can have a large effect on the generated code.  
  
+IR Best Practices
+=================
+
  Avoid loads and stores of large aggregate type
-================================================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  LLVM currently does not optimize well loads and stores of large :ref:`aggregate
  types <t_aggregate>` (i.e. structs and arrays).  As an alternative, consider 
@@ -27,7 +30,7 @@ instruction supported by the targeted hardware are well supported.  These can
  be an effective way to represent collections of small packed fields.  
  
  Prefer zext over sext when legal
-==================================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  On some architectures (X86_64 is one), sign extension can involve an extra 
  instruction whereas zero extension can be folded into a load.  LLVM will try to
@@ -39,7 +42,7 @@ Alternatively, you can :ref:`specify the range of the value using metadata
  <range-metadata>` and LLVM can do the sext to zext conversion for you.
  
  Zext GEP indices to machine register width
-============================================
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  Internally, LLVM often promotes the width of GEP indices to machine register
  width.  When it does so, it will default to using sign extension (sext) 
@@ -47,61 +50,188 @@ operations for safety.  If your source language provides information about
  the range of the index, you may wish to manually extend indices to machine 
  register width using a zext instruction.
  
-Other things to consider
-=========================
+Other Things to Consider
+^^^^^^^^^^^^^^^^^^^^^^^^
  
  #. Make sure that a DataLayout is provided (this will likely become required in
     the near future, but is certainly important for optimization).
  
-#. Add nsw/nuw/fast-math flags as appropriate
+#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing 
+   analysis), prefer GEPs
+
+#. Use the "most-private" possible linkage types for the functions being defined
+   (private, internal or linkonce_odr preferably)
+
+#. Prefer globals over inttoptr of a constant address - this gives you 
+   dereferencability information.  In MCJIT, use getSymbolAddress to provide 
+   actual address.
+
+#. Be wary of ordered and atomic memory operations.  They are hard to optimize 
+   and may not be well optimized by the current optimizer.  Depending on your
+   source language, you may consider using fences instead.
+
+#. If calling a function which is known to throw an exception (unwind), use 
+   an invoke with a normal destination which contains an unreachable 
+   instruction.  This form conveys to the optimizer that the call returns 
+   abnormally.  For an invoke which neither returns normally or requires unwind
+   code in the current function, you can use a noreturn call instruction if 
+   desired.  This is generally not required because the optimizer will convert
+   an invoke with an unreachable unwind destination to a call instruction.
+
+#. Use profile metadata to indicate statically known cold paths, even if 
+   dynamic profiling information is not available.  This can make a large 
+   difference in code placement and thus the performance of tight loops.
+
+#. When generating code for loops, try to avoid terminating the header block of
+   the loop earlier than necessary.  If the terminator of the loop header 
+   block is a loop exiting conditional branch, the effectiveness of LICM will
+   be limited for loads not in the header.  (This is due to the fact that LLVM 
+   may not know such a load is safe to speculatively execute and thus can't 
+   lift an otherwise loop invariant load unless it can prove the exiting 
+   condition is not taken.)  It can be profitable, in some cases, to emit such 
+   instructions into the header even if they are not used along a rarely 
+   executed path that exits the loop.  This guidance specifically does not 
+   apply if the condition which terminates the loop header is itself invariant,
+   or can be easily discharged by inspecting the loop index variables.
+
+#. In hot loops, consider duplicating instructions from small basic blocks 
+   which end in highly predictable terminators into their successor blocks.  
+   If a hot successor block contains instructions which can be vectorized 
+   with the duplicated ones, this can provide a noticeable throughput
+   improvement.  Note that this is not always profitable and does involve a 
+   potentially large increase in code size.
+
+#. Avoid high in-degree basic blocks (e.g. basic blocks with dozens or hundreds
+   of predecessors).  Among other issues, the register allocator is known to 
+   perform badly with confronted with such structures.  The only exception to 
+   this guidance is that a unified return block with high in-degree is fine.
+
+#. When checking a value against a constant, emit the check using a consistent
+   comparison type.  The GVN pass *will* optimize redundant equalities even if
+   the type of comparison is inverted, but GVN only runs late in the pipeline.
+   As a result, you may miss the opportunity to run other important 
+   optimizations.  Improvements to EarlyCSE to remove this issue are tracked in 
+   Bug 23333.
+
+#. Avoid using arithmetic intrinsics unless you are *required* by your source 
+   language specification to emit a particular code sequence.  The optimizer 
+   is quite good at reasoning about general control flow and arithmetic, it is
+   not anywhere near as strong at reasoning about the various intrinsics.  If 
+   profitable for code generation purposes, the optimizer will likely form the 
+   intrinsics itself late in the optimization pipeline.  It is *very* rarely 
+   profitable to emit these directly in the language frontend.  This item
+   explicitly includes the use of the :ref:`overflow intrinsics <int_overflow>`.
+
+#. Avoid using the :ref:`assume intrinsic <int_assume>` until you've 
+   established that a) there's no other way to express the given fact and b) 
+   that fact is critical for optimization purposes.  Assumes are a great 
+   prototyping mechanism, but they can have negative effects on both compile 
+   time and optimization effectiveness.  The former is fixable with enough 
+   effort, but the later is fairly fundamental to their designed purpose.
+
+
+Describing Language Specific Properties
+=======================================
+
+When translating a source language to LLVM, finding ways to express concepts 
+and guarantees available in your source language which are not natively 
+provided by LLVM IR will greatly improve LLVM's ability to optimize your code. 
+As an example, C/C++'s ability to mark every add as "no signed wrap (nsw)" goes
+a long way to assisting the optimizer in reasoning about loop induction 
+variables and thus generating more optimal code for loops.  
+
+The LLVM LangRef includes a number of mechanisms for annotating the IR with 
+additional semantic information.  It is *strongly* recommended that you become 
+highly familiar with this document.  The list below is intended to highlight a 
+couple of items of particular interest, but is by no means exhaustive.
+
+Restricted Operation Semantics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+#. Add nsw/nuw flags as appropriate.  Reasoning about overflow is 
+   generally hard for an optimizer so providing these facts from the frontend 
+   can be very impactful.  
+
+#. Use fast-math flags on floating point operations if legal.  If you don't 
+   need strict IEEE floating point semantics, there are a number of additional 
+   optimizations that can be performed.  This can be highly impactful for 
+   floating point intensive computations.
+
+Describing Aliasing Properties
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  
  #. Add noalias/align/dereferenceable/nonnull to function arguments and return 
     values as appropriate
  
-#. Mark functions as readnone/readonly/nounwind when known (especially for 
-   external functions)
+#. Use pointer aliasing metadata, especially tbaa metadata, to communicate 
+   otherwise-non-deducible pointer aliasing facts
  
-#. Use ptrtoint/inttoptr sparingly (they interfere with pointer aliasing 
-   analysis), prefer GEPs
+#. Use inbounds on geps.  This can help to disambiguate some aliasing queries.
+
+
+Modeling Memory Effects
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+#. Mark functions as readnone/readonly/argmemonly or noreturn/nounwind when
+   known.  The optimizer will try to infer these flags, but may not always be
+   able to.  Manual annotations are particularly important for external 
+   functions that the optimizer can not analyze.
  
  #. Use the lifetime.start/lifetime.end and invariant.start/invariant.end 
     intrinsics where possible.  Common profitable uses are for stack like data 
     structures (thus allowing dead store elimination) and for describing 
     life times of allocas (thus allowing smaller stack sizes).  
  
-#. Use pointer aliasing metadata, especially tbaa metadata, to communicate 
-   otherwise-non-deducible pointer aliasing facts
-
-#. Use the "most-private" possible linkage types for the functions being defined
-   (private, internal or linkonce_odr preferably)
-
  #. Mark invariant locations using !invariant.load and TBAA's constant flags
  
-#. Prefer globals over inttoptr of a constant address - this gives you 
-   dereferencability information.  In MCJIT, use getSymbolAddress to provide 
-   actual address.
+Pass Ordering
+^^^^^^^^^^^^^
  
-#. Be wary of ordered and atomic memory operations.  They are hard to optimize 
-   and may not be well optimized by the current optimizer.  Depending on your
-   source language, you may consider using fences instead.
+One of the most common mistakes made by new language frontend projects is to 
+use the existing -O2 or -O3 pass pipelines as is.  These pass pipelines make a
+good starting point for an optimizing compiler for any language, but they have 
+been carefully tuned for C and C++, not your target language.  You will almost 
+certainly need to use a custom pass order to achieve optimal performance.  A 
+couple specific suggestions:
+
+#. For languages with numerous rarely executed guard conditions (e.g. null 
+   checks, type checks, range checks) consider adding an extra execution or 
+   two of LoopUnswith and LICM to your pass order.  The standard pass order, 
+   which is tuned for C and C++ applications, may not be sufficient to remove 
+   all dischargeable checks from loops.
  
  #. If you language uses range checks, consider using the IRCE pass.  It is not 
     currently part of the standard pass order.
  
-p.s. If you want to help improve this document, patches expanding any of the 
-above items into standalone sections of their own with a more complete 
-discussion would be very welcome.  
+#. A useful sanity check to run is to run your optimized IR back through the 
+   -O2 pipeline again.  If you see noticeable improvement in the resulting IR, 
+   you likely need to adjust your pass order.
+
+
+I Still Can't Find What I'm Looking For
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you didn't find what you were looking for above, consider proposing an piece
+of metadata which provides the optimization hint you need.  Such extensions are
+relatively common and are generally well received by the community.  You will 
+need to ensure that your proposal is sufficiently general so that it benefits 
+others if you wish to contribute it upstream.
  
+You should also consider describing the problem you're facing on `llvm-dev 
+<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_ and asking for advice.  
+It's entirely possible someone has encountered your problem before and can 
+give good advice.  If there are multiple interested parties, that also 
+increases the chances that a metadata extension would be well received by the
+community as a whole.  
  
  Adding to this document
  =======================
  
  If you run across a case that you feel deserves to be covered here, please send
  a patch to `llvm-commits
-<http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits>`_ for review.
+<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.
  
-If you have questions on these items, please direct them to `llvmdev 
-<http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev>`_.  The more relevant 
+If you have questions on these items, please direct them to `llvm-dev 
+<http://lists.llvm.org/mailman/listinfo/llvm-dev>`_.  The more relevant 
  context you are able to give to your question, the more likely it is to be 
  answered.