+<p>The first two bytes of a bitcode file are 'BC' (0x42, 0x43).
+The second two bytes are an application-specific magic number. Generic
+bitcode tools can look at only the first two bytes to verify the file is
+bitcode, while application-specific programs will want to look at all four.</p>
+
+</div>
+
+<!-- ======================================================================= -->
+<div class="doc_subsection"><a name="primitives">Primitives</a>
+</div>
+
+<div class="doc_text">
+
+<p>
+A bitstream literally consists of a stream of bits, which are read in order
+starting with the least significant bit of each byte. The stream is made up of a
+number of primitive values that encode a stream of unsigned integer values.
+These
+integers are are encoded in two ways: either as <a href="#fixedwidth">Fixed
+Width Integers</a> or as <a href="#variablewidth">Variable Width
+Integers</a>.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="fixedwidth">Fixed Width Integers</a>
+</div>
+
+<div class="doc_text">
+
+<p>Fixed-width integer values have their low bits emitted directly to the file.
+ For example, a 3-bit integer value encodes 1 as 001. Fixed width integers
+ are used when there are a well-known number of options for a field. For
+ example, boolean values are usually encoded with a 1-bit wide integer.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="variablewidth">Variable Width
+Integers</a></div>
+
+<div class="doc_text">
+
+<p>Variable-width integer (VBR) values encode values of arbitrary size,
+optimizing for the case where the values are small. Given a 4-bit VBR field,
+any 3-bit value (0 through 7) is encoded directly, with the high bit set to
+zero. Values larger than N-1 bits emit their bits in a series of N-1 bit
+chunks, where all but the last set the high bit.</p>
+
+<p>For example, the value 27 (0x1B) is encoded as 1011 0011 when emitted as a
+vbr4 value. The first set of four bits indicates the value 3 (011) with a
+continuation piece (indicated by a high bit of 1). The next word indicates a
+value of 24 (011 << 3) with no continuation. The sum (3+24) yields the value
+27.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="char6">6-bit characters</a></div>
+
+<div class="doc_text">
+
+<p>6-bit characters encode common characters into a fixed 6-bit field. They
+represent the following characters with the following 6-bit values:</p>
+
+<div class="doc_code">
+<pre>
+'a' .. 'z' — 0 .. 25
+'A' .. 'Z' — 26 .. 51
+'0' .. '9' — 52 .. 61
+ '.' — 62
+ '_' — 63
+</pre>
+</div>
+
+<p>This encoding is only suitable for encoding characters and strings that
+consist only of the above characters. It is completely incapable of encoding
+characters not in the set.</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="wordalign">Word Alignment</a></div>
+
+<div class="doc_text">
+
+<p>Occasionally, it is useful to emit zero bits until the bitstream is a
+multiple of 32 bits. This ensures that the bit position in the stream can be
+represented as a multiple of 32-bit words.</p>
+
+</div>
+
+
+<!-- ======================================================================= -->
+<div class="doc_subsection"><a name="abbrevid">Abbreviation IDs</a>
+</div>
+
+<div class="doc_text">
+
+<p>
+A bitstream is a sequential series of <a href="#blocks">Blocks</a> and
+<a href="#datarecord">Data Records</a>. Both of these start with an
+abbreviation ID encoded as a fixed-bitwidth field. The width is specified by
+the current block, as described below. The value of the abbreviation ID
+specifies either a builtin ID (which have special meanings, defined below) or
+one of the abbreviation IDs defined by the stream itself.
+</p>
+
+<p>
+The set of builtin abbrev IDs is:
+</p>
+
+<ul>
+<li><tt>0 - <a href="#END_BLOCK">END_BLOCK</a></tt> — This abbrev ID marks
+ the end of the current block.</li>
+<li><tt>1 - <a href="#ENTER_SUBBLOCK">ENTER_SUBBLOCK</a></tt> — This
+ abbrev ID marks the beginning of a new block.</li>
+<li><tt>2 - <a href="#DEFINE_ABBREV">DEFINE_ABBREV</a></tt> — This defines
+ a new abbreviation.</li>
+<li><tt>3 - <a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> — This ID
+ specifies the definition of an unabbreviated record.</li>
+</ul>
+
+<p>Abbreviation IDs 4 and above are defined by the stream itself, and specify
+an <a href="#abbrev_records">abbreviated record encoding</a>.</p>
+
+</div>
+
+<!-- ======================================================================= -->
+<div class="doc_subsection"><a name="blocks">Blocks</a>
+</div>
+
+<div class="doc_text">
+
+<p>
+Blocks in a bitstream denote nested regions of the stream, and are identified by
+a content-specific id number (for example, LLVM IR uses an ID of 12 to represent
+function bodies). Block IDs 0-7 are reserved for <a href="#stdblocks">standard blocks</a>
+whose meaning is defined by Bitcode; block IDs 8 and greater are
+application specific. Nested blocks capture the hierachical structure of the data
+encoded in it, and various properties are associated with blocks as the file is
+parsed. Block definitions allow the reader to efficiently skip blocks
+in constant time if the reader wants a summary of blocks, or if it wants to
+efficiently skip data they do not understand. The LLVM IR reader uses this
+mechanism to skip function bodies, lazily reading them on demand.
+</p>
+
+<p>
+When reading and encoding the stream, several properties are maintained for the
+block. In particular, each block maintains:
+</p>
+
+<ol>
+<li>A current abbrev id width. This value starts at 2, and is set every time a
+ block record is entered. The block entry specifies the abbrev id width for
+ the body of the block.</li>
+
+<li>A set of abbreviations. Abbreviations may be defined within a block, in
+ which case they are only defined in that block (neither subblocks nor
+ enclosing blocks see the abbreviation). Abbreviations can also be defined
+ inside a <tt><a href="#BLOCKINFO">BLOCKINFO</a></tt> block, in which case
+ they are defined in all blocks that match the ID that the BLOCKINFO block is
+ describing.
+</li>
+</ol>
+
+<p>
+As sub blocks are entered, these properties are saved and the new sub-block has
+its own set of abbreviations, and its own abbrev id width. When a sub-block is
+popped, the saved values are restored.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="ENTER_SUBBLOCK">ENTER_SUBBLOCK
+Encoding</a></div>
+
+<div class="doc_text">
+
+<p><tt>[ENTER_SUBBLOCK, blockid<sub>vbr8</sub>, newabbrevlen<sub>vbr4</sub>,
+ <align32bits>, blocklen<sub>32</sub>]</tt></p>
+
+<p>
+The <tt>ENTER_SUBBLOCK</tt> abbreviation ID specifies the start of a new block
+record. The <tt>blockid</tt> value is encoded as an 8-bit VBR identifier, and
+indicates the type of block being entered, which can be
+a <a href="#stdblocks">standard block</a> or an application-specific block.
+The <tt>newabbrevlen</tt> value is a 4-bit VBR, which specifies the abbrev id
+width for the sub-block. The <tt>blocklen</tt> value is a 32-bit aligned value
+that specifies the size of the subblock in 32-bit words. This value allows the
+reader to skip over the entire block in one jump.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="END_BLOCK">END_BLOCK
+Encoding</a></div>
+
+<div class="doc_text">
+
+<p><tt>[END_BLOCK, <align32bits>]</tt></p>
+
+<p>
+The <tt>END_BLOCK</tt> abbreviation ID specifies the end of the current block
+record. Its end is aligned to 32-bits to ensure that the size of the block is
+an even multiple of 32-bits.
+</p>
+
+</div>
+
+
+
+<!-- ======================================================================= -->
+<div class="doc_subsection"><a name="datarecord">Data Records</a>
+</div>
+
+<div class="doc_text">
+<p>
+Data records consist of a record code and a number of (up to) 64-bit integer
+values. The interpretation of the code and values is application specific and
+there are multiple different ways to encode a record (with an unabbrev record or
+with an abbreviation). In the LLVM IR format, for example, there is a record
+which encodes the target triple of a module. The code is
+<tt>MODULE_CODE_TRIPLE</tt>, and the values of the record are the ASCII codes
+for the characters in the string.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="UNABBREV_RECORD">UNABBREV_RECORD
+Encoding</a></div>
+
+<div class="doc_text">
+
+<p><tt>[UNABBREV_RECORD, code<sub>vbr6</sub>, numops<sub>vbr6</sub>,
+ op0<sub>vbr6</sub>, op1<sub>vbr6</sub>, ...]</tt></p>
+
+<p>
+An <tt>UNABBREV_RECORD</tt> provides a default fallback encoding, which is both
+completely general and extremely inefficient. It can describe an arbitrary
+record by emitting the code and operands as vbrs.
+</p>
+
+<p>
+For example, emitting an LLVM IR target triple as an unabbreviated record
+requires emitting the <tt>UNABBREV_RECORD</tt> abbrevid, a vbr6 for the
+<tt>MODULE_CODE_TRIPLE</tt> code, a vbr6 for the length of the string, which is
+equal to the number of operands, and a vbr6 for each character. Because there
+are no letters with values less than 32, each letter would need to be emitted as
+at least a two-part VBR, which means that each letter would require at least 12
+bits. This is not an efficient encoding, but it is fully general.
+</p>
+
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"> <a name="abbrev_records">Abbreviated Record
+Encoding</a></div>
+
+<div class="doc_text">
+
+<p><tt>[<abbrevid>, fields...]</tt></p>
+
+<p>
+An abbreviated record is a abbreviation id followed by a set of fields that are
+encoded according to the <a href="#abbreviations">abbreviation definition</a>.
+This allows records to be encoded significantly more densely than records
+encoded with the <tt><a href="#UNABBREV_RECORD">UNABBREV_RECORD</a></tt> type,
+and allows the abbreviation types to be specified in the stream itself, which
+allows the files to be completely self describing. The actual encoding of
+abbreviations is defined below.
+</p>
+
+</div>
+
+<!-- ======================================================================= -->
+<div class="doc_subsection"><a name="abbreviations">Abbreviations</a>
+</div>
+
+<div class="doc_text">
+<p>
+Abbreviations are an important form of compression for bitstreams. The idea is
+to specify a dense encoding for a class of records once, then use that encoding
+to emit many records. It takes space to emit the encoding into the file, but
+the space is recouped (hopefully plus some) when the records that use it are
+emitted.
+</p>
+
+<p>
+Abbreviations can be determined dynamically per client, per file. Because the
+abbreviations are stored in the bitstream itself, different streams of the same
+format can contain different sets of abbreviations if the specific stream does
+not need it. As a concrete example, LLVM IR files usually emit an abbreviation
+for binary operators. If a specific LLVM module contained no or few binary
+operators, the abbreviation does not need to be emitted.
+</p>
+</div>
+
+<!-- _______________________________________________________________________ -->
+<div class="doc_subsubsection"><a name="DEFINE_ABBREV">DEFINE_ABBREV
+ Encoding</a></div>
+
+<div class="doc_text">
+
+<p><tt>[DEFINE_ABBREV, numabbrevops<sub>vbr5</sub>, abbrevop0, abbrevop1,
+ ...]</tt></p>
+
+<p>
+A <tt>DEFINE_ABBREV</tt> record adds an abbreviation to the list of currently
+defined abbreviations in the scope of this block. This definition only exists
+inside this immediate block — it is not visible in subblocks or enclosing
+blocks. Abbreviations are implicitly assigned IDs sequentially starting from 4
+(the first application-defined abbreviation ID). Any abbreviations defined in a
+<tt>BLOCKINFO</tt> record receive IDs first, in order, followed by any
+abbreviations defined within the block itself. Abbreviated data records
+reference this ID to indicate what abbreviation they are invoking.
+</p>
+
+<p>
+An abbreviation definition consists of the <tt>DEFINE_ABBREV</tt> abbrevid
+followed by a VBR that specifies the number of abbrev operands, then the abbrev
+operands themselves. Abbreviation operands come in three forms. They all start
+with a single bit that indicates whether the abbrev operand is a literal operand
+(when the bit is 1) or an encoding operand (when the bit is 0).
+</p>
+
+<ol>
+<li>Literal operands — <tt>[1<sub>1</sub>, litvalue<sub>vbr8</sub>]</tt>
+— Literal operands specify that the value in the result is always a single
+specific value. This specific value is emitted as a vbr8 after the bit
+indicating that it is a literal operand.</li>
+<li>Encoding info without data — <tt>[0<sub>1</sub>,
+ encoding<sub>3</sub>]</tt> — Operand encodings that do not have extra
+ data are just emitted as their code.
+</li>
+<li>Encoding info with data — <tt>[0<sub>1</sub>, encoding<sub>3</sub>,
+value<sub>vbr5</sub>]</tt> — Operand encodings that do have extra data are
+emitted as their code, followed by the extra data.
+</li>
+</ol>
+
+<p>The possible operand encodings are:</p>
+
+<ol>
+<li value="1">Fixed: The field should be emitted as
+ a <a href="#fixedwidth">fixed-width value</a>, whose width is specified by
+ the operand's extra data.</li>
+<li value="2">VBR: The field should be emitted as
+ a <a href="#variablewidth">variable-width value</a>, whose width is
+ specified by the operand's extra data.</li>
+<li value="3">Array: This field is an array of values. The array operand
+ has no extra data, but expects another operand to follow it which indicates
+ the element type of the array. When reading an array in an abbreviated
+ record, the first integer is a vbr6 that indicates the array length,
+ followed by the encoded elements of the array. An array may only occur as
+ the last operand of an abbreviation (except for the one final operand that
+ gives the array's type).</li>
+<li value="4">Char6: This field should be emitted as
+ a <a href="#char6">char6-encoded value</a>. This operand type takes no
+ extra data.</li>
+<li value="5">Blob: This field is emitted as a vbr6, followed by padding to a
+ 32-bit boundary (for alignment) and an array of 8-bit objects. The array of
+ bytes is further followed by tail padding to ensure that its total length is
+ a multiple of 4 bytes. This makes it very efficient for the reader to
+ decode the data without having to make a copy of it: it can use a pointer to
+ the data in the mapped in file and poke directly at it. A blob may only
+ occur as the last operand of an abbreviation.</li>
+</ol>
+
+<p>
+For example, target triples in LLVM modules are encoded as a record of the
+form <tt>[TRIPLE, 'a', 'b', 'c', 'd']</tt>. Consider if the bitstream emitted
+the following abbrev entry:
+</p>
+
+<div class="doc_code">
+<pre>
+[0, Fixed, 4]
+[0, Array]
+[0, Char6]
+</pre>
+</div>
+
+<p>
+When emitting a record with this abbreviation, the above entry would be emitted
+as:
+</p>
+
+<div class="doc_code">
+<p>
+<tt>[4<sub>abbrevwidth</sub>, 2<sub>4</sub>, 4<sub>vbr6</sub>, 0<sub>6</sub>,
+1<sub>6</sub>, 2<sub>6</sub>, 3<sub>6</sub>]</tt>
+</p>
+</div>
+
+<p>These values are:</p>
+
+<ol>
+<li>The first value, 4, is the abbreviation ID for this abbreviation.</li>
+<li>The second value, 2, is the code for <tt>TRIPLE</tt> in LLVM IR files.</li>
+<li>The third value, 4, is the length of the array.</li>
+<li>The rest of the values are the char6 encoded values
+ for <tt>"abcd"</tt>.</li>
+</ol>
+
+<p>
+With this abbreviation, the triple is emitted with only 37 bits (assuming a
+abbrev id width of 3). Without the abbreviation, significantly more space would
+be required to emit the target triple. Also, because the <tt>TRIPLE</tt> value
+is not emitted as a literal in the abbreviation, the abbreviation can also be
+used for any other string value.
+</p>