From: Chris Lattner
Date: Fri, 22 Jul 2011 21:34:12 +0000 (+0000)
Subject: write the long-overdue strings section of the data structure guide.
X-Git-Url: http://plrg.eecs.uci.edu/git/?a=commitdiff_plain;h=3b4f4179cd5de4d37ace1879b240cd47a61d51e0;p=oota-llvm.git

write the long-overdue strings section of the data structure guide.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@135809 91177308-0d34-0410-b5e6-96231b3b80d8
---

diff --git a/docs/ProgrammersManual.html b/docs/ProgrammersManual.html
index 70eaddf7256..4ce446e9189 100644
--- a/docs/ProgrammersManual.html
+++ b/docs/ProgrammersManual.html
@@ -876,6 +876,9 @@
elements (but could contain many), for example, it's much better to use . Doing so avoids (relatively) expensive malloc/free calls, which dwarf the cost of adding the elements to the container.

+ + +

Sequential Containers (std::vector, std::list, etc) @@ -884,7 +887,7 @@ cost of adding the elements to the container.

There are a variety of sequential containers available for you, based on your needs. Pick the first in this section that will do what you want. - +

llvm/ADT/ArrayRef.h @@ -943,8 +946,6 @@ type, and 2) it cannot hold a null pointer.

-
-

"llvm/ADT/SmallVector.h" @@ -1209,7 +1210,6 @@ std::priority_queue, std::stack, etc. These provide simplified access to an underlying container but don't affect the cost of the container itself.

- @@ -1220,12 +1220,176 @@ underlying container but don't affect the cost of the container itself.

-TODO: const char* vs stringref vs smallstring vs std::string. Describe twine, -xref to #string_apis. +There are a variety of ways to pass around and use strings in C and C++, and +LLVM adds a few new options to choose from. Pick the first option on this list +that will do what you need; the options are ordered by their relative cost. +

+

+Note that it is generally preferred not to pass strings around as +"const char*"s. These have a number of problems, including the fact +that they cannot represent embedded nul ("\0") characters and do not have a +length available efficiently. The general replacement for 'const +char*' is StringRef.

+ +
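The two problems with 'const char*' named above can be sketched in standard C++17. This is only an illustration: std::string_view stands in for StringRef (the pointer-plus-length idea is the same, but this is not LLVM's class), and Payload, lengthViaCString, lengthViaView and lengthExample are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>

// Length through a bare 'const char*': strlen walks the bytes and stops at
// the first nul, so it costs O(n) and cannot see past an embedded nul.
inline std::size_t lengthViaCString(const char *S) { return std::strlen(S); }

// Length through a pointer-plus-length view: the length is stored, so it is
// O(1) and unaffected by nul bytes inside the range.
inline std::size_t lengthViaView(std::string_view S) { return S.size(); }

// Hypothetical sample data: six payload bytes 'a' 'b' '\0' 'c' 'd' 'e'.
inline constexpr char Payload[] = "ab\0cde";
inline constexpr std::string_view PayloadView(Payload, sizeof(Payload) - 1);

inline void lengthExample() {
  assert(lengthViaCString(Payload) == 2); // truncated at the embedded nul
  assert(lengthViaView(PayloadView) == 6); // full range is preserved
  assert(PayloadView[3] == 'c');           // bytes after the nul are reachable
}
```

Calling lengthExample() exercises both paths on the same bytes.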

For more information on choosing string containers for APIs, please see +Passing strings.

+ + + +

+ llvm/ADT/StringRef.h +

+
+

+The StringRef class is a simple value class that contains a pointer to a +character and a length, and is closely related to the ArrayRef class (but specialized for arrays of +characters). Because StringRef carries a length with it, it safely handles +strings with embedded nul characters, getting the length does not require +a strlen call, and it even has very convenient APIs for slicing and dicing the +character range that it represents. +

+ +
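As a rough illustration of "slicing and dicing" without copying, here is a sketch in standard C++17. It uses std::string_view as a stand-in for StringRef (StringRef offers comparable operations, such as its substr and split methods), and splitOnce and splitExample are hypothetical helpers, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>

// Split S at the first occurrence of Sep. Both halves are views into the
// original bytes; nothing is copied. If Sep is absent, the whole input is
// returned as the first half and the second half is empty.
inline std::pair<std::string_view, std::string_view>
splitOnce(std::string_view S, char Sep) {
  std::size_t Pos = S.find(Sep);
  if (Pos == std::string_view::npos)
    return {S, std::string_view()};
  return {S.substr(0, Pos), S.substr(Pos + 1)};
}

inline void splitExample() {
  std::string_view Input = "opt=3";
  auto Parts = splitOnce(Input, '=');
  assert(Parts.first == "opt");
  assert(Parts.second == "3");
  // The first half aliases the input storage rather than owning a copy.
  assert(Parts.first.data() == Input.data());
}
```

Because the results alias the input, they are only valid while the input bytes stay alive.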

+StringRef is ideal for passing simple strings around that are known to be live, +either because they are C string literals, std::strings, C arrays, or +SmallVectors. Each of these cases has an efficient implicit conversion to +StringRef, which doesn't result in a dynamic strlen being executed. +

+ +

StringRef has a few major limitations which make more powerful string +containers useful:

+ +
    +
  1. You cannot directly convert a StringRef to a 'const char*' because there is +no way to add a trailing nul (unlike the .c_str() method on various stronger +classes).
  2. StringRef doesn't own or keep alive the underlying string bytes. +As such it can easily lead to dangling pointers, and is not suitable for +embedding in data structures in most cases (instead, use a std::string or +something like that).
  3. For the same reason, StringRef cannot be used as the return value of a +method if the method "computes" the result string. Instead, use +std::string.
  4. StringRef does not allow you to mutate the pointed-to string bytes, and because it +doesn't own the string, it doesn't allow you to insert or remove bytes from +the range either. For editing operations like this, it interoperates with the +Twine class.
+ +

Because of its strengths and limitations, it is very common for a function to +take a StringRef and for a method on an object to return a StringRef that +points into some string that it owns.

+
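The ownership rule behind that convention can be sketched as follows, again in standard C++17 with std::string_view standing in for StringRef. NamedThing, getName, getQualifiedName and ownershipExample are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <string>
#include <string_view>

class NamedThing {
  std::string Name; // owns the bytes for the lifetime of the object

public:
  // Taking a view as a parameter is cheap; the bytes are copied only
  // because this object needs to own them.
  explicit NamedThing(std::string_view N) : Name(N) {}

  // Fine: the returned view points into storage owned by *this, so it
  // stays valid as long as the object is alive and Name is not modified.
  std::string_view getName() const { return Name; }

  // The result is computed, so no existing storage holds it. Return an
  // owning std::string; a view here would dangle.
  std::string getQualifiedName(std::string_view Scope) const {
    std::string Result(Scope);
    Result += "::";
    Result += Name;
    return Result;
  }
};

inline void ownershipExample() {
  NamedThing T("foo");
  assert(T.getName() == "foo");
  assert(T.getQualifiedName("ns") == "ns::foo");
}
```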
- + + +

+ llvm/ADT/Twine.h +

+ +
+

+ The Twine class is used as an intermediary datatype for APIs that want to take + a string that can be constructed inline with a series of concatenations. + Twine works by forming recursive instances of the Twine datatype (a simple + value object) on the stack as temporary objects, linking them together into a + tree which is then linearized when the Twine is consumed. Twine is only safe + to use as the argument to a function, and should always be a const reference, + e.g.: +

+ +
+    void foo(const Twine &T);
+    ...
+    StringRef X = ...
+    unsigned i = ...
+    foo(X + "." + Twine(i));
+  
+ +

This example forms a string like "blarg.42" by concatenating the values + together, and does not form intermediate strings containing "blarg" or + "blarg.". +

+ +
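To make the "tree of temporaries, linearized on consumption" idea concrete, here is a deliberately tiny sketch in standard C++17. MiniTwine is a hypothetical toy, not LLVM's Twine: it only handles string leaves, and consume and twineExample are made-up names.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// A node is either a leaf (a view of some bytes) or a concatenation that
// points at two other nodes. Nothing is copied when nodes are built.
struct MiniTwine {
  const MiniTwine *LHS = nullptr;
  const MiniTwine *RHS = nullptr;
  std::string_view Leaf;

  MiniTwine(const char *S) : Leaf(S) {}
  MiniTwine(std::string_view S) : Leaf(S) {}
  MiniTwine(const MiniTwine &L, const MiniTwine &R) : LHS(&L), RHS(&R) {}

  // Linearize the tree by walking it left to right.
  void appendTo(std::string &Out) const {
    if (LHS) {
      LHS->appendTo(Out);
      RHS->appendTo(Out);
    } else {
      Out.append(Leaf);
    }
  }
};

// Concatenation only records pointers to its operands.
inline MiniTwine operator+(const MiniTwine &L, const MiniTwine &R) {
  return MiniTwine(L, R);
}

// The consumer takes a const reference and flattens the tree exactly once.
inline std::string consume(const MiniTwine &T) {
  std::string Out;
  T.appendTo(Out);
  return Out;
}

inline void twineExample() {
  std::string_view X = "blarg";
  std::string Num = std::to_string(42u);
  // All temporaries live until the end of this full expression, so the
  // tree is still intact while consume() runs.
  assert(consume(MiniTwine(X) + "." + MiniTwine(Num)) == "blarg.42");
}
```

The only string built is the final one inside consume(); no "blarg" or "blarg." intermediate is created. Binding the expression to a named reference first would leave the nodes pointing at destroyed temporaries.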

Because Twine is constructed with temporary objects on the stack, and + because these instances are destroyed at the end of the current statement, + it is an inherently dangerous API. For example, this simple variant contains + undefined behavior and will probably crash:

+ +
+    void foo(const Twine &T);
+    ...
+    StringRef X = ...
+    unsigned i = ...
+    const Twine &Tmp = X + "." + Twine(i);
+    foo(Tmp);
+  
+ +

... because the temporaries are destroyed before the call. That said, + Twines are much more efficient than intermediate std::string temporaries, and + they work really well with StringRef. Just be aware of their limitations.

+ +
+ + + +

+ llvm/ADT/SmallString.h +

+ +
+ +

SmallString is a subclass of SmallVector that +adds some convenience APIs like a += that takes StringRefs. SmallString avoids +allocating memory in the case when the preallocated space is enough to hold its +data, and it falls back to general heap allocation when required. Since it owns +its data, it is very safe to use and supports full mutation of the string.

+ +

As with SmallVector, the big downside of SmallString is its sizeof. While +SmallStrings are optimized for small strings, they themselves are not particularly +small. This means that they work great for temporary scratch buffers on the +stack, but should not generally be put into the heap: it is very rare to +see a SmallString as the member of a frequently-allocated heap data structure +or returned by-value. +

+ +
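The trade-off described above (inline space first, heap only on overflow, at the price of a large object) can be sketched with a toy class in standard C++17. MiniSmallString is hypothetical and far simpler than LLVM's SmallString; it only supports appending, and smallStringExample is a made-up name.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

template <std::size_t N> class MiniSmallString {
  std::array<char, N> Inline{}; // preallocated space, part of the object
  std::vector<char> Heap;       // used only after the inline space overflows
  std::size_t Size = 0;
  bool OnHeap = false;

public:
  MiniSmallString &operator+=(std::string_view S) {
    if (!OnHeap && Size + S.size() > N) {
      // Overflow: move the existing bytes to the heap once.
      Heap.assign(Inline.begin(), Inline.begin() + Size);
      OnHeap = true;
    }
    if (OnHeap)
      Heap.insert(Heap.end(), S.begin(), S.end());
    else
      std::copy(S.begin(), S.end(), Inline.begin() + Size);
    Size += S.size();
    return *this;
  }

  bool isInline() const { return !OnHeap; }

  std::string_view str() const {
    return OnHeap ? std::string_view(Heap.data(), Size)
                  : std::string_view(Inline.data(), Size);
  }
};

// The inline buffer is paid for by every instance, used or not.
static_assert(sizeof(MiniSmallString<128>) >= 128,
              "object is at least as large as its inline buffer");

inline void smallStringExample() {
  MiniSmallString<8> S;
  S += "abc";
  assert(S.isInline() && S.str() == "abc");
  S += "defghij"; // 10 bytes in total, which no longer fits in 8
  assert(!S.isInline() && S.str() == "abcdefghij");
}
```

The static_assert is the point about sizeof: the inline capacity is part of every object, which is why such a type suits stack scratch buffers better than frequently allocated heap structures.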
+ + +

+ std::string +

+ +
+ +

The standard C++ std::string class is a very general class that (like + SmallString) owns its underlying data. sizeof(std::string) is very reasonable + so it can be embedded into heap data structures and returned by-value. + On the other hand, std::string is highly inefficient for inline editing (e.g. + concatenating a bunch of stuff together) and, because it is provided by the + standard library, its performance characteristics depend a lot on the host + standard library (e.g. libc++ and MSVC provide a highly optimized string + class, GCC contains a really slow implementation). +

+ +

The major disadvantage of std::string is that almost every operation that + makes it larger can allocate memory, which is slow. As such, it is better + to use SmallVector or Twine as a scratch buffer, but then use std::string to + persist the result.

+ + +
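The "scratch buffer first, std::string to persist" pattern might look roughly like this in standard C++. A fixed stack array stands in for the SmallVector or Twine scratch space, and Symbol, makeSymbol and persistExample are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical long-lived record that persists its name in a std::string.
struct Symbol {
  std::string Name;
};

inline Symbol makeSymbol(std::string_view Prefix, unsigned Id) {
  // Stack scratch buffer: formatting here causes no heap traffic.
  char Scratch[64];
  int Len = std::snprintf(Scratch, sizeof(Scratch), "%.*s.%u",
                          static_cast<int>(Prefix.size()), Prefix.data(), Id);
  // snprintf reports the untruncated length (or a negative value on
  // error); clamp to what actually fits in the buffer.
  if (Len < 0)
    Len = 0;
  if (Len >= static_cast<int>(sizeof(Scratch)))
    Len = static_cast<int>(sizeof(Scratch)) - 1;
  // Persist the finished text with a single std::string construction.
  return Symbol{std::string(Scratch, static_cast<std::size_t>(Len))};
}

inline void persistExample() {
  Symbol S = makeSymbol("blarg", 42);
  assert(S.Name == "blarg.42");
}
```

The std::string is constructed once, at its final size, instead of growing through several appends.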
+ + +
+